How I Actually Evaluate LLMs for Production
Every few months I go through the same ritual: a new model drops, the benchmarks look impressive, someone on the team sends me a link with "have you seen this?", and I have to decide whether it's worth running through a proper evaluation or whether I can just nod and move on. I've gotten reasonably good at the ritual. Here's how it actually goes.
I want to be upfront that this isn't a rubric I hand to clients as a formal document. It's more like a set of instincts I've developed from doing this enough times that I've burned myself in predictable ways. Some of what follows is process. Some of it is scar tissue.
The leaderboard problem
LLM benchmarks are getting gamed in ways that make them increasingly useless for production decision-making. I don't mean that maliciously — it's partly just a natural consequence of optimizing for measurable targets. But I've had models that looked spectacular on MMLU and HELM crumble embarrassingly when we put them on our actual workflows: mixed code, policy language, and free-form reasoning in the same prompt. Benchmark performance doesn't transfer cleanly to domain-specific production tasks, and by now I expect the gap rather than being surprised by it.
So the first thing I do when evaluating a new model is set aside the leaderboard position entirely. Maybe I glance at it to orient myself, but then I close the tab and go build tests on our actual data.
Does it actually fit our task?
This sounds obvious. It's not, apparently, because I keep seeing teams skip it.
We have five or six representative workflows that I use as a sanity check for any new model. They're drawn from real production traffic — anonymized — and they cover the distribution of things we actually ask models to do: document extraction with structured output, multi-step reasoning over policy language, code generation with domain-specific constraints, and a few nastier edge cases that I've collected from incidents. If a model can handle those workflows competently, zero-shot or with minimal prompting, it earns a spot in the deeper evaluation. If it can't, I stop there.
I'm particularly interested in whether the model can maintain reasoning coherence when the prompt mixes modalities — code plus natural language plus structured data in one request. A lot of models do fine on clean single-modality inputs and fall apart on the messy real-world stuff. The messy real-world stuff is what we actually have.
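To make that concrete, here's roughly what the sanity-check harness looks like. This is a minimal sketch: call_model stands in for whichever candidate API is under test, and the workflow files and pass criteria shown are illustrative, not our actual suite.

```python
# Minimal sketch of the workflow sanity check. call_model(prompt) -> str is an
# assumed wrapper around the candidate model; suite layout and checks are illustrative.
import json
from pathlib import Path

def passes(output: str, checks: dict) -> bool:
    """Cheap structural checks: valid JSON if required, plus required substrings."""
    if checks.get("must_be_json"):
        try:
            json.loads(output)
        except ValueError:
            return False
    return all(s in output for s in checks.get("must_contain", []))

def run_sanity_suite(call_model, suite_dir: str = "eval/workflows") -> dict:
    """Run each representative workflow once, zero-shot, and report pass/fail."""
    results = {}
    for case_file in sorted(Path(suite_dir).glob("*.json")):
        case = json.loads(case_file.read_text())
        output = call_model(case["prompt"])  # candidate model under test
        results[case_file.stem] = passes(output, case["checks"])
    return results
```

A model that fails several of these doesn't move on to the deeper evaluation; that's the whole point of keeping the suite small and representative.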
Latency — the thing people underweight
I've watched teams make model selections based entirely on quality benchmarks and then get blindsided by latency in production. A model that takes 4 seconds to respond under normal load might be completely unusable at peak traffic, or when retrieval fills the context window, or when the GPU pool is cold on startup. These aren't edge cases. They happen regularly.
I want to see P95 latency under burst conditions, not median latency on a clean benchmark prompt. I want to know what happens when you hit the model with a 100k-token context. I want to know if it degrades gracefully or falls off a cliff. There's a class of models I've tested where performance is fine at low load and catastrophic at anything resembling production volume — those models tend to have impressive quality numbers and get selected anyway by teams who didn't test under realistic conditions.
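The measurement itself doesn't need to be sophisticated. A rough sketch, again assuming a call_model wrapper; the concurrency level and prompt mix are illustrative, not a recommendation.

```python
# Rough burst-latency probe, not a load-testing tool. call_model is an assumed
# wrapper around the candidate API; tune concurrency to your expected peak.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def _timed_call(call_model, prompt: str) -> float:
    start = time.perf_counter()
    call_model(prompt)
    return time.perf_counter() - start

def burst_latency(call_model, prompts: list[str], concurrency: int = 32) -> dict:
    """Fire a burst of concurrent requests and report median vs tail latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda p: _timed_call(call_model, p), prompts))
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[18],  # 95th percentile cut point
        "worst_s": max(latencies),
    }
```

Run it with the long-context prompts too, not just the short clean ones; the P95 for the two cases is often very different.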
I'm probably overly paranoid about this. But I've been in the 3am incident where a model selection decision from six months earlier was the root cause, and that experience sticks with you.
The safety and auditability question is harder than it sounds
For regulated domains — financial services, healthcare-adjacent work, compliance tooling — you can't just pick the model with the best quality score. You need to understand what the guardrails actually do and whether they're configurable. Overzealous refusal policies that block legitimate internal workflows are a real operational problem. I've seen teams build shadow systems to work around overly restrictive models because modifying the guardrails wasn't possible. That's not a solution, it's a liability.
I need to be able to inspect why a model refused something. I need access to raw logs. I need red-team reports from the vendor that are specific enough to be actionable, not marketing PDFs. This is an area where vendors vary enormously and where the gap between what's promised and what's actually available tends to be largest.
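What I mean by "inspect why" is roughly this level of record-keeping per call. The field names here are assumptions rather than any specific vendor's schema; the point is that refusals land as structured data you can query, not something you reconstruct from chat transcripts after the fact.

```python
# Illustrative audit record per model call. Field names and the availability of a
# machine-readable refusal/finish signal are assumptions, not a vendor's actual API.
import json
import time
import uuid

def audit_record(prompt: str, response_text: str, finish_reason: str,
                 guardrail_triggered: bool, policy_tag: str | None) -> dict:
    return {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_chars": len(prompt),        # store lengths, not raw content, where required
        "finish_reason": finish_reason,     # e.g. "stop", "length", "content_filter"
        "refused": guardrail_triggered,
        "policy_tag": policy_tag,           # which policy fired, if the vendor exposes it
        "response_chars": len(response_text),
    }

def log_audit(record: dict, path: str = "model_calls.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```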
I might be wrong about how much this matters for some use cases. If you're building something genuinely low-stakes, maybe the auditability requirements are lighter. But in my experience, "low-stakes" has a way of gradually becoming "actually pretty important" as a product matures, and retrofitting audit infrastructure is painful.
Context handling and retrieval
Context window size is one of those numbers that vendors advertise prominently and that means less than it seems. A model might technically support 128k tokens but produce noticeably worse outputs when the context is full versus when it's under 30k. I've seen this repeatedly — performance that degrades non-linearly as context fills up, with no indication from the model that it's struggling.
When I'm evaluating context handling, I test with genuinely long inputs — the kind of documents we'd actually feed through in production — and I compare outputs at different context lengths. I also want to understand the truncation behavior: when the context is too long, does the model tell you, silently truncate, or do something weirder? Silent truncation with no acknowledgment has caused real incidents on projects I've been involved with.
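The test itself is simple: hold the task constant and grow the context. A sketch, where the scoring function and the count_tokens helper are assumptions; in practice, count with the vendor's own tokenizer.

```python
# Context-degradation sketch: same document, same question, padded to different
# context sizes, scored the same way. score() and count_tokens() are assumed helpers.
def pad_context(document: str, filler_docs: list[str], target_tokens: int,
                count_tokens) -> str:
    """Grow the context toward target_tokens by appending realistic filler documents."""
    context = document
    for filler in filler_docs:
        if count_tokens(context) >= target_tokens:
            break
        context += "\n\n" + filler
    return context

def degradation_curve(call_model, score, document, question, filler_docs,
                      count_tokens, sizes=(8_000, 32_000, 64_000, 120_000)):
    """Return the task score at several context sizes so non-linear drop-off is visible."""
    results = {}
    for size in sizes:
        context = pad_context(document, filler_docs, size, count_tokens)
        answer = call_model(f"{context}\n\nQuestion: {question}")
        results[size] = score(answer)
    return results
```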
What customization actually looks like
Fine-tuning support varies wildly. Some vendors offer it as a first-class API; others make it a bespoke enterprise negotiation. I care about: can we adapt the model to our domain without a six-month vendor engagement? Can we version our fine-tuned models? When the base model gets updated, does our fine-tune break, and how do we find out?
The last question is more important than it sounds. I've had the experience of waking up to a silent model update — vendor rotated weights without announcement — and watching accuracy on our compliance workflows drop before anyone caught it. Monthly regression suites running against production transcripts are now a standing practice for this reason.
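The regression suite itself is nothing clever. A sketch, assuming a directory of anonymized transcripts with known-good expected fields; the paths, threshold, and extract_fields helper are illustrative.

```python
# Standing regression check against anonymized production transcripts.
# call_model and extract_fields are assumed helpers; the accuracy floor is illustrative.
import json
from pathlib import Path

def run_regression(call_model, extract_fields,
                   transcript_dir: str = "regression/transcripts",
                   baseline_accuracy: float = 0.97) -> float:
    cases = [json.loads(p.read_text()) for p in Path(transcript_dir).glob("*.json")]
    if not cases:
        raise RuntimeError("no regression cases found")
    correct = sum(
        1 for case in cases
        if extract_fields(call_model(case["prompt"])) == case["expected_fields"]
    )
    accuracy = correct / len(cases)
    # Alert when a silent vendor update pushes accuracy below the agreed floor.
    if accuracy < baseline_accuracy:
        raise RuntimeError(f"Regression: accuracy {accuracy:.2%} below {baseline_accuracy:.0%}")
    return accuracy
```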
The model council
About a year ago, on a project that was burning through model costs faster than expected, a colleague suggested we set up a regular meeting — product, infra, legal, and whoever was running the AI systems — to review model performance, cost, and any incidents. It sounded like overhead at the time. It turned out to be one of the better process decisions we made.
We called it the model council, half-jokingly. The format was simple: fifteen minutes every two weeks, structured around three questions. Is the model performing within expected parameters? Has anything changed — vendor updates, new incidents, cost anomalies? Is there a pending decision — upgrade, rollback, fine-tuning cycle — that needs a call? The meeting forced us to look at the telemetry regularly instead of only when something broke. Several small degradations we caught in that meeting would have become larger problems if we'd waited for an incident to surface them.
The "legal" piece sounds bureaucratic, but it ended up being genuinely useful. There were contract clauses and data retention questions that the technical team wouldn't have thought to ask about, and catching them in a regular review was much better than catching them during a procurement audit.
Vendor diligence — the commercial side
Technical performance doesn't matter if the commercial relationship falls apart. The things I check:
Data governance: SOC 2, regional tenancy, and explicit retention windows for inference and fine-tuning data. Not assurances — documented policies I can point to. I've had vendors tell me verbally that data isn't used for training and then produce contracts with language that said otherwise. Read the contracts.
Incident transparency: what does the vendor's outage communication look like historically? Not their SLA document — their actual incident response record. Some vendors have excellent transparency; some go dark during incidents and post a postmortem a week later that doesn't tell you much.
Exit strategy: can we export fine-tuned model derivatives? Are there contract clauses that make it expensive or complicated to migrate to a different vendor or an open-source alternative? I've seen lock-in provisions buried in contracts that became very relevant when a vendor changed pricing. Plan the exit before you sign.
What I actually look for day-to-day
Once a model is in production, I keep latency, refusal rate, hallucination tags, and spend in the same observability dashboards as the rest of the backend. Refusal rate and spend together tell you a lot — a climbing refusal rate often means prompts are drifting in ways that trigger guardrails, which is worth investigating. Climbing cost without a proportional climb in usage usually means something has changed in context length or token efficiency.
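Wiring that up doesn't require anything model-specific. A sketch of the per-request metrics emission, where emit() is a stand-in for whatever metrics sink you already run: StatsD, a Prometheus client, or plain structured logs.

```python
# Per-request model telemetry, emitted alongside the rest of the backend metrics.
# emit(name, value) is an assumed sink; pricing is simplified to a single rate.
def record_model_call(emit, latency_s: float, refused: bool,
                      hallucination_flag: bool, prompt_tokens: int,
                      completion_tokens: int, usd_per_1k_tokens: float) -> None:
    emit("model.latency_seconds", latency_s)
    emit("model.refusal", 1 if refused else 0)
    emit("model.hallucination_tag", 1 if hallucination_flag else 0)
    emit("model.tokens.prompt", prompt_tokens)
    emit("model.tokens.completion", completion_tokens)
    # Spend tracked per request so cost drift shows up next to usage, not on the invoice.
    emit("model.cost_usd", (prompt_tokens + completion_tokens) / 1000 * usd_per_1k_tokens)
```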
Shadow deployments, where anonymized production traffic replays through a candidate model, are the best way I've found to quantify the real switching cost before you commit to a migration. They're more work to set up than a benchmark, but the signal is much more relevant to your actual use case.
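A minimal version of the replay loop, assuming you already capture anonymized prompts and production outputs as JSONL; the comparison function is where the real work lives and is deliberately left open here.

```python
# Offline shadow replay: sampled production requests go to the candidate model and
# get compared against what production actually returned. File layout and the
# compare() scorer are assumptions.
import json

def shadow_replay(call_candidate, compare, traffic_file: str = "shadow/traffic.jsonl",
                  sample_every: int = 10) -> float:
    """Replay every Nth captured request through the candidate; return mean agreement."""
    scores = []
    with open(traffic_file) as f:
        for i, line in enumerate(f):
            if i % sample_every:
                continue
            record = json.loads(line)  # {"prompt": ..., "production_output": ...}
            candidate_output = call_candidate(record["prompt"])
            scores.append(compare(candidate_output, record["production_output"]))
    return sum(scores) / len(scores)
```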
None of this is exotic. It's mostly just being systematic about things that are easy to skip when you're moving fast. The models that looked like clear choices on benchmarks haven't always been the right choices in production. The evaluation process I've described above has caught that gap enough times that I've stopped treating it as optional.