Section 01 · What are you measuring?
Evaluations
Why AI products can't be tested like software, and what to do about it.
Why you can't just write tests
Traditional software is deterministic: click the same buttons in the same order and you land in the same place, every time. That property is what makes testing work. Assert that input X produces output Y, run it in CI, ship with some confidence.
AI products break this contract twice. The inputs are unbounded: users will type things you never imagined, in languages you didn't plan for, with goals you didn't design around. And the outputs are unbounded too: the same prompt can produce a different response on every run, and "different" sometimes means "subtly wrong in a way you haven't seen before." There is no assertEquals for "the agent handled that reasonably."
Evals are how you get the testing contract back. An eval takes a real or realistic input, runs your system on it, and applies a judge (a piece of code or a model) that decides whether the output was acceptable. String enough of those together and you have something resembling a test suite for a system you can't step through with a debugger.
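As a rough illustration of that loop (input, system, judge, verdict), here is a minimal sketch in Python. Everything in it is a stand-in invented for this example: `toy_support_bot` plays the role of your system, and `asks_before_acting` is a deliberately crude code judge. A real judge would be stricter, and might be a model rather than a function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    user_input: str


@dataclass
class EvalResult:
    case: str
    output: str
    passed: bool


def run_evals(system: Callable[[str], str],
              judge: Callable[[str, str], bool],
              cases: List[EvalCase]) -> List[EvalResult]:
    """Run the system on every case and let the judge decide pass/fail."""
    results = []
    for case in cases:
        output = system(case.user_input)
        results.append(EvalResult(case.name, output, judge(case.user_input, output)))
    return results


def pass_rate(results: List[EvalResult]) -> float:
    """Fraction of cases the judge accepted (0.0 for an empty run)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Hypothetical system under test: a toy bot, not a real model call.
def toy_support_bot(user_input: str) -> str:
    if "refund" in user_input.lower():
        return "I found two recent orders. Which one should I refund?"
    return "Your booking is cancelled."


# Hypothetical code judge: on ambiguous requests the reply should be a question.
def asks_before_acting(user_input: str, output: str) -> bool:
    return output.strip().endswith("?")


cases = [
    EvalCase("ambiguous_refund", "I want a refund"),
    EvalCase("ambiguous_cancel", "Cancel my booking"),
]
results = run_evals(toy_support_bot, asks_before_acting, cases)
```

In this toy run the refund case passes and the cancel case fails, so `pass_rate(results)` is 0.5. The point is the shape: the judge, not an equality assertion, carries the definition of acceptable.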
That's the whole pitch. Evals are the window into the black box. Without them, your quality signal is whatever users happen to complain about, which is late, noisy, and heavily weighted toward your angriest one percent.
Define 'good' before you measure anything
The trap most teams fall into is starting with metrics. Someone stands up a dashboard showing "average helpfulness: 4.1 / 5" and it feels like progress. But a number without a definition of good attached is decoration. When it drops, nobody knows what broke; when it rises, nobody knows what to keep doing.
The useful unit is the failure mode: a specific, repeatable way your product goes wrong. "The hotel agent quotes the refundable-rate cancellation policy on non-refundable bookings." "The SQL agent joins on the wrong key whenever a table has two date columns." Each of those is concrete enough to reproduce, count, and hand to an engineer with a straight face.
This reframes the entire job. You are not computing a quality score; you are building a catalog of the ways your product fails, ranked by how much each one hurts. Progress means crossing entries off the catalog and adding judges that keep them from sneaking back in.
Failure modes come from reading traces. Sit down with real transcripts of your agent working, note every moment that makes you wince, and group the winces. Ten traces in, you'll have themes. Fifty traces in, you'll have a roadmap. No metric will ever hand you that; you have to go look.
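Grouping the winces can start as something very small. The sketch below assumes you have tagged each trace by hand with failure-mode labels and assigned each label a severity from 1 to 3; the labels, the severity scale, and the count-times-severity ranking are all illustrative choices, not a prescribed method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Note:
    trace_id: str
    failure_mode: str  # a label you wrote while reading the trace


@dataclass(frozen=True)
class CatalogEntry:
    failure_mode: str
    traces_affected: int
    severity: int
    score: int  # traces_affected * severity (an assumed ranking heuristic)


def build_catalog(notes: List[Note], severity: Dict[str, int]) -> List[CatalogEntry]:
    """Count distinct traces per failure mode and rank by count * severity."""
    traces_per_mode: Dict[str, set] = {}
    for note in notes:
        traces_per_mode.setdefault(note.failure_mode, set()).add(note.trace_id)

    entries = []
    for mode, trace_ids in traces_per_mode.items():
        sev = severity.get(mode, 1)  # unknown modes default to lowest severity
        entries.append(CatalogEntry(mode, len(trace_ids), sev, len(trace_ids) * sev))

    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(entries, key=lambda e: (-e.score, e.failure_mode))


# Hypothetical notes from reading five traces.
notes = [
    Note("t1", "wrong_cancellation_policy"),
    Note("t1", "verbose_reply"),
    Note("t2", "wrong_cancellation_policy"),
    Note("t2", "wrong_cancellation_policy"),  # same trace twice counts once
    Note("t3", "verbose_reply"),
    Note("t4", "verbose_reply"),
    Note("t5", "wrong_join_key"),
]
severity = {"wrong_cancellation_policy": 3, "wrong_join_key": 3, "verbose_reply": 1}
catalog = build_catalog(notes, severity)
```

Here the cancellation-policy failure ranks first even though verbose replies show up in more traces, because it is weighted as more harmful. The weighting is where your definition of good enters.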
The six properties you can measure
Nearly everything worth knowing about a model or agent falls into one of six buckets:
- Model behavior: does the model conduct itself the way you asked, in tone, format, language, and instruction-following? Example eval: a support bot must reply in the user's language and under 150 words; a cheap judge checks both on every trace.
- Reasoning: can it work through a problem instead of pattern-matching to an answer? Example eval: given a refund request that matches two recent orders, does the agent ask which order before issuing anything?
- Agentic behavior: does the agent pick the right tools, in the right order, and recover when one fails? Example eval: the hotel agent must call the availability check before it ever calls confirm-booking.
- Knowledge: does it know, or correctly retrieve, the facts the task needs? Example eval: quiz a RAG bot with policy questions that have known answers and grade against ground truth.
- Safety: does it refuse what it should and stay inside its lane? Example eval: a health-insurance bot must route medication-dosage questions to a pharmacist rather than answering them.
- CIA: the security triad. Confidentiality: does it leak user A's data to user B? Integrity: can a prompt injection buried in a retrieved document make it take actions nobody asked for? Availability: does it spiral into tool-call loops under adversarial input? Example eval: seed retrieved pages with injection attempts and check whether the agent obeys them.
Most products need real coverage in three or four of these buckets, not all six. Knowing which buckets matter for yours is half of defining what good looks like.
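Two of the example evals above are cheap enough to write as plain code judges. The sketch below covers the 150-word limit and the tool-ordering rule. The tool names `check_availability` and `confirm_booking` are assumed for illustration, and the language check is left out because it would need a language detector.

```python
from typing import Sequence


def under_word_limit(reply: str, limit: int = 150) -> bool:
    """Model-behavior judge: the reply must be strictly under `limit` words."""
    return len(reply.split()) < limit


def availability_checked_before_booking(tool_calls: Sequence[str]) -> bool:
    """Agentic-behavior judge over the ordered tool names in one trace.

    Fails if confirm_booking appears before any check_availability call.
    A trace with no booking at all passes, since nothing was violated.
    """
    seen_check = False
    for name in tool_calls:
        if name == "check_availability":
            seen_check = True
        elif name == "confirm_booking" and not seen_check:
            return False
    return True


# Hypothetical traces, reduced to their tool-call sequences.
good_trace = ["search_hotels", "check_availability", "confirm_booking"]
bad_trace = ["search_hotels", "confirm_booking", "check_availability"]
```

Judges like these are deterministic and cost almost nothing to run, which makes them reasonable candidates for running on every trace. Properties such as tone or reasoning quality usually need a model judge instead.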
Offline and online evals
Evals split into two families based on when they run, and you need both.
Offline evals run before a change ships. You keep a fixed dataset of inputs (real traces you've collected, synthetic edge cases, regression cases from past incidents) and every candidate change gets run against it and scored by your judges. Swap the model, rewrite the system prompt, tweak a tool definition: the offline suite tells you whether things got better or worse while the stakes are still zero.
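One way to picture an offline run is as a comparison: score the current system and the candidate on the same fixed dataset, then list the judges whose pass rate dropped. The sketch below assumes judges are plain functions of `(input, output)`; the two toy systems and both judges are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

System = Callable[[str], str]
Judge = Callable[[str, str], bool]


def score(system: System, dataset: List[str], judges: Dict[str, Judge]) -> Dict[str, float]:
    """Run the system once per input, then compute a pass rate per judge."""
    outputs = [(x, system(x)) for x in dataset]
    return {
        name: sum(judge(x, y) for x, y in outputs) / len(outputs)
        for name, judge in judges.items()
    }


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> Dict[str, Tuple[float, float]]:
    """Judges whose pass rate fell by more than `tolerance`: name -> (old, new)."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }


# Hypothetical fixed dataset and systems (stand-ins for real model calls).
dataset = ["cancel my booking", "refund order 12", "change dates", "necesito ayuda"]


def baseline_system(x: str) -> str:
    return "I can help with that. Which booking?"


def candidate_system(x: str) -> str:
    if "refund" in x:
        # The new prompt makes refund replies much longer.
        return "Let me explain our refund policy in detail first. " * 3 + "Which booking?"
    return "I can help with that. Which booking?"


judges = {
    "concise": lambda x, y: len(y.split()) < 10,
    "asks_question": lambda x, y: y.strip().endswith("?"),
}

before = score(baseline_system, dataset, judges)
after = score(candidate_system, dataset, judges)
regressed = regressions(before, after)
```

In this toy case the candidate still asks its question on every input but loses conciseness on the refund case, so only `concise` shows up as regressed (1.0 down to 0.75). With a model in the loop, results vary between runs, so a small nonzero `tolerance` or repeated runs may be needed before treating a drop as real.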
Online evals run on production traffic. Judges score live traces, or a sample of them, and alert you when a failure mode starts climbing. Production is where users do things no offline dataset anticipated, and online evals are how you catch a regression before your support queue does.
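A minimal sketch of the online side, under assumptions of my own: traces are sampled deterministically by hashing a trace ID, a judge marks each sampled trace as failed or not, and an alert fires when the failure rate over a rolling window crosses a threshold. The window size, threshold, and minimum sample count are placeholder values, not recommendations.

```python
import hashlib
from collections import deque


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000 < sample_rate * 10_000


class FailureRateMonitor:
    """Tracks one failure mode over a rolling window of judged traces."""

    def __init__(self, window: int = 200, threshold: float = 0.05, min_samples: int = 50):
        self.results = deque(maxlen=window)  # True means the judge flagged a failure
        self.threshold = threshold
        self.min_samples = min_samples

    def rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, failed: bool) -> bool:
        """Record one verdict; return True if the alert condition now holds."""
        self.results.append(failed)
        if len(self.results) < self.min_samples:
            return False  # too few samples to say anything yet
        return self.rate() > self.threshold


# Tiny numbers so the behavior is easy to follow by hand.
monitor = FailureRateMonitor(window=10, threshold=0.2, min_samples=5)
alerts = [monitor.record(False) for _ in range(5)]      # five clean traces
alerts += [monitor.record(True), monitor.record(True)]  # then two failures
```

After the first failure the rate is 1/6, still under the threshold; after the second it is 2/7 and the monitor signals an alert. Hash-based sampling keeps the decision stable for a given trace, which helps when the same trace is seen by more than one process.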
The two feed each other in a loop. Online evals surface failure modes you didn't predict; those traces get pulled into the offline dataset; the offline suite then guards against that failure from then on. A team running only offline evals is confident and blind. A team running only online evals finds every regression after their users do. Offline tells you whether to ship; online tells you what to fix next. Neither answers both questions.
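The hand-off from online to offline can be as plain as appending failed production inputs to the regression dataset. The sketch below assumes online results arrive as `(input, passed)` pairs and de-duplicates by exact input text. Both are simplifications; a real pipeline would likely store full traces and scrub user data first.

```python
from typing import List, Tuple


def promote_failures(online_results: List[Tuple[str, bool]],
                     offline_dataset: List[str]) -> List[str]:
    """Return the offline dataset extended with new failing production inputs."""
    known = set(offline_dataset)
    added = []
    for user_input, passed in online_results:
        if not passed and user_input not in known:
            known.add(user_input)
            added.append(user_input)
    return offline_dataset + added


# Hypothetical data: one failure is already covered, one is new and seen twice.
offline = ["cancel my booking", "refund order 12"]
online = [
    ("refund order 12", False),
    ("book two rooms on different dates", False),
    ("book two rooms on different dates", False),
    ("change dates", True),
]
updated = promote_failures(online, offline)
```

The updated dataset gains exactly one case, the new failure, and every later offline run checks it.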
Why this guide centers on agents
Of the six buckets, agentic behavior gets the most airtime in this guide, because that's what most teams are actually shipping. The interesting AI products right now aren't single completions. They're agents: systems that take a goal, make a plan, call tools, read the results, and keep going. The hotel-booking agent, the SQL agent, the support bot that can actually issue the refund instead of apologizing about it.
Agents are also the hardest thing on the list to evaluate. A chat completion gives you one input and one output to grade. An agent gives you a trace: a multi-step transcript of model calls, tool calls, and intermediate decisions, where things can go wrong at every step, and where an early wrong turn can still produce a confident, plausible-looking final answer. The agent that books the wrong hotel politely is a worse failure than the one that errors out, and a final-answer check alone won't catch it.
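To make the wrong-hotel example concrete, the sketch below contrasts a final-answer check with a trace-level check on the same hypothetical trace. The `Trace` shape, the `confirm_booking` tool name, and the `hotel_id` argument are assumptions made for this illustration, not a real tracing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trace:
    requested_hotel_id: str  # what the user actually asked for
    tool_calls: List[ToolCall]
    final_answer: str


def final_answer_check(trace: Trace) -> bool:
    """Looks only at the last message: did the agent say it booked something?"""
    return "confirmed" in trace.final_answer.lower()


def booked_requested_hotel(trace: Trace) -> bool:
    """Looks inside the trace: exactly one booking, and for the right hotel."""
    bookings = [c for c in trace.tool_calls if c.name == "confirm_booking"]
    return len(bookings) == 1 and bookings[0].args.get("hotel_id") == trace.requested_hotel_id


# A polite failure: the agent books hotel B when the user asked for hotel A.
polite_failure = Trace(
    requested_hotel_id="hotel_a",
    tool_calls=[
        ToolCall("check_availability", {"hotel_id": "hotel_b"}),
        ToolCall("confirm_booking", {"hotel_id": "hotel_b"}),
    ],
    final_answer="Great news, your booking is confirmed!",
)
```

The final-answer check passes on this trace while the trace-level check fails, which is the gap described above: the evidence of the wrong turn lives in the tool-call arguments, not in the closing message.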
The good news is that the hard case contains all the easy ones. A single-turn classifier is just an agent with one step, so everything ahead (datasets, judges, calibration, reporting) transfers down. The rest of the spokes assume you're working with agent traces, because that's both the common case and the one where sloppy evals hurt the most.
If you remember nothing else
- 01 AI products are black boxes with unbounded inputs and outputs. Evals are the window you build into them.
- 02 Define good and bad before picking metrics. A score without a definition of good is decoration.
- 03 Failure modes are the unit of progress: specific, repeatable, countable ways your product goes wrong.
- 04 Six buckets cover what you can measure: model behavior, reasoning, agentic behavior, knowledge, safety, and CIA.
- 05 Offline evals catch the regressions you predicted before they ship; online evals catch the ones you didn't.
- 06 Agentic behavior is the common case and the hard one. If you can evaluate a ten-step trace, one step is easy.
Further reading