Section 06 · How do I grade the outputs?
Scoring
Code, models, or humans: what it costs to get a verdict you can trust.
The three grader families
Every verdict in your suite comes from one of three grader families, listed here in order of cost.
Deterministic graders are code. They come in two flavors. Heuristics are per-trace checks: regex matches, exact-match checks, and execution checks like "does the SQL run", "does the JSON parse", or "did the agent actually call the refund tool". Statistics are distribution checks that catch population-level drift, such as response lengths suddenly doubling, refusal rates creeping up, or tool-call counts collapsing. Deterministic graders are fast, nearly free, debuggable line by line, and perfectly reproducible. Their weakness is literal-mindedness: they grade exactly what you wrote, which is rarely exactly what you meant.
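To make the two flavors concrete, here is a minimal sketch of three heuristic checks and one statistic check. The trace shape (a list of tool-call dicts with a `name` key), the `ORD-` order ID format, and the 1.5x drift threshold are all assumptions made up for illustration, not part of any particular framework.

```python
import json
import re
from statistics import mean

# Hypothetical order ID format, chosen only for this example.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def json_parses(output: str) -> bool:
    """Execution-style heuristic: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def called_tool(tool_calls: list[dict], name: str) -> bool:
    """Heuristic: did the agent actually call the named tool?"""
    return any(call.get("name") == name for call in tool_calls)


def mentions_order_id(output: str) -> bool:
    """Regex heuristic: does the reply contain something shaped like an order ID?"""
    return ORDER_ID.search(output) is not None


def length_drifted(baseline: list[int], current: list[int],
                   max_ratio: float = 1.5) -> bool:
    """Statistic: has mean response length moved outside the allowed ratio band?"""
    ratio = mean(current) / mean(baseline)
    return ratio > max_ratio or ratio < 1 / max_ratio
```

Note the literal-mindedness in action: `mentions_order_id` passes on any string that matches the pattern, whether or not it is the right order.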
Probabilistic graders are models grading models. The two main species are trained classifiers and LLM-as-a-judge. Classifiers can be binary (toxic or not, on-topic or not) or multi-class (which of your seven failure categories does this trace fall into). With LLM-as-a-judge, you hand a model the trace plus a rubric and ask for a verdict. They handle nuance and scale to anything you can describe in a prompt, but they cost real money per trace and need human calibration before you can trust a number they emit.
Human graders are the gold standard and the bottleneck. Expensive, slow, and (the part people forget) not interchangeable. A crowdworker can grade tone; only someone with domain context can tell you whether the agent's tax answer is subtly wrong. The wrong human is just a very expensive probabilistic grader.
Flowchart (coming soon): Which grader does this judge need?
When each grader wins
Picking a grader is mostly a cost question: use the cheapest family that can actually see the property you care about.
| Family | Cost per trace | Wins when | Watch out for |
|---|---|---|---|
| Deterministic | Effectively free | The property is checkable in code: format, execution, exact values, tool calls | Overfitting to surface form; going stale when output formats change |
| Probabilistic | Cents | The property needs judgment but follows a rubric you can write down | Drift, bias toward verbose answers, uncalibrated confidence |
| Human | Dollars and days | Calibrating the other two, novel failure modes, final arbitration | The wrong humans, fatigue, disagreement between graders |
Lean deterministic whenever possible. A code-based check is the easiest grader to produce, the easiest to verify, and the only one that doesn't need a calibration loop of its own. If you can phrase a property as an assertion, do it. Save the LLM judge for properties code genuinely can't reach.
In practice, every real suite blends all three. Deterministic checks run on every trace in CI. LLM judges cover the judgment calls: tone, faithfulness, instruction-following. Humans calibrate the judges and spot-audit the results. And failure modes tend to migrate down the cost ladder over their lifetime: a human notices the problem, an LLM judge formalizes it, and once you understand the failure well enough, you compile it into fifty lines of Python and run it for free forever.
Information asymmetry beats capability asymmetry
When an LLM judge underperforms, the default instinct is to swap in a bigger model. Most of the time that's the wrong lever.
Judge quality comes from information asymmetry more than capability asymmetry. A judge that sees the golden answer, the rubric, the full trace, and the tool-call ground truth will beat a smarter judge flying blind. Concretely: a same-class model judging with a reference answer in hand usually beats a frontier model judging reference-free, and costs a fraction as much. Checking a response against a known-good answer is a far easier task than deciding from scratch whether a response is good.
This is also why reference-free judges are the hardest family to calibrate. A judge with no golden answer, no execution result, no ground truth, just a rubric and vibes, has to be roughly as good at the task as the agent it's grading. At that point you haven't built an eval; you've built a second agent and declared it the referee.
The same principle shows up elsewhere in ML. On-policy distillation works because the teacher grades the student's own outputs token by token, giving dense feedback on exactly the behavior being trained, and grading is a much easier job than generating. Supervision quality tracks what the supervisor gets to see.
So before reaching for a bigger judge, inventory the privileged context you could hand the cheaper one: the golden answer from your dataset, the execution result, the actual tool responses, the user's eventual resolution. Each one converts an open-ended judgment into a comparison. And comparisons are what models grade well.
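One way to picture that inventory is a prompt builder that includes whatever privileged context is available and tells the judge whether it is comparing or judging from scratch. This is a sketch only: `JudgeContext`, `build_judge_prompt`, the section titles, and the PASS/FAIL format are hypothetical names chosen for illustration, and no model call is made.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JudgeContext:
    """Everything we could hand the judge (hypothetical structure)."""
    rubric: str
    response: str
    golden_answer: Optional[str] = None
    execution_result: Optional[str] = None
    tool_responses: list[str] = field(default_factory=list)


def build_judge_prompt(ctx: JudgeContext) -> str:
    """Assemble a judge prompt, including each piece of privileged context if present."""
    sections = [("Rubric", ctx.rubric)]
    if ctx.golden_answer is not None:
        sections.append(("Reference answer", ctx.golden_answer))
    if ctx.execution_result is not None:
        sections.append(("Execution result", ctx.execution_result))
    if ctx.tool_responses:
        sections.append(("Tool responses", "\n".join(ctx.tool_responses)))
    has_reference = len(sections) > 1  # anything beyond the rubric
    sections.append(("Response to grade", ctx.response))

    body = "\n\n".join(f"## {title}\n{text}" for title, text in sections)
    task = (
        "Compare the response against the reference material above."
        if has_reference
        else "No reference material is available; grade against the rubric alone."
    )
    return f"{body}\n\n{task}\nAnswer PASS or FAIL with a one-sentence reason."
```

The branch on `has_reference` is the point of the sketch: with a golden answer in hand the judge's task is a comparison, and without one it falls back to open-ended judgment.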
Eval scores are samples, not truths
Run the same eval suite twice and you'll get two different numbers. Sampling temperature, nondeterministic inference backends, flaky judges, which examples you drew: all of it injects noise. The problem isn't the noise; it's reading a single pass rate as the truth instead of as one sample from a distribution.
Anthropic's "A statistical approach to model evaluations" lays out the fix, and it's ordinary statistics applied with unusual discipline. Treat each eval question as a draw from an underlying distribution, compute the standard error, and put error bars on every score you report. A dashboard that says 84% is hiding information; one that says 84% plus or minus 4 tells you what you can actually conclude.
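As a rough illustration of putting error bars on a pass rate, the sketch below uses the textbook normal approximation for a mean of pass/fail scores. It assumes the questions are independent draws; if your questions come in related clusters, this will understate the uncertainty. The function name and the example data are made up.

```python
import math


def pass_rate_with_ci(results: list[bool], z: float = 1.96) -> tuple[float, float]:
    """Return (pass rate, half-width of an approximate 95% interval)."""
    n = len(results)
    p = sum(results) / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a mean of 0/1 scores
    return p, z * se


# Hypothetical run: 84 passes out of 100 questions.
results = [True] * 84 + [False] * 16
rate, half_width = pass_rate_with_ci(results)
print(f"{rate:.0%} ± {half_width:.1%}")  # 84% ± 7.2%
```

Quadrupling the number of questions halves the interval, which is why small suites struggle to resolve small deltas.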
When comparing two variants (model A vs model B, prompt v3 vs v4), use paired comparisons: run both on the same questions and analyze the per-question differences. The bulk of the variance, mostly question difficulty, is shared between variants, so pairing cancels it, and effects that look like noise in unpaired numbers become clearly significant.
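A small sketch of why pairing helps, using invented per-question scores (say, the fraction of passes over several runs of each question). Both functions estimate the same difference; only the standard error changes. The data and function names are illustrative assumptions.

```python
import math
from statistics import mean, stdev


def paired_difference(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Mean per-question difference (B minus A) and its standard error."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same questions for both variants")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return mean(diffs), stdev(diffs) / math.sqrt(len(diffs))


def unpaired_difference(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Difference of means and its standard error, ignoring the pairing."""
    se = math.sqrt(stdev(scores_a) ** 2 / len(scores_a)
                   + stdev(scores_b) ** 2 / len(scores_b))
    return mean(scores_b) - mean(scores_a), se


# Made-up scores: questions vary widely in difficulty, B is slightly better on each.
scores_a = [0.10, 0.30, 0.50, 0.70, 0.90]
scores_b = [0.20, 0.35, 0.60, 0.75, 1.00]

print("paired:  ", paired_difference(scores_a, scores_b))
print("unpaired:", unpaired_difference(scores_a, scores_b))
```

On this toy data the estimated difference is identical, but the paired standard error is far smaller because the spread in question difficulty drops out of the per-question differences.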
The companion concept is the minimum detectable effect (MDE): the smallest real improvement your eval can reliably distinguish from noise at your sample size. If your suite has 100 examples and the error bars span 8 points, a 2-point "win" is unreadable: your instrument cannot see effects that small. Compute the MDE before sizing your dataset, not after: decide the smallest regression you need to catch, then work backwards to how many examples that requires.
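Working backwards from an MDE can be sketched with the standard two-proportion sample-size approximation. The assumptions here are mine, not the source's: an unpaired comparison of two independent pass rates near `p`, a two-sided 5% significance level, and 80% power. A paired design on shared questions would typically need fewer examples.

```python
import math
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """z for the two-sided significance level plus z for the desired power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def minimum_detectable_effect(n: int, p: float,
                              alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest pass-rate difference detectable with n examples per variant."""
    return _z_sum(alpha, power) * math.sqrt(2 * p * (1 - p) / n)


def required_examples(mde: float, p: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Examples per variant needed to detect a difference of size mde."""
    return math.ceil(2 * p * (1 - p) * (_z_sum(alpha, power) / mde) ** 2)


print(f"MDE at n=100, p=0.84: {minimum_detectable_effect(100, 0.84):.1%}")
print(f"Examples needed for a 2-point MDE: {required_examples(0.02, 0.84)}")
```

Under these assumptions a 100-example suite cannot resolve anything close to a 2-point change, and catching one takes thousands of examples, not hundreds.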
Pass@K vs Pass^K
Two metrics, one keystroke apart, measuring opposite things.
Pass@K asks: out of K attempts, did at least one pass? It rises as K grows, and it measures the capability ceiling: can the system do this at all, given retries? It's the right lens for workflows with selection built in: code generation with reranking, best-of-n sampling, anything where a human picks the winner.
Pass^K asks: did all K attempts pass? It falls as K grows, and it measures reliability. The arithmetic is brutal. A step that passes 90% of the time gives you Pass@3 of 99.9%, and Pass^10 of about 35%. The same system is simultaneously "nearly perfect" and "fails most users," depending on which metric you read.
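The arithmetic is short enough to check directly. This sketch assumes each attempt is independent with the same per-attempt pass probability `p`, which real systems only approximate; the function names are just labels for the two formulas.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts pass (Pass^K)."""
    return p ** k


for k in (1, 3, 10):
    print(f"k={k:>2}  Pass@K={pass_at_k(0.9, k):.1%}  Pass^K={pass_hat_k(0.9, k):.1%}")
```

At `k=1` the two metrics coincide; every additional attempt pushes them further apart in opposite directions.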
Agents in production need Pass^K thinking. A hotel-booking agent that's right 95% of the time per booking still wrecks one trip in twenty, and a user who books weekly should expect a failure inside six months. Multi-step workflows compound the same way internally: ten sequential steps at 95% each multiply out to roughly 60% end-to-end, before anyone retries anything.
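The compounding in both examples can be reproduced in a few lines, again assuming independent steps and independent bookings. The 26-week horizon stands in for "six months" and is my choice of number.

```python
import math


def end_to_end_success(step_rates: list[float]) -> float:
    """Success rate of a workflow whose steps must all succeed."""
    return math.prod(step_rates)


def p_failure_within(per_use_success: float, uses: int) -> float:
    """Probability of at least one failure across a number of independent uses."""
    return 1 - per_use_success ** uses


# Ten sequential steps at 95% each.
print(f"end-to-end: {end_to_end_success([0.95] * 10):.1%}")

# Weekly bookings at 95% each over roughly six months (26 weeks).
print(f"at least one failed booking: {p_failure_within(0.95, 26):.1%}")
```

With these numbers the ten-step workflow lands just under 60%, and the weekly booker has roughly a three-in-four chance of hitting at least one failure within the half year.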
Report both. Pass@K tells you whether the capability exists somewhere in the distribution, whether more prompting, tools, or selection could harvest it. Pass^K tells you whether to ship. Demos run on Pass@K; production runs on Pass^K, and confusing the two is how a flawless demo becomes a support-ticket queue.
Pin everything
An eval you can't reproduce is a rumor with a decimal point. Before you trust any score delta, pin every input that could move the number:
- Model versions. Exact snapshot ids, never floating aliases. Providers swap what "latest" points to, and your three-point regression turns out to be their upgrade.
- Sampling parameters. Temperature, top-p, and seeds where the API offers them. A judge at temperature 1.0 is a different judge every run.
- Prompts. Agent prompts and judge prompts, version-controlled together. A one-line rubric tweak can shift pass rates as much as a model swap.
- Datasets. Version your golden set. Quietly appending ten hard examples moves every historical number and nobody remembers why.
- Environments. Sandbox images, API fixtures, mocked tool responses, anything the agent touches during the run.
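One lightweight way to enforce the list above is to record every pinned input in a single manifest and stamp each score with its fingerprint. The sketch below is an assumption about how you might structure that: `EvalManifest`, its field names, and every value in the example are hypothetical placeholders, not real model ids or image tags.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha(text: str) -> str:
    """Short content hash, used here to pin prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalManifest:
    """Every input that could move the number (hypothetical field set)."""
    agent_model: str
    judge_model: str
    temperature: float
    top_p: float
    seed: int
    agent_prompt_sha: str
    judge_prompt_sha: str
    dataset_version: str
    sandbox_image: str

    def fingerprint(self) -> str:
        """Stable hash of the whole manifest; store it next to every score."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
manifest = EvalManifest(
    agent_model="agent-model-snapshot-id",
    judge_model="judge-model-snapshot-id",
    temperature=0.0,
    top_p=1.0,
    seed=1234,
    agent_prompt_sha=sha("You are a booking agent..."),
    judge_prompt_sha=sha("Grade the response against the rubric..."),
    dataset_version="golden-v7",
    sandbox_image="sandbox-image-digest",
)
print(manifest.fingerprint())
```

Two scores are only comparable when their fingerprints match; when they differ, the manifest diff shows which input moved.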
The judge deserves extra paranoia. Upgrade the judge model and every historical score silently changes meaning, because the grader moved while the work stayed still. When you do upgrade, rerun the new judge over frozen historical traces and re-baseline before comparing anything across the boundary.
The test is simple: rerun last month's eval and get last month's number, within error bars. If you can't, you can no longer distinguish model regressions from harness drift, and your trend charts are archaeology rather than measurement.
If you remember nothing else
- 01 Three grader families in cost order: code, models, humans. Use the cheapest one that can see the property.
- 02 Lean deterministic wherever a property can be phrased as an assertion; real suites blend all three.
- 03 Judge quality is information asymmetry, not capability asymmetry. Hand the judge the answer key before buying a bigger model.
- 04 Eval scores are samples, not truths. Error bars and paired comparisons before celebrating, and know your minimum detectable effect.
- 05 Pass@K measures whether it can; Pass^K measures whether it will. Production agents live and die on Pass^K.
- 06 Pin model versions, sampling parameters, prompts, datasets, and environments. An eval you cannot reproduce is a rumor.