Section 08 · How do I actually build the judge?
Eval Design
From measurement unit to eval card: the anatomy of a judge that works.
What exactly gets judged?
Every eval starts with the same question: what object does the judge actually look at? One span: a single tool call, one generated SQL query? A whole trace: the agent's complete attempt at a task? A session spanning several tasks? A diff: the change an agent made to a codebase? This is the measurement unit: the grain decision from the grains spoke, made concrete.
Getting the unit wrong is the most common way evals fail silently. You write down "the agent books hotels correctly," then build a judge that only reads the final chat message, and miss that the agent booked the right hotel for the wrong dates, because the confirmation message looked fine. Property and unit have to match: "writes executable SQL" is a span-level property, so judge the span. "Answered the analytics question" is trace-level. "Resolved the support issue without the user re-contacting" is session-level, and no amount of squinting at a single trace will measure it.
The unit determines everything downstream: what your input cases look like, what ground truth means, and what the grader is allowed to see. That's why it's the first box in the pipeline. Change it later and you're not adjusting the eval, you're rebuilding it.
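One way to keep property and unit from drifting apart is to record the unit next to the property and refuse to grade anything at the wrong grain. The sketch below is illustrative only: `Unit`, `EvalSpec`, `JudgeInput`, and `check_unit` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    SPAN = "span"        # one tool call or one generated query
    TRACE = "trace"      # the agent's full attempt at one task
    SESSION = "session"  # several tasks in one user session
    DIFF = "diff"        # a change made to a codebase


@dataclass(frozen=True)
class EvalSpec:
    prop: str   # the property under test, in plain words
    unit: Unit  # the grain the judge must be given


@dataclass(frozen=True)
class JudgeInput:
    unit: Unit
    payload: str  # serialized span, trace, session, or diff


def check_unit(spec: EvalSpec, item: JudgeInput) -> None:
    """Refuse to grade when the judge is handed the wrong grain."""
    if item.unit is not spec.unit:
        raise ValueError(
            f"{spec.prop!r} is a {spec.unit.value}-level property, "
            f"but the judge was given a {item.unit.value}"
        )


sql_spec = EvalSpec("writes executable SQL", Unit.SPAN)
check_unit(sql_spec, JudgeInput(Unit.SPAN, "SELECT 1"))  # matching grain: no error
```

Calling `check_unit` at the top of the grading function turns a silent mismatch into a loud one.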
Where the cases come from
Cases come from three places, and a healthy dataset usually mixes all three.
- Real traces mined from production. Maximum realism, zero authoring cost, but biased toward what users already do, which means biased toward the happy path. Users quietly route around the things your agent is bad at, so the failures you most need to test are underrepresented in prod data.
- Hand-written cases. Expensive per case, but the only way to guarantee a specific scenario exists: the sold-out hotel, the ambiguous date ("next Friday" said on a Thursday), the user who changes their mind mid-booking.
- Synthetic variations. Take a real or hand-written case and perturb it: shift the dates, change the party size, swap the city for one with no availability. Cheap multiplication of coverage, as long as you spot-check that the perturbations stay realistic.
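The perturbation idea in the last bullet can be sketched as a function that takes one seed case and returns a few variants. The `BookingCase` fields and the city name `"Nowhereville"` are assumptions made up for this example; a real case would carry whatever your agent's task input looks like.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class BookingCase:
    city: str
    check_in: date
    nights: int
    party_size: int


def perturb(case: BookingCase) -> list[BookingCase]:
    """Return variants of one seed case, each changing a single field."""
    return [
        replace(case, check_in=case.check_in + timedelta(days=7)),  # shift the dates
        replace(case, party_size=case.party_size + 2),              # change the party size
        replace(case, city="Nowhereville"),  # hypothetical city with no seeded inventory
    ]


seed = BookingCase(city="Lisbon", check_in=date(2026, 3, 6), nights=2, party_size=2)
variants = perturb(seed)
```

Changing one field per variant keeps a failure attributable to that change. The spot-check for realism still has to be done by a person.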
The design principle that matters more than the source: every case should stress the property under test, and ideally nothing else. If the eval is "agent checks availability before booking," every case needs an availability wrinkle. Cases that also stress tone, currency conversion, and a flaky retrieval step don't test more; they make failures unattributable, which kills actionability.
And budget deliberately for edge cases. Sample uniformly from production and you'll get 95% happy path, and the eval will confidently report that your agent handles the easy stuff. You already knew that.
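Budgeting for edge cases can be as simple as sampling per stratum with fixed quotas instead of sampling uniformly. This is a minimal sketch under assumptions: each case is a dict with a hand-assigned `"stratum"` label, and the stratum names and quota sizes are invented for illustration.

```python
import random
from collections import defaultdict


def sample_with_quotas(cases, quotas, seed=0):
    """Pick a fixed number of cases per stratum.

    cases:  list of dicts, each with a "stratum" key
    quotas: dict mapping stratum name -> number of cases wanted
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[case["stratum"]].append(case)

    picked = []
    for stratum, want in quotas.items():
        pool = by_stratum.get(stratum, [])
        if len(pool) < want:
            # Production did not supply enough: these need hand-writing.
            raise ValueError(f"only {len(pool)} cases for {stratum!r}, need {want}")
        picked.extend(rng.sample(pool, want))
    return picked


# A pool shaped like production: mostly happy path, a few sold-out cases.
pool = (
    [{"id": i, "stratum": "happy_path"} for i in range(95)]
    + [{"id": 100 + i, "stratum": "sold_out"} for i in range(5)]
)
dataset = sample_with_quotas(pool, {"happy_path": 5, "sold_out": 5})
```

The error on a short stratum is deliberate: it shows which scenarios production cannot supply and someone has to author.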
Ground truth: golden answers, rubrics, or neither
Ground truth is whatever the grader compares the output against. Three strategies, in descending order of preference.
Golden answers: the exact expected result. The correct query output, the specific booking record, the right refund amount. Use these whenever the task has one verifiable right answer; nothing is cheaper to grade against or harder to argue with.
Rubrics: written criteria for when many outputs are acceptable. A support reply can be phrased a hundred ways, but it must state the refund deadline, must not promise compensation, and must not invent policy. The rubric pins the must-haves and leaves the phrasing free.
Reference-free: the judge grades the output with nothing to compare it to. This is a last resort, and the scoring spoke explains why: a good judge needs an information advantage over the model being graded. A reference-free judge knows nothing the policy model didn't, so it grades what it can see: fluency, confidence, formatting. That's how you end up with a judge that loves wrong answers delivered with conviction.
Two questions teams skip: who writes the ground truth, and who keeps it alive. The answer to the first is the person who would file the bug, a domain expert, not whoever had a free afternoon. The answer to the second is nobody, unless you make it someone: golden answers rot every time the product changes (new cancellation policy, new tool, new schema). Version the ground truth with the dataset and put a name next to it.
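A rough sketch of what "version the ground truth and put a name next to it" could look like in data. The record shape, the `policy_version` field, and the example values are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenAnswer:
    case_id: str
    expected: dict
    owner: str           # a named person, per the advice above
    policy_version: str  # product or policy version the answer was written against


def stale(goldens, current_policy_version):
    """Return golden answers written against an older policy version."""
    return [g for g in goldens if g.policy_version != current_policy_version]


goldens = [
    GoldenAnswer("refund-001", {"refund_amount": 120}, "dana", "2026-01"),
    GoldenAnswer("refund-002", {"refund_amount": 0}, "dana", "2026-04"),
]
needs_review = stale(goldens, current_policy_version="2026-04")
```

Running a check like this whenever the product changes gives the owner a concrete list to re-verify, instead of letting the answers rot quietly.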
Designing the grader
The scoring spoke covers which grader to pick: code, model, or human. This is about setting up whichever one you picked so it actually works.
Deterministic graders: make the check sharp and narrow. One assertion per check: the SQL executes, the JSON parses against the schema, the booking row exists with the right dates. The temptation is to pile assertions into one grader while you're in there. Resist it: ten assertions in one check is a composite score with extra steps, and when it fails you're back to grepping logs to find out why.
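To make "one assertion per check" concrete, here is a minimal sketch of the three checks named above, each reported on its own. It uses the standard-library `sqlite3` and `json` modules. The `bookings` table, its columns, and the function names are assumptions for the example, and the JSON check only tests that the text parses, since schema validation would need a validator this sketch leaves out.

```python
import json
import sqlite3


def check_sql_executes(query, conn):
    """Single check: does the query run at all?"""
    try:
        conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return False, f"sql error: {exc}"
    return True, "query executed"


def check_json_parses(raw):
    """Single check: is the output valid JSON? (No schema validation here.)"""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid json: {exc.msg}"
    return True, "json parsed"


def check_booking_dates(conn, booking_id, check_in, check_out):
    """Single check: does the booking row exist with the expected dates?"""
    row = conn.execute(
        "SELECT check_in, check_out FROM bookings WHERE id = ?", (booking_id,)
    ).fetchone()
    if row is None:
        return False, "no booking row"
    if row != (check_in, check_out):
        return False, f"dates are {row}, expected {(check_in, check_out)}"
    return True, "booking dates match"


# Hypothetical seeded environment for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, check_in TEXT, check_out TEXT)")
conn.execute("INSERT INTO bookings VALUES (1, '2026-03-06', '2026-03-08')")

# Each check gets its own named result; nothing is folded into one score.
results = {
    "sql_executes": check_sql_executes("SELECT * FROM bookings", conn),
    "json_parses": check_json_parses('{"booking_id": 1}'),
    "booking_dates": check_booking_dates(conn, 1, "2026-03-06", "2026-03-09"),
}
```

Here only `booking_dates` fails, and its message names the mismatch, so nobody has to grep logs to find out which assertion tripped.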
Human graders: brief them like new hires, not like Mechanical Turkers. That means a written rubric, three or four worked examples of passes and fails with the reasoning spelled out, and an explicit "unsure" option. Without the escape hatch, ambiguous cases get coin-flipped and your agreement numbers look worse than your rubric deserves.
LLM-as-a-judge: three rules.
- The rubric lives in the prompt. If the criteria are in your head and the prompt says "rate the quality," the judge invents its own criteria, differently on every run.
- Give the judge privileged context: the golden answer, the tool logs, the policy doc. A judge that sees only what the model saw is grading blind.
- Force a binary verdict with a required reason. Pass or fail, plus why. Scales from 1 to 10 feel more informative but mostly add noise, and the written reason is what makes a verdict falsifiable when you audit the judge later.
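The three rules can be sketched as a prompt builder plus a strict parser. Everything here is an assumption for illustration: the template wording, the JSON reply format, and `call_model`, which stands in for whatever client you use to reach the judge model and is not a real API.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading one support reply.

Rubric (every item must hold for a pass):
{rubric}

Privileged context (not shown to the model being graded):
{context}

Reply to grade:
{output}

Answer with JSON only: {{"verdict": "pass" or "fail", "reason": "<one sentence citing the evidence>"}}"""


def build_prompt(rubric, context, output):
    """Rule 1 and 2: the rubric and the privileged context go into the prompt."""
    rubric_text = "\n".join(f"- {item}" for item in rubric)
    return JUDGE_TEMPLATE.format(rubric=rubric_text, context=context, output=output)


def parse_verdict(raw):
    """Rule 3: accept only a binary verdict with a non-empty reason."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    verdict = data.get("verdict")
    reason = str(data.get("reason", "")).strip()
    if verdict not in ("pass", "fail"):
        raise ValueError(f"verdict must be 'pass' or 'fail', got {verdict!r}")
    if not reason:
        raise ValueError("verdict has no reason")
    return verdict == "pass", reason


def judge(call_model: Callable[[str], str], rubric, context, output):
    """call_model is a hypothetical function: prompt text in, reply text out."""
    return parse_verdict(call_model(build_prompt(rubric, context, output)))
```

Rejecting a reply with no reason, rather than defaulting it to a fail, keeps malformed judge output from being counted as an agent failure.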
Outcome or process?
There are two things you can grade: where the agent ended up, or how it got there.
Outcome-based grading judges the final state. Did the correct booking land in the system, with the right dates and the right room? Did the query return the right rows? How the agent got there is its own business.
Process-based grading judges the path. Did the agent check availability before booking? Did it confirm the cancellation policy with the user before charging the card?
| | Outcome | Process |
|---|---|---|
| Judges | the final state | the steps taken |
| Survives | prompt rewrites, model swaps, new strategies | very little: it pins the current implementation |
| Catches | failures that show up in the result | failures the result hides |
| A failure means | something broke, somewhere | this specific step broke |
Outcome grading is more robust: the agent can reorganize its entire approach and the eval still measures the right thing. Process grading is more actionable: when it fires, you know exactly which step to fix. The cost is brittleness: pin the path too tightly and the eval starts punishing legitimate improvements. And once step-compliance becomes the metric you hillclimb, Goodhart shows up: the agent gets optimized to perform the prescribed steps whether or not they still serve the outcome.
Default to outcome grading wherever the final state is checkable. Add process checks for the things outcomes can't see: safety-critical steps (did it confirm before charging the card?) and lucky guesses. An agent that skips the availability check and gets lucky passes every outcome eval right up until the day it doesn't.
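A small sketch of the lucky-guess case: one outcome check on the final state and one process check on the trace, kept as separate graders. The trace format (a list of dicts with `tool` and `hotel_id`) and the tool names are assumptions for the example.

```python
def outcome_ok(final_bookings, expected):
    """Outcome check: the expected booking exists in the final state."""
    return expected in final_bookings


def checked_before_booking(trace):
    """Process check: every book call follows an availability check for that hotel."""
    checked = set()
    for step in trace:
        if step["tool"] == "check_availability":
            checked.add(step["hotel_id"])
        elif step["tool"] == "book" and step["hotel_id"] not in checked:
            return False
    return True


expected = {"hotel_id": "h42", "check_in": "2026-03-06", "nights": 2}

# The agent skipped the availability check and the room happened to be free.
lucky_trace = [{"tool": "book", "hotel_id": "h42"}]
careful_trace = [
    {"tool": "check_availability", "hotel_id": "h42"},
    {"tool": "book", "hotel_id": "h42"},
]
final_state = [expected]

lucky = (outcome_ok(final_state, expected), checked_before_booking(lucky_trace))
careful = (outcome_ok(final_state, expected), checked_before_booking(careful_trace))
```

The lucky run passes the outcome check and fails the process check, which is exactly the case the outcome eval cannot see.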
Environment vs harness
Two pieces of machinery get conflated constantly, and the conflation produces evals that lie.
The environment is the world the agent acts in: the tools it can call, the sandbox it runs in, the data it operates on. Think of the fake hotel inventory, the seeded database, the mock payment API. The harness is everything wrapped around that world: the code that spins up an episode, injects the task, enforces timeouts, collects the trace, and hands it to the grader.
The environment should be as realistic as you can afford: same tool definitions as production, data with the same mess in it. The harness should be boring: deterministic, fast, and invisible in the results.
Conflate them and you get evals that test the harness instead of the agent. The sandbox is flaky, the seed data has no hotels in the requested city, a tool times out, and the run gets recorded as the agent failing. Now your pass rate measures infrastructure weather. Anthropic's agent-eval guide flags the same trap: infrastructure flakiness produces correlated failures that look like agent regressions, and a 0% pass rate usually means a broken task, not an incapable agent. Track environment and harness problems separately from real failures: a task the world made impossible tells you nothing about the agent.
The practical test before reading any numbers: could a perfect agent pass every case? If not, fix the environment first.
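One way to keep infrastructure weather out of the pass rate is to give infrastructure problems their own status and compute the rate only over runs that were actually graded. A minimal sketch, assuming each run is a dict with a `task` and a `status` of `"pass"`, `"fail"`, or `"infra_error"` (names invented here):

```python
from collections import Counter


def summarize(runs):
    """Pass rate over graded runs only; infra errors are counted separately."""
    counts = Counter(r["status"] for r in runs)
    graded = counts["pass"] + counts["fail"]
    return {
        "pass_rate": counts["pass"] / graded if graded else None,
        "graded": graded,
        "infra_errors": counts["infra_error"],
    }


def suspect_tasks(runs):
    """Tasks that were graded but never passed: check the task before blaming the agent."""
    by_task = {}
    for r in runs:
        if r["status"] != "infra_error":
            by_task.setdefault(r["task"], []).append(r["status"] == "pass")
    return sorted(task for task, results in by_task.items() if not any(results))


runs = [
    {"task": "t1", "status": "pass"},
    {"task": "t1", "status": "fail"},
    {"task": "t2", "status": "infra_error"},
    {"task": "t2", "status": "infra_error"},
    {"task": "t3", "status": "fail"},
    {"task": "t3", "status": "fail"},
]
summary = summarize(runs)
suspects = suspect_tasks(runs)
```

A task on the `suspects` list is a prompt to ask the perfect-agent question about that task, not proof of anything on its own.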
Pressure-test it, then write the card
A judge you haven't attacked is a judge you're trusting on vibes. Before any eval gates a release, pressure-test it:
- Feed it known-good cases: traces a domain expert already blessed. Every false failure here is calibration debt you'll pay later in ignored alerts.
- Feed it known-bad cases: real failures from production, plus hand-built ones. A judge that has never seen a true failure has an unknown catch rate, which is the same as no catch rate.
- Try to fool it. A long, confident, well-formatted answer that's wrong. A terse correct one. A response that quotes the rubric back at the judge. If any of these flips the verdict, you've found what the judge actually measures.
- Hunt for shortcut features. Correlate verdicts with surface features: length, politeness, presence of a summary section. If verbosity predicts passing, you've built a verbosity detector with a quality-shaped name.
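The shortcut-feature hunt in the last bullet can start with a plain correlation between verdicts and one surface feature. The sketch below computes a Pearson correlation between response length and a 0/1 verdict (equivalent to a point-biserial correlation). The sample numbers are made up to show the mechanics, and any threshold you pick for "too correlated" is a judgment call this sketch does not settle.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Fabricated example data: response length in characters, verdict 1 = pass.
lengths = [50, 80, 400, 600, 900]
verdicts = [0, 0, 1, 1, 1]

r_length = pearson(lengths, verdicts)
```

A strong positive value is a reason to look closer, not a conviction: longer answers may be better on average. The follow-up is the fooling test above, with a long wrong answer and a terse correct one.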
Then write the eval card, a one-pager that ships with the eval, the suite's equivalent of a model card:
- What property it measures, and at what grain
- Which goal it ties back to
- Grader type and where the ground truth came from
- Dataset provenance and size
- Known blind spots: everything the pressure test exposed
- An owner, with a name
The card exists for the person six months from now staring at a pass-rate dip, deciding whether to trust the number or the model. Without it, that decision is archaeology. With it, it's a two-minute read.
A filled-in eval card: the one-page design spec for a fabricated-citation judge, showing its position, dataset, privileged context, FAST checklist, calibration stats, and changelog.
Fabricated citation check
judge.cite_fabrication · owner: evals@acme
| Field | Value |
|---|---|
| Failure mode | Response cites a URL or paper that does not resolve, or that does not support the claim it's attached to. |
| Motivating traces | |
| Unit under test | End-to-end response, post-retrieval (full RAG pipeline) |
| Privileged context | Retrieved documents + live URL resolution results |
| Dataset | 142 real traces + 60 perturbed variants, incl. 38 negatives (correct citations that look unusual) |
| Pass criteria | Fail on ≥1 fabricated citation; CI gate requires 0 fails on golden set, evidence span quoted per verdict |
| Expected base rate | ~3% of production traces fail (last measured May 2026) |
| Recalibration triggers | Model swap · judge prompt edit · base-rate drift >2x · quarterly review |
- ✓ Falsifiable: binary verdict per trace
- ✓ Actionable: quotes offending span
- ✓ Specific: one property only
- ✓ Tractable: ≈$0.004 per trace
Calibration · n=120 labels · last run 2026-05-28
| Agreement | Cohen's κ | TPR | FPR |
|---|---|---|---|
| 94% | 0.81 | 0.92 | 0.04 |
v1.3 tightened evidence quoting after FPR spike · v1.2 added 38 negatives · v1.0 initial
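For readers who want to reproduce the kind of calibration figures the card reports (agreement, Cohen's κ, TPR, FPR), here is a minimal sketch that derives them from paired human labels and judge verdicts. It assumes the positive class is "fail", meaning the judge flagged the trace. The card does not say which convention it uses, and the sample labels below are invented.

```python
def calibration(human, judge):
    """human, judge: equal-length lists of bools, True = 'fail' (positive class)."""
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    fp = sum((not h) and j for h, j in pairs)
    fn = sum(h and (not j) for h, j in pairs)
    n = len(pairs)

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal rates.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else None
    return {
        "agreement": observed,
        "kappa": kappa,
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "fpr": fp / (fp + tn) if (fp + tn) else None,
    }


# Invented labels: 5 true failures (judge catches 4), 15 clean traces (judge flags 1).
human = [True] * 5 + [False] * 15
judge = [True] * 4 + [False] + [True] + [False] * 14
stats = calibration(human, judge)
```

Reporting κ next to raw agreement matters when failures are rare: with a base rate of a few percent, a judge that passes everything already agrees with the labels most of the time.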
If you remember nothing else
- 01 Pick the measurement unit first. Property and grain have to match, and everything downstream depends on it.
- 02 Every input case should stress the property under test. Production traces alone over-represent the happy path.
- 03 Golden answers beat rubrics beat reference-free. A judge with no information advantage grades on style.
- 04 LLM judges get the rubric in the prompt, privileged context, and a forced binary verdict with a reason.
- 05 Default to outcome grading for robustness; add process checks for what the final state cannot show.
- 06 Attack every judge before trusting it, then ship it with an eval card: property, grain, grader, blind spots, owner.