Section 07 · Which judges deserve to exist?
Building Judges
Most judge candidates shouldn't make the cut. Here's the filter.
Two ways to find a judge
Every judge candidate comes from one of two places: your data or your spec.
Bottom-up identification starts from real traces. You read what your product actually did, notice the ways it failed, group those failures, and propose judges for the groups that matter. The judge exists because the failure exists. You have the receipts.
Top-down identification starts from intent. You read the PRD, the system prompt, the tool definitions, the policy doc, and ask: what is this product supposed to do, and what would it look like to do it wrong? The judge exists because the requirement exists, even if nothing has violated it yet.
Neither direction is sufficient on its own. Pure bottom-up means your eval suite is a museum of last week's incidents; you're always one step behind whatever production surprises you next. Pure top-down means you're testing the brochure instead of the product, beautifully covering requirements while users hit failure modes nobody wrote down. Bottom-up keeps you honest. Top-down keeps you ahead.
In practice the two feed each other. Error analysis surfaces a failure cluster, which makes you realize the spec was silent on something, which makes you write the requirement down, which generates a top-down judge for the next feature before it ever ships. The direction a candidate came from doesn't decide whether it deserves to exist; that is a separate filter we'll come to last. But it usually shapes what kind of judge it becomes, and that is the next thing to pin down.
Bottom-up: error analysis
The bottom-up process has a name, error analysis, and an evangelist. Hamel Husain has spent years beating one drum: look at your data. Not dashboards about your data. The actual traces.
The mechanics are unglamorous. Pull a sample of traces. Fifty to a hundred is a fine start. Read each one and write a short, open-ended note about anything wrong: "agent promised a refund it can't issue", "re-asked for the order number it already had", "quoted a price without checking availability". No taxonomy yet; just notes.
Then cluster. Similar notes become named failure modes, and suddenly you have counts. This is where intuition gets corrected: the failure everyone remembers from the angry customer email is often rare, while some quiet failure (the support bot silently dropping half of multi-part questions) turns out to be in a fifth of your traces. Counting beats vibes.
Only the clusters that matter become judge candidates. A cluster with three occurrences and no real consequence doesn't graduate. A cluster that's frequent, severe, or both becomes a real candidate, one that still has to earn its slot before it ships.
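As a rough illustration of the counting step, here is a minimal sketch in Python. It assumes each sampled trace has already been read and tagged with one failure mode during clustering (or `None` if nothing was wrong). The trace IDs, failure-mode names, and the `failure_rates` helper are all hypothetical, made up for this example.

```python
from collections import Counter

# Hypothetical output of a clustering pass: (trace_id, failure_mode).
# None means the reviewer noted nothing wrong with that trace.
labeled_notes = [
    ("t01", "dropped_part_of_question"),
    ("t02", None),
    ("t03", "promised_unavailable_refund"),
    ("t04", "dropped_part_of_question"),
    ("t05", None),
    ("t06", "reasked_known_order_number"),
    ("t07", "dropped_part_of_question"),
    ("t08", None),
    ("t09", None),
    ("t10", "dropped_part_of_question"),
]


def failure_rates(notes):
    """Return {failure_mode: share of all sampled traces}, most frequent first."""
    total = len(notes)
    counts = Counter(mode for _, mode in notes if mode is not None)
    return {mode: n / total for mode, n in counts.most_common()}


rates = failure_rates(labeled_notes)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.0%}")
```

Dividing by the full sample size, rather than by the number of failing traces, keeps the rates comparable to "a fifth of your traces" style statements. Severity still has to be judged by a person; the counts only cover the frequency half.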
The most common eval failure isn't a badly built judge. It's a suite built without this step at all. Teams write judges for the failures they imagine instead of the ones they have, then wonder why the suite is green while users churn. Reading traces is the cheapest, highest-leverage activity in all of evals, and it stays that way forever: error analysis isn't a setup phase, it's a habit.
Top-down: judges from the spec
Top-down judges are derived from what the product is supposed to do, and they're how you eval something with no production data: a new product, a new feature, a new tool the agent just grew.
The raw material is anything that encodes intent: the PRD, the system prompt, tool definitions, escalation policies, compliance rules. Walk through each commitment and turn it into a falsifiable check. A hotel-booking agent whose spec says "always confirm dates and party size before booking" and "never quote a price without checking live availability" has just handed you two judge candidates, and the product hasn't served a single user yet.
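To make "turn it into a falsifiable check" concrete, here is one possible sketch of the second hotel-booking commitment as code. The trace shape (a list of step dictionaries with `type`, `name`, and `text` keys), the `check_availability` tool name, and the currency-symbol heuristic for spotting a price quote are all assumptions for illustration, not a real product's schema.

```python
import re

# Assumption: a price quote is any message containing a currency symbol
# followed by a digit. A real product would need a more careful detector.
PRICE_PATTERN = re.compile(r"[$€£]\s?\d")


def availability_checked_before_quote(trace):
    """Pass (True) unless a price is quoted before any availability check."""
    availability_checked = False
    for step in trace:
        if step["type"] == "tool_call" and step["name"] == "check_availability":
            availability_checked = True
        elif step["type"] == "message" and PRICE_PATTERN.search(step["text"]):
            if not availability_checked:
                return False
    return True


# Two hypothetical traces: one follows the spec, one violates it.
good_trace = [
    {"type": "tool_call", "name": "check_availability"},
    {"type": "message", "text": "That room is $180 per night."},
]
bad_trace = [
    {"type": "message", "text": "Rooms start at $180."},
    {"type": "tool_call", "name": "check_availability"},
]

print(availability_checked_before_quote(good_trace))
print(availability_checked_before_quote(bad_trace))
```

Neither trace came from production. The requirement alone was enough to write a trace that fails, which is what makes the check falsifiable from the spec.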
Top-down judges have a distinct character. They're falsifiable from the spec: you don't need an observed failure to define what failure looks like, because the requirement itself defines it. That makes them your early-warning system: they catch the violation the first time it happens, instead of the fifty-first time when error analysis finally surfaces the cluster.
Their weakness is the mirror image. Because they're born from imagination rather than evidence, some of them will guard against failures that never materialize. That's not free: every judge costs build time, run cost, and calibration attention. This is why variance is one of the gates a candidate has to clear, and why the question has to be asked again after launch: a top-down judge that has never once failed in three months of real traffic is a candidate for retirement or a tighter rubric, not a trophy.
So run both motions: spec-driven judges before launch, error analysis forever after. Every candidate from either direction still has to earn its place, but first it needs a label: what kind of judge is it?
Regression judges and capability judges
Every judge worth building lands in one of two classes.
A regression judge prevents an identified, fixed error from coming back. It's almost always bottom-up: error analysis found the failure, somebody fixed it, and the judge now stands guard. The support bot used to promise refunds outside policy; you patched the prompt; the regression judge makes sure no future prompt edit, model swap, or retrieval change quietly reintroduces it. Its natural state is passing. That's not the useless kind of low variance: you can name the exact historical trace that would flip it, so there's a real failure it's still watching for.
A capability judge asserts something the product should be able to do, falsifiable from the spec. It's almost always top-down: the booking agent must confirm party size, the SQL agent must produce queries that execute against the real schema. Capability judges are where headroom lives. They're the ones you hillclimb against.
Both classes render binary verdicts: pass or fail, no 1-to-5 scales (the goals spoke makes the full case). Binary verdicts force falsifiability: a 3 out of 5 can't be proven wrong, but a fail on "did the agent check availability before quoting" can be settled by reading the trace. They also make aggregates mean something. An average score of 3.7 hides everything; an 84% pass rate means 16% of traces failed in a specific way you can pull up and read, one by one.
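The aggregation argument can be sketched in a few lines. The verdict records below are invented for illustration (25 hypothetical traces, 4 of them failing), and `pass_rate` and `failing_trace_ids` are assumed helper names, not part of any particular eval framework.

```python
# Hypothetical verdicts from one binary judge over 25 traces.
failed_ids = {"t03", "t11", "t17", "t22"}
verdicts = [
    {"trace_id": f"t{i:02d}", "passed": f"t{i:02d}" not in failed_ids}
    for i in range(1, 26)
]


def pass_rate(verdicts):
    """Share of traces with a passing verdict."""
    return sum(v["passed"] for v in verdicts) / len(verdicts)


def failing_trace_ids(verdicts):
    """The concrete traces behind the failure share, ready to be read one by one."""
    return [v["trace_id"] for v in verdicts if not v["passed"]]


print(f"pass rate: {pass_rate(verdicts):.0%}")
print("read these:", failing_trace_ids(verdicts))
```

The second function is the part a 1-to-5 average cannot offer: every point of the failure rate maps back to a specific trace.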
The classification isn't bookkeeping. It tells you how to interpret each judge's pass rate, and it's one axis of the matrix that shows you the shape of your whole suite.
The Grain x Goal matrix
You weigh judge candidates one at a time, but suites fail as a whole. The tool for seeing the whole is a grid: measurement grain on one axis, goal on the other.
Grain is the unit a judge renders its verdict on, the altitude decision from the grains spoke:
| Grain | What it covers | Example judge |
|---|---|---|
| Session | A whole multi-turn conversation | Did the user's issue get resolved by the end? |
| Trace | One full run for a single user turn | Did the agent's answer cite a retrieved document? |
| Span | One step inside a trace | Was the tool call schema-valid? |
| Event | One action inside a span | Does the SQL the agent wrote execute? |
Goal is the classification from the previous section: capability or regression.
Now plot every judge in your suite on this grid. Two pathologies jump out immediately.
Gaps. An empty cell is uncovered surface. No regression judges at the span level means a tool integration can silently break and the first you'll hear of it is a vague dip in coarser-grained scores, days later, with no pointer to the cause.
Crowding. Five session-level capability judges all sitting near 100% is ceremony, not measurement. They overlap, they've stopped varying, and they're burning run cost to tell you what you already know. Crowded cells are where you graduate saturated capability judges to regression duty, retire the redundant ones, and sharpen what's left into something with headroom.
Reviewing the matrix quarterly is cheap (it's a spreadsheet) and it turns "do we have good evals?" from a feeling into a picture.
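If the spreadsheet ever outgrows manual review, the same check is easy to script. The sketch below assumes each judge is recorded with a grain, a goal, and a recent pass rate. The judge names, the `review_matrix` function, and both thresholds (`crowd_size`, `saturated_at`) are illustrative assumptions rather than recommended values.

```python
from collections import defaultdict
from itertools import product

GRAINS = ("session", "trace", "span", "event")
GOALS = ("capability", "regression")


def review_matrix(judges, crowd_size=3, saturated_at=0.98):
    """Return (gaps, crowded) cells of the grain x goal grid.

    gaps: cells with no judge at all.
    crowded: cells with at least `crowd_size` judges, all at or above
    `saturated_at` pass rate. Both thresholds are assumptions to tune.
    """
    cells = defaultdict(list)
    for judge in judges:
        cells[(judge["grain"], judge["goal"])].append(judge)

    gaps = [cell for cell in product(GRAINS, GOALS) if cell not in cells]
    crowded = [
        cell
        for cell, members in cells.items()
        if len(members) >= crowd_size
        and all(m["pass_rate"] >= saturated_at for m in members)
    ]
    return gaps, crowded


# Hypothetical suite.
judges = [
    {"name": "issue_resolved", "grain": "session", "goal": "capability", "pass_rate": 0.99},
    {"name": "polite_close", "grain": "session", "goal": "capability", "pass_rate": 1.00},
    {"name": "stayed_on_topic", "grain": "session", "goal": "capability", "pass_rate": 0.99},
    {"name": "cites_retrieved_doc", "grain": "trace", "goal": "capability", "pass_rate": 0.84},
    {"name": "tool_call_schema_valid", "grain": "span", "goal": "capability", "pass_rate": 0.91},
    {"name": "sql_executes", "grain": "event", "goal": "regression", "pass_rate": 1.00},
]

gaps, crowded = review_matrix(judges)
print("gaps:", gaps)
print("crowded:", crowded)
```

A lone regression judge at 100% is not flagged as crowded here, since passing is its natural state. Only cells that are both full and saturated are.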
The matrix tells you which cells deserve coverage. It doesn't tell you whether a particular candidate is worth building, and most candidates aren't. That is the final filter, and it is deliberately brutal.
Figure: a four-by-two matrix mapping eval measurement grains (session, trace, span, event) against judge goals (capability and regression), with an example judge and typical scorer in each cell.
The judge-worthiness gauntlet
A candidate judge, wherever it came from, must pass four gates, in order. Most candidates should die here. That's the point: every judge you ship is a permanent line item of run cost and calibration attention, and the suite gets less trustworthy with every judge that doesn't earn its slot.
Gate 1: Severity x Frequency. Is this failure worth the cost of a judge at all? A failure that double-charges customers once a week clears the bar instantly. A formatting quirk in 0.01% of traces does not. If the product of how bad and how often doesn't justify build cost plus per-trace run cost plus upkeep, stop here.
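One crude way to put numbers on this gate is sketched below. Every figure is invented for illustration, the `clears_gate_one` function and its parameters are hypothetical, and the comparison optimistically treats the judge as if it let you prevent all of the harm it detects. Treat it as back-of-envelope arithmetic, not a pricing model.

```python
def clears_gate_one(
    failure_rate,        # share of traces showing the failure
    cost_per_failure,    # rough cost of one occurrence, in currency units
    traces_per_month,
    build_cost,          # one-off cost to build and calibrate the judge
    run_cost_per_trace,  # near zero for code checks, higher for LLM judges
    upkeep_per_month,
    horizon_months=6,    # assumption: how far ahead you are willing to look
):
    """Compare expected harm with the total cost of owning the judge."""
    expected_harm = (
        failure_rate * traces_per_month * cost_per_failure * horizon_months
    )
    judge_cost = build_cost + horizon_months * (
        run_cost_per_trace * traces_per_month + upkeep_per_month
    )
    return expected_harm > judge_cost


# Hypothetical double-charging bug: rare but expensive.
double_charge = clears_gate_one(
    failure_rate=0.001, cost_per_failure=50, traces_per_month=100_000,
    build_cost=2_000, run_cost_per_trace=0.002, upkeep_per_month=200,
)

# Hypothetical formatting quirk in 0.01% of traces: rare and cheap.
formatting_quirk = clears_gate_one(
    failure_rate=0.0001, cost_per_failure=0.05, traces_per_month=100_000,
    build_cost=500, run_cost_per_trace=0.0, upkeep_per_month=50,
)

print(double_charge, formatting_quirk)
```

Even rough estimates are useful here, because the two kinds of candidate tend to land orders of magnitude apart.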
Gate 2: Variance. Could this judge ever fail? If you can't describe a realistic trace that would flip the verdict, the judge measures nothing: it's a smoke detector with no battery. Note that the test is "could it flip", not "does it flip": a judge that mostly passes can still clear this gate, but one that could never fail cannot.
Gate 3: Actionability. When it fires, can anyone do anything with the output? A verdict should point toward a prompt section, a tool definition, a retrieval step, somewhere. If the failure report would just say "be better", stop.
Gate 4: Deterministic check. Before you reach for an LLM judge, ask whether code can do the job. Did the SQL execute? Is the JSON valid? Did the agent call the refund tool without an approval flag? String matching, schema validation, and tool-call inspection are nearly free, perfectly reproducible, and never need calibration. The LLM judge is the fallback for genuinely fuzzy properties, not the default.
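The three questions above can each be answered with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the SQL check runs against an in-memory SQLite database built from a schema you supply (a stand-in for "the real schema"), and the `issue_refund` tool name and `approved` argument are hypothetical.

```python
import json
import sqlite3


def json_is_valid(text):
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):  # JSONDecodeError subclasses ValueError
        return False


def sql_executes(sql, schema_ddl):
    """True if the query runs against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def refund_calls_are_approved(tool_calls):
    """True unless some refund call lacks an explicit approval flag."""
    return all(
        call.get("args", {}).get("approved") is True
        for call in tool_calls
        if call["name"] == "issue_refund"
    )


schema = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"
print(json_is_valid('{"order_id": 17}'))
print(sql_executes("SELECT id, total FROM orders", schema))
print(sql_executes("SELECT customer FROM orders", schema))
print(refund_calls_are_approved([{"name": "issue_refund", "args": {"amount": 40}}]))
```

Each function returns a binary verdict and gives the same answer on every run, so there is nothing to calibrate. Running agent-written SQL against a throwaway in-memory database also keeps the check away from real data.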
The survivors are your suite. Everything that makes it through already has a home: a goal and a grain, a single cell on the matrix from the last section. What's left is to actually build the thing, wiring the judge to your traces and your harness, which is the next spoke.
If you remember nothing else
- 01 Judges come from two places: your traces (bottom-up error analysis) or your spec (top-down). You need both.
- 02 Read your data. Clustered, counted failures beat the failures you imagine every time.
- 03 Every candidate runs the gauntlet in order: severity x frequency, variance, actionability, deterministic-check-first.
- 04 If no realistic trace could flip a judge's verdict, it measures nothing. That is what the variance gate kills.
- 05 Prefer code-based checks; an LLM judge is the fallback for genuinely fuzzy properties, not the default.
- 06 Plot every judge on the grain x goal matrix: gaps are uncovered surface, crowding is dead weight.