Section 07 · Which judges deserve to exist?

Building Judges

One good judge is better than 20 lousy ones

Bottom-Up versus Top-Down

Bottom-Up judge development processes trace their roots to the Error Analysis camps of Hamel Husain. In this model, a judge should be built only after its failure mode has been witnessed in production. More often than not, these are regression judges. The process typically looks like:

Pull ~100 traces from production
Assign an annotation to the error that you see (if any)
Group the errors into buckets
Build a judge that looks for that specific bucket
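The grouping step above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the annotation records, trace IDs, and bucket names are all hypothetical, and it assumes you have already reviewed each trace by hand and assigned at most one bucket label to it.

```python
from collections import Counter

# Hypothetical annotations from a manual review of production traces.
# "bucket" is None when the reviewer saw no error in that trace.
annotations = [
    {"trace_id": "t1", "bucket": "fabricated_citation"},
    {"trace_id": "t2", "bucket": None},
    {"trace_id": "t3", "bucket": "ignored_empty_retrieval"},
    {"trace_id": "t4", "bucket": "fabricated_citation"},
]


def rank_buckets(annotations):
    """Count how often each error bucket appears, most frequent first."""
    counts = Counter(a["bucket"] for a in annotations if a["bucket"] is not None)
    return counts.most_common()


ranked = rank_buckets(annotations)
for bucket, count in ranked:
    print(f"{bucket}: {count} of {len(annotations)} traces")
```

The ranking gives you a rough priority order: the most frequent buckets are the first candidates for a dedicated judge.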

This approach should be strongly considered and revisited often. By working from your data back to the judges you want to build, you will gain a greater understanding of your product's real behavior in production. Bottom-Up eval development shines in initial eval suite build-outs and in eval maintenance, and it instills excellent business practices (look at your data!).

Top-Down judge development processes are built more from the gut than the database. These evals are meant to measure your product's behavior against an ideal. Also known as spec-driven evals, they mostly convert product commitments into checks. In extreme cases, some practitioners even recommend driving the entire product development process from evals first, a practice known as eval-driven development. The process of top-down judge development typically looks like:

Create initial spec/PRD
Identify commitments within PRD that align with agent behaviors or capabilities
Build judges that capture performance on those capabilities

While this approach should also be strongly considered, it is much harder to get right. Vague or inconsistent definitions within the spec can result in weak or useless judges, and in a world where specs can change week-over-week, it may be challenging to keep your judges synchronized with a rapidly evolving PRD. Most top-down judges are capabilities judges, but there are some instances (must-have guardrails) where regression judges come from top-down thinking.
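One way to notice when judges and the spec drift apart is to record which commitment each judge was built against and flag the judge when that wording changes. The sketch below is an assumption-laden illustration: `SpecLinkedJudge`, the commitment IDs, and the dictionary representation of a PRD are all hypothetical, and a reworded commitment only tells you a judge needs review, not that it is wrong.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Stable hash of a commitment's wording, ignoring case and extra whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class SpecLinkedJudge:
    name: str
    commitment_id: str           # hypothetical ID of a commitment in the PRD
    commitment_fingerprint: str  # wording the judge was originally built against


def stale_judges(judges, current_spec):
    """Return names of judges whose commitment was removed or reworded."""
    stale = []
    for judge in judges:
        text = current_spec.get(judge.commitment_id)
        if text is None or fingerprint(text) != judge.commitment_fingerprint:
            stale.append(judge.name)
    return stale


# Hypothetical spec versions, represented as {commitment_id: wording}.
spec_v1 = {"C1": "Every answer cites a retrieved document."}
spec_v2 = {"C1": "Every factual claim cites a retrieved document."}

judges = [SpecLinkedJudge("cites_source", "C1", fingerprint(spec_v1["C1"]))]

print(stale_judges(judges, spec_v1))  # nothing changed, so no judge is flagged
print(stale_judges(judges, spec_v2))  # the commitment was reworded
```

Running a check like this whenever the spec changes turns "the PRD moved" into a concrete list of judges to recalibrate.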

That said, top-down judges are excellent for measuring progress and defining goals. Maintaining measurement focused on your product's flagship capabilities is always a smart move. These judges require lots of thought, calibration, and effort to get right, but you will be rewarded handsomely if you can do so.

The grain × goal matrix

Once you have identified what you want to measure through your top-down or bottom-up approach, you must choose where to position the judge to get the most signal for your effort.

Grain

Remember, there are four primary grains to consider when choosing the altitude at which to deploy a judge:

Grain	What it covers	Example judge
Session	A whole multi-turn conversation	Did the user's issue get resolved by the end?
Trace	One full run for a single user turn	Did the agent's answer cite a retrieved document?
Span	One step inside a trace	Was the tool call schema-valid?
Event	One action inside a span	Does the SQL the agent wrote execute?

Goal

Remember, your evals should always chase one of two goals: preventing failures or measuring successes. These correspond to regression judges and capabilities judges, respectively.

Goal	What it covers	Example judge
Regression	Identify failures	Did the SQL agent write executable code?
Capabilities	Identify current capabilities	Are factual claims grounded in the cited sources?

Matrix

We can combine these two ideas into a 2-dimensional array (aka the grain × goal matrix). Plotting your judges against this matrix makes it easy to see your current coverage of a given problem and the opportunities for improvement in your eval suite.

Calling on a mental model of this matrix can help tremendously when contemplating how to build your judge:

Moving along the grain axis reveals your first option for scorers: finer grains at the event and span level can often get away with low-cost, deterministic graders, while coarser grains at the trace level and above usually require the nuance and understanding that only model- or human-based graders provide.
Crowded cells typically signal redundancy and are prime opportunities to cut costs, especially at coarser grains.
The matrix guards against ungrounded claims of comprehensive coverage. When anecdotal evidence conflicts with your judge scores, the cause is often a gap in coverage rather than a bad judge.
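The coverage reasoning in the points above can be made concrete by placing each judge in its cell and listing the empty and crowded cells. This is only a sketch: the judge inventory is hypothetical, and the threshold of three judges for "crowded" is an arbitrary assumption you would tune for your own suite.

```python
from collections import defaultdict

GRAINS = ["session", "trace", "span", "event"]
GOALS = ["regression", "capabilities"]

# Hypothetical judge inventory: (judge name, grain, goal).
judges = [
    ("repeated_info", "session", "regression"),
    ("goal_accomplished", "session", "capabilities"),
    ("fabricated_citation", "trace", "regression"),
    ("citation_format", "trace", "regression"),
    ("citation_exists", "trace", "regression"),
    ("destructive_sql", "event", "regression"),
]


def coverage(judges, crowded_at=3):
    """Group judges by (grain, goal) cell and report empty and crowded cells."""
    cells = defaultdict(list)
    for name, grain, goal in judges:
        cells[(grain, goal)].append(name)
    # Empty cells are coverage gaps; crowded cells are candidates for pruning.
    empty = [(g, k) for g in GRAINS for k in GOALS if not cells.get((g, k))]
    crowded = [cell for cell, names in cells.items() if len(names) >= crowded_at]
    return dict(cells), empty, crowded


cells, empty, crowded = coverage(judges)
print("Empty cells:", empty)
print("Crowded cells:", crowded)
```

An empty cell is not automatically a problem, but it should be a deliberate choice rather than an accident.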

Figure: A four-by-two matrix mapping eval measurement grains (session, trace, span, event) against judge goals (regression and capabilities), with an example judge and typical scorer in each cell. Regression judges guard observed failures; capabilities judges measure spec attainment.

Grain	Regression judge (example)	Regression scorer	Capabilities judge (example)	Capabilities scorer
Session (multi-turn conversation)	Did the agent make the user repeat info already given?	LLM judge over full transcript	Was the user's stated goal accomplished by session end?	LLM judge + outcome signal
Trace (one user turn)	Does the response contain a fabricated citation?	URL check + LLM judge	Is every factual claim grounded in a retrieved source?	LLM judge + retrieved docs as privileged context
Span (step within a turn)	Did retrieval return zero docs, yet the agent answered anyway?	Deterministic check on span output	Does the plan cover every sub-task in the request?	LLM judge + request decomposition
Event (atomic action)	Does this SQL call run a destructive statement without confirmation?	Deterministic (regex / AST)	Was the right tool selected for this step?	Classifier vs. labeled set
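To show how cheap a deterministic scorer at the event grain can be, here is a rough regex-based sketch of the destructive-SQL regression judge from the matrix. Several things are assumptions made for illustration: the keyword list that defines "destructive", the `confirmed` flag, and the function name. Splitting on semicolons is naive (it would mis-split a semicolon inside a string literal), so a real check would more likely parse the statement into an AST.

```python
import re

# Assumption: these leading keywords define "destructive" for this sketch.
DESTRUCTIVE = re.compile(r"^\s*(drop|truncate|delete|alter)\b", re.IGNORECASE)


def destructive_without_confirmation(sql: str, confirmed: bool) -> bool:
    """Return True (the judge flags a failure) if any statement is
    destructive and the user did not confirm it."""
    # Naive statement split; does not handle semicolons inside string literals.
    statements = [s for s in sql.split(";") if s.strip()]
    has_destructive = any(DESTRUCTIVE.match(s) for s in statements)
    return has_destructive and not confirmed


print(destructive_without_confirmation("DROP TABLE users;", confirmed=False))
print(destructive_without_confirmation("SELECT deleted FROM users", confirmed=False))
```

A check like this runs in microseconds, which is what makes event-level regression judges practical to run on every call.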

Judge worthiness

Once you have identified the problem you want to solve and where to position your judge, do a final check to assess whether the judge deserves to exist. Without this step, you risk overbuilding your eval suite and diluting its ROI. Run every judge idea through this gauntlet before building it.

Feasible: Are failures severe and/or frequent enough to warrant the cost? This step matters more when operating with more costly graders.
Actionable: Does a failure point to a specific component? When the judge fails, do you know where to look, or is the answer just "be better"? Many judge ideas fail this step.
Specific: Is the failure/success case limited to one binary property? Are we looking at the sum of all the parts or just one part?
Testable: Is the fail/success state definable? Could the verdict change? Judges that always pass or always fail don't really measure much of anything.
A final check for a deterministic alternative is also helpful here: ask whether a code-based check would be good enough. This matters most for online evals, where time is an integral component.
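The gauntlet above can be written down as a simple checklist so that every judge idea gets the same treatment. Treat this as a sketch under stated assumptions: `JudgeProposal`, its field names, and the wording of the outcomes are hypothetical, and the yes/no answers still come from your own judgment. The `verdict_can_change` helper illustrates one narrow reading of "Testable": a judge whose recorded verdicts never vary is not measuring much.

```python
from dataclasses import dataclass


@dataclass
class JudgeProposal:
    name: str
    feasible: bool      # failures severe and/or frequent enough to warrant the cost
    actionable: bool    # a failure points to a specific component
    specific: bool      # limited to one binary property
    testable: bool      # pass/fail state is definable and the verdict could change
    code_check_sufficient: bool = False  # a deterministic check would be good enough


def run_gauntlet(proposal: JudgeProposal) -> str:
    """Reject the proposal if any criterion fails, else pick a grader type."""
    criteria = ("feasible", "actionable", "specific", "testable")
    failed = [c for c in criteria if not getattr(proposal, c)]
    if failed:
        return "reject: fails " + ", ".join(failed)
    if proposal.code_check_sufficient:
        return "build as deterministic check"
    return "build as model- or human-graded judge"


def verdict_can_change(verdicts) -> bool:
    """True if the recorded verdicts contain both passes and failures."""
    return len(set(verdicts)) > 1


vague = JudgeProposal("is_helpful", feasible=True, actionable=False,
                      specific=False, testable=True)
print(run_gauntlet(vague))
```

Recording the rejected proposals alongside the accepted ones also leaves a trail of why each judge in the suite exists.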

Initially, most of your judge ideas will fail this gauntlet (and that's a good thing). Your goal here is to build out muscle memory for thinking through what makes a good judge. Once you have mastered this method, you can rest assured knowing that every judge in your suite has a reason to exist.