Section 09 · Is this judge actually measuring what we want it to?

Judge Calibration

Your judge is a model too. Calibrate it like one.

Why calibrate at all

Your judge is an LLM grading another LLM. Nothing about that arrangement guarantees the grades mean anything. The judge will happily produce confident verdicts with consistent formatting and plausible reasoning, none of which tells you whether it's catching the failures you actually care about. An uncalibrated judge is a random-ish number generator with good vibes.

Calibration is the fix, and the idea is simple: take traces that trusted humans have already labeled, run the judge on the same traces, and measure how often they agree. The judge is a model, so you evaluate it like one, against ground truth.

Skip this step and you get a familiar failure pattern. The team ships a hallucination judge, the dashboard goes green, everyone relaxes, and support tickets about made-up refund policies keep arriving. The judge was measuring something (maybe "does the response cite a source", maybe "does it sound confident"), just not the thing on the label. Now you have two problems: the original failure mode, plus a dashboard actively telling you it doesn't exist.

This spoke is the third stage of the loop for a reason. You defined what to measure, then you designed a judge to measure it. Calibration is where you find out whether the judge you built is the judge you designed. Until then, every pass rate it produces is a rumor.

The input side: spot checks and labeling passes

Calibration needs human labels, and there are two ways to get them: one cheap and continuous, one structured and occasional.

The cheap one is spot checking: regularly pull a sample of traces and read them next to the judge's verdicts. Twenty or thirty a week is plenty. You're doing error analysis on the judge itself. Every disagreement between your read and the judge's verdict is a calibration data point, and the pattern of disagreements usually points straight at the broken part of the rubric. Most judge bugs are found this way, by a human going "wait, that's not a hallucination." There is no substitute, and there is nothing cheaper.
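A weekly spot check can be as small as the sketch below. It assumes each trace is a dict with an `id` and a boolean `judge_flagged` field; both names are hypothetical, so swap in whatever your trace store actually exposes. The human read is passed in as a plain mapping from trace id to "is this a real failure?".

```python
import random
from typing import Optional


def draw_spot_check(traces: list[dict], k: int = 25, seed: Optional[int] = None) -> list[dict]:
    """Return up to k traces chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes a given week's sample reproducible
    return rng.sample(traces, min(k, len(traces)))


def judge_disagreements(sample: list[dict], human_flags: dict[str, bool]) -> list[dict]:
    """Keep the sampled traces where the human read differs from the judge's verdict."""
    return [
        t for t in sample
        if t["id"] in human_flags and human_flags[t["id"]] != t["judge_flagged"]
    ]


# Hypothetical data: 200 traces, with the judge flagging every tenth one.
traces = [{"id": f"t{i}", "judge_flagged": i % 10 == 0} for i in range(200)]
sample = draw_spot_check(traces, k=25, seed=7)

# Stand-in for the human read: this reviewer decides nothing in the sample is a real failure.
human_flags = {t["id"]: False for t in sample}
to_review = judge_disagreements(sample, human_flags)
print(f"{len(to_review)} of {len(sample)} sampled verdicts disagree with the human read")
```

The disagreement list is the useful output. Those are the traces to read slowly and hold up against the rubric.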

The structured one is a labeling pass, and the big labs run these as standing programs. You don't need their headcount; you need the shape:

A written rubric, ideally the same one your judge prompt is built from, so humans and judge are grading the same thing.
Trained labelers who've worked through example traces together before labeling solo. "Trained" can mean two engineers and an hour of arguing over ten traces.
Overlap: have multiple people label the same subset so you can check inter-rater agreement (more on that below).
Disagreement resolution: when labelers disagree, fix the rubric so the next person wouldn't.

The output is a labeled set you trust. That set is the ground truth everything in the next two sections gets computed against.

The output side: the confusion matrix family

Once you have labeled traces, judge calibration becomes an ordinary classification problem, and the confusion matrix metrics apply directly.

Concrete example: a hallucination judge, run on 100 traces your team labeled by hand. Humans found 40 hallucinations and 60 clean traces. The judge flags 38 traces: 32 are real hallucinations (true positives), 6 are clean traces wrongly flagged (false positives). It misses 8 real hallucinations (false negatives) and correctly passes 54 clean traces (true negatives).

Metric	The question it answers	In this example
Accuracy	How often is the judge right overall?	86 of 100 = 86%
Precision	When the judge fires, how often is it right?	32 of 38 = 84%
Sensitivity (TPR)	Of the real failures, how many does it catch?	32 of 40 = 80%
Specificity (TNR)	Of the clean traces, how many does it correctly pass?	54 of 60 = 90%
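The four metrics fall straight out of the four cell counts. Here is a minimal sketch that reproduces the example's numbers; the `Confusion` class and the `confusion_from_labels` helper are illustrative names rather than any library's API, and `True` means "this trace is a hallucination".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # judge flagged, humans agree it is a failure
    fp: int  # judge flagged, humans say the trace is clean
    fn: int  # judge passed, humans say it is a failure
    tn: int  # judge passed, humans agree the trace is clean

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def confusion_from_labels(human: list[bool], judge: list[bool]) -> Confusion:
    """Count the four cells from paired labels, treating the human label as ground truth."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    return Confusion(
        tp=sum(h and j for h, j in pairs),
        fp=sum((not h) and j for h, j in pairs),
        fn=sum(h and (not j) for h, j in pairs),
        tn=sum((not h) and (not j) for h, j in pairs),
    )


# The 100-trace example: 32 caught, 8 missed, 6 wrongly flagged, 54 correctly passed.
human = [True] * 32 + [True] * 8 + [False] * 6 + [False] * 54
judge = [True] * 32 + [False] * 8 + [True] * 6 + [False] * 54
cm = confusion_from_labels(human, judge)

print(f"accuracy    {cm.accuracy():.0%}")
print(f"precision   {cm.precision():.0%}")
print(f"sensitivity {cm.sensitivity():.0%}")
print(f"specificity {cm.specificity():.0%}")
```

The methods divide by zero if a denominator is empty (a judge that never fires has no precision), so guard for that before pointing this at a tiny sample.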

Which number you optimize depends on what the judge is for. If it gates releases, sensitivity is king. Every missed failure ships to users. If it files bug reports to engineers, precision is king. False alarms burn trust fast, and a judge that cries wolf gets muted. Most judges need a floor on both, but you should be able to say which one you'd sacrifice first.

One number conspicuously absent from the priority list: accuracy. The next section is about why.

Base rate neglect, or why accuracy lies

That example had a 40% failure rate, which made the arithmetic friendly. Production isn't friendly. Real failure rates for a specific failure mode are often 1-5%, and at those rates, headline accuracy becomes actively misleading. This is base rate neglect, and it bites nearly every team the first time.

Walk the numbers. Your product produces 1,000 traces; 2% contain the failure: 20 bad traces, 980 clean. Your judge is genuinely good: 95% sensitivity, 95% specificity. Roughly 95% accurate. Sounds shippable.

Now watch it run. It catches 19 of the 20 real failures, which is great. But it also wrongly flags 5% of the 980 clean traces: 49 false alarms. So the judge fires 68 times, and it's right on 19 of them. Precision: 28%. That's a "95% accurate" judge that is wrong nearly three times out of four when it fires. Your engineers triage the first dozen flags, find mostly noise, and quietly stop reading the channel. Two weeks later the judge catches something real and nobody looks.
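The same arithmetic works for any base rate. The sketch below is Bayes' rule applied to a judge with fixed sensitivity and specificity; the function name is made up for illustration, and it assumes both rates stay constant as the base rate moves, which is a simplification.

```python
def precision_at_base_rate(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Share of judge firings that are real failures, for a given failure base rate."""
    true_alarms = base_rate * sensitivity                   # real failures that get flagged
    false_alarms = (1.0 - base_rate) * (1.0 - specificity)  # clean traces that get flagged
    fired = true_alarms + false_alarms
    if fired == 0:
        raise ValueError("the judge never fires, so precision is undefined")
    return true_alarms / fired


# Same judge (95% sensitivity, 95% specificity) at progressively rarer failure rates.
for base_rate in (0.40, 0.05, 0.02, 0.01):
    p = precision_at_base_rate(base_rate, sensitivity=0.95, specificity=0.95)
    print(f"base rate {base_rate:.0%}: precision {p:.0%}")
```

At a 2% base rate this returns 19/68, the 28% from the walk-through. The judge didn't get worse between rows; the haystack got bigger.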

What to do about it:

Report precision at the operating point, not accuracy. "When this fires, it's right X% of the time" is the number people actually experience.
Push specificity hard. At a 2% base rate, each point of specificity removes about 10 false alarms per 1,000 traces; each point of sensitivity recovers 0.2 real failures. The leverage is lopsided.
Match the workflow to the precision. A 28%-precision judge is a fine triage filter feeding human review. It is not an auto-filer of bug tickets.
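To see the lopsided leverage in numbers, compare what one extra point of specificity buys against one extra point of sensitivity. This is a sketch on expected counts per 1,000 traces, using the same hypothetical 2% base rate and 95%/95% judge as above; `expected_alarms` is an illustrative helper, not an established API.

```python
def expected_alarms(
    n_traces: int, base_rate: float, sensitivity: float, specificity: float
) -> tuple[float, float]:
    """Return (real failures caught, false alarms) as expected counts."""
    bad = n_traces * base_rate
    clean = n_traces - bad
    caught = bad * sensitivity
    false_alarms = clean * (1.0 - specificity)
    return caught, false_alarms


base_caught, base_false = expected_alarms(1000, 0.02, sensitivity=0.95, specificity=0.95)
_, spec_false = expected_alarms(1000, 0.02, sensitivity=0.95, specificity=0.96)
sens_caught, _ = expected_alarms(1000, 0.02, sensitivity=0.96, specificity=0.95)

print(f"+1 point specificity removes {base_false - spec_false:.1f} false alarms")
print(f"+1 point sensitivity recovers {sens_caught - base_caught:.1f} real failures")

# Precision after each one-point improvement, for comparison with the 28% baseline.
print(f"precision with better specificity: {base_caught / (base_caught + spec_false):.0%}")
print(f"precision with better sensitivity: {sens_caught / (sens_caught + base_false):.0%}")
```

None of this says sensitivity doesn't matter. A release gate still cares about the one failure in twenty that slips through. It only shows where the false-alarm volume comes from.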

Agreement metrics: Cohen's kappa

Raw percent agreement has the same disease as accuracy: it's inflated by chance, badly so when labels are skewed. At a 2% failure rate, a "judge" that stamps PASS on everything agrees with humans 98% of the time while catching zero failures. Cohen's kappa corrects for this: it measures agreement beyond what you'd expect from chance given how skewed the labels are, where 0 means no better than chance and 1 means perfect. That all-PASS judge scores exactly 0.
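For two raters and a binary label, kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement you'd expect by chance from each rater's label frequencies. A minimal sketch follows; the function name is illustrative, and it works on integer counts so the all-PASS case comes out as exactly zero rather than a floating-point near-miss.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning a binary label to the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    agree = sum(a == b for a, b in zip(rater_a, rater_b))  # n * p_o
    a_pos, b_pos = sum(rater_a), sum(rater_b)
    # n^2 * p_e: chance that both say "fail" plus chance that both say "pass".
    chance = a_pos * b_pos + (n - a_pos) * (n - b_pos)
    if chance == n * n:
        raise ValueError("kappa is undefined when both raters give every item the same label")
    # (p_o - p_e) / (1 - p_e), with numerator and denominator both scaled by n^2.
    return (agree * n - chance) / (n * n - chance)


# 1,000 traces at a 2% failure rate, graded by a "judge" that stamps PASS on everything.
human = [True] * 20 + [False] * 980
all_pass = [False] * 1000
raw_agreement = sum(h == j for h, j in zip(human, all_pass)) / len(human)
print(f"raw agreement {raw_agreement:.0%}, kappa {cohens_kappa(human, all_pass):.2f}")

# The 100-trace hallucination example from earlier (32 TP, 8 FN, 6 FP, 54 TN).
human_100 = [True] * 32 + [True] * 8 + [False] * 6 + [False] * 54
judge_100 = [True] * 32 + [False] * 8 + [True] * 6 + [False] * 54
print(f"example judge kappa {cohens_kappa(human_100, judge_100):.2f}")
```

The 100-trace example judge lands at roughly 0.71 by this measure, noticeably below its 86% raw agreement.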

Rough working bands, with the usual caveat that the cutoffs are conventions, not laws:

Below 0.4: the judge is closer to noise than signal. Don't automate anything with it.
0.4 to 0.8: real but unreliable agreement. Read the disagreements and fix the rubric or prompt before trusting it.
0.8 and up: the judge agrees with your humans about as well as careful humans agree with each other. This is the zone where automating the measurement is defensible.

And that comparison point matters more than the bands: measure human-human kappa first. Have two trained labelers grade the same traces independently and compute kappa between them. If your humans only hit 0.5 against each other, the judge has no target. No prompt engineering will make a model agree with a ground truth that doesn't hold still. That's a rubric problem, or a sign the property isn't falsifiable yet, and it sends you back to the FAST gates rather than back to the judge prompt.

Human-human agreement is your judge's ceiling. A judge whose kappa against humans approaches the humans' kappa against each other is done; past that you're tuning noise.

The judge calibration loop

Calibration isn't a ceremony you perform once at launch. It's a loop:

Label a sample of fresh traces with humans, against the written rubric.
Measure agreement: kappa against the human labels, plus precision and sensitivity at your operating point.
Read the disagreements. Every judge-versus-human mismatch is a clue. Usually the fix is in the rubric or the judge prompt: an ambiguous criterion, a missing edge case, a few-shot example teaching the wrong lesson.
Fix and re-measure, on a fresh labeled sample, not the one you just tuned against. Tuning and measuring on the same traces is how you overfit a judge and ship the overconfidence.
When agreement clears your bar, automate, and keep a standing weekly spot-check sample so drift gets caught.
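The "fresh sample" rule in the fix-and-re-measure step is easy to break by accident, so it helps to enforce it mechanically. One way, sketched below under the assumption that every trace has a stable string id: keep a record of the ids you have already tuned against and refuse to measure on any of them. The function names are hypothetical.

```python
def split_fresh(candidate_ids: list[str], tuned_on: set[str]) -> tuple[list[str], list[str]]:
    """Separate a proposed measurement sample into fresh ids and ids already used for tuning."""
    fresh = [t for t in candidate_ids if t not in tuned_on]
    reused = [t for t in candidate_ids if t in tuned_on]
    return fresh, reused


def require_fresh(candidate_ids: list[str], tuned_on: set[str]) -> list[str]:
    """Return the sample unchanged, or fail loudly if any of it was used for tuning."""
    _, reused = split_fresh(candidate_ids, tuned_on)
    if reused:
        raise ValueError(f"{len(reused)} traces were already used to tune the judge: {reused[:5]}")
    return candidate_ids


# Hypothetical ids: the first labeled batch was used to fix the rubric.
tuned_on = {"t001", "t002", "t003"}

# A proposed re-measurement sample that accidentally includes one of them.
proposed = ["t002", "t104", "t105"]
fresh, reused = split_fresh(proposed, tuned_on)
print(f"fresh: {fresh}, reused: {reused}")

# After measuring, anything you go on to tune against joins the record.
tuned_on.update(fresh)
```

Whether you drop the reused traces or reject the whole sample is a judgment call. The point is that the overlap gets noticed before the agreement number gets reported.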

Then recalibrate. On a schedule, sure: monthly, or every N thousand traces. But the mandatory triggers are events, not dates:

The judge's model changes (provider upgrade, model swap, even a quiet API revision).
The judge's prompt or rubric changes, including "minor wording cleanups".
The product shifts under the judge: new feature, new user segment, an upgrade to the application's own model. New distribution of traces means the old calibration sample no longer represents what the judge sees.
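The first two triggers can be detected automatically if you record what the judge looked like when it was last calibrated. A rough sketch: hash the judge's model identifier, prompt, and rubric into a fingerprint, store it next to the calibration results, and compare on every run. The function names and the model strings are placeholders. The third trigger, a shift in the product, can't be read off the judge's config, so it's passed in here as a flag that a human or a release process has to set.

```python
import hashlib
import json


def judge_fingerprint(model: str, prompt: str, rubric: str) -> str:
    """Stable hash of everything that defines the judge's behavior from our side."""
    payload = json.dumps({"model": model, "prompt": prompt, "rubric": rubric}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recalibration(calibrated_fp: str, current_fp: str, product_changed: bool) -> bool:
    """True if the judge differs from the calibrated one, or the product shifted under it."""
    return product_changed or calibrated_fp != current_fp


# Placeholder values, not real model names or prompts.
at_calibration = judge_fingerprint("judge-model-v1", "Flag unsupported claims.", "rubric text")
after_cleanup = judge_fingerprint("judge-model-v1", "Flag any unsupported claims.", "rubric text")

print(needs_recalibration(at_calibration, at_calibration, product_changed=False))
print(needs_recalibration(at_calibration, after_cleanup, product_changed=False))
print(needs_recalibration(at_calibration, at_calibration, product_changed=True))
```

A fingerprint only sees what you put into it. A provider that changes the model behind an unchanged identifier will not trip this check, which is one more reason to keep the weekly spot check running.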

A judge calibrated in March against a product that swapped models in May is measuring last quarter's product. The dashboard will keep updating daily either way, which is exactly the problem.

Figure: The judge calibration loop. Label traces, measure agreement, fix the rubric, re-measure on fresh data, and re-enter the loop on a schedule or whenever the judge model, prompt, or trace distribution changes.

If you remember nothing else

01 An uncalibrated judge produces confident verdicts that mean nothing. Compare it to human labels before trusting a single pass rate.
02 Reading traces next to judge verdicts is the cheapest calibration there is. Do it weekly, forever.
03 Sensitivity is the share of real failures caught; precision is how often a firing judge is right. Know which one you would sacrifice first.
04 At a 2% failure rate, a 95%-accurate judge is wrong most of the times it fires. Precision at the operating point beats headline accuracy.
05 If humans cannot agree with each other, the judge has no target. Measure human-human kappa before judge-human kappa.
06 Recalibrate on a schedule, and always when the judge model, the judge prompt, or the trace distribution changes.