Section 05 · What is the point of evals?
Eval Goals
Hillclimbing and floor-raising are different jobs. Label which one each judge does.
Every judge has exactly one job
Before you write a single line of rubric, decide what the judge is for. Ben Hylak's framing: every eval can Benchmark-Maxx or Raise the floor, but never both at once. In plainer terms, every judge in your suite is either a capability judge or a regression judge.
Capability judges measure progress toward something your product can't reliably do yet. Regression judges guard things it already does, so a prompt tweak or model swap doesn't quietly break them. Same machinery (a rubric, a verdict per trace, a pass rate) but opposite shapes:
| | Capability | Regression |
|---|---|---|
| Born from | A spec for what "good" looks like | A production failure you already fixed |
| Healthy score | Starts low, climbs over time | Pinned near 100% |
| Movement means | Progress, or a stalled bet | An alarm |
| Correct reaction | Keep hillclimbing | Page someone |
This isn't taxonomy for its own sake. The two kinds of judges answer different questions and demand different reactions when they move. A capability judge dropping five points is a Tuesday: you tried something and it didn't work. A regression judge dropping five points is an incident: a bug you already killed is back. If you can't tell at a glance which kind of judge just moved, you can't tell which reaction is correct.
Capability judges: built to be hillclimbed
A capability judge starts with a definition of "good" and measures how far away you are. Take a SQL agent: "given a question requiring a multi-table join, the agent writes a query that executes and returns the correct rows." On day one that judge might pass 35% of cases. That's not a failing eval. That's the whole point. You now have a number to climb, and every prompt change, retrieval tweak, and model upgrade gets scored against it.
The defining property: capability judges are falsifiable from a spec. You don't need a single production trace to build one, because the definition of success comes from what the product is supposed to do, not from failures you've observed. This is why they're the only judges you can build pre-launch. They exist before users do.
Two health checks for a capability judge:
- It should start low. If a new capability judge opens at 95%, you measured something you already had. Either the rubric is too easy or you're celebrating the wrong milestone. Tighten it until there's a hill worth climbing.
- It should move when you ship. A capability judge that sits flat through six weeks of changes is telling you one of two things: your changes aren't touching that capability, or the judge isn't sensitive to the property you think it measures. Both are worth knowing; neither is a reason to ignore the number.
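The two health checks above can be sketched as a small function over a judge's pass-rate history. This is a minimal sketch, not a prescribed implementation: the function name `capability_health` and both thresholds (`start_ceiling`, `min_movement`) are assumptions chosen for illustration, and the right values depend on your suite.

```python
def capability_health(
    pass_rates: list[float],
    start_ceiling: float = 0.90,  # assumed cutoff for "opened too high"
    min_movement: float = 0.02,   # assumed cutoff for "sat flat"
) -> list[str]:
    """Return warnings for a capability judge, given its pass rate per run (oldest first)."""
    warnings = []
    # Check 1: it should start low. A high opening score means the capability already existed.
    if pass_rates and pass_rates[0] >= start_ceiling:
        warnings.append("opened high: tighten the rubric until there is a hill to climb")
    # Check 2: it should move when you ship. A flat line means untouched capability
    # or an insensitive judge.
    if len(pass_rates) >= 2 and max(pass_rates) - min(pass_rates) < min_movement:
        warnings.append("flat: changes are not touching this capability, or the judge is insensitive")
    return warnings


print(capability_health([0.35, 0.42, 0.60]))  # a judge that started low and is climbing
print(capability_health([0.95, 0.96]))        # a judge that opened high and barely moved
```

Neither warning tells you which explanation applies; it only flags that the number deserves a look.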
Regression judges: the floor patrol
Regression judges run the same loop in reverse: they start from a failure, not a spec. The lifecycle looks like this: you spot an error in production traces (say, your hotel-booking agent confirming nonrefundable rooms without flagging the cancellation policy), you fix the underlying issue, and then you write a judge that checks for that exact failure on every run. The judge exists so the bug can never come back silently.
Because the bug is already fixed, a healthy regression judge sits at or near 100% from the day it ships. That makes its reading trivial: 100% means the floor is holding, anything less means a known failure mode has returned and someone should get paged. There's no judgment call, no trend analysis, no "let's watch it for a week." A dip is a bug report with the repro attached.
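The "no judgment call" rule is simple enough to write down. The sketch below assumes a run produces a mapping from case id to a binary verdict; the function names and the case ids are hypothetical.

```python
def failed_cases(verdicts: dict[str, bool]) -> list[str]:
    """Ids of cases where the known failure mode showed up again."""
    return sorted(case_id for case_id, passed in verdicts.items() if not passed)


def should_page(verdicts: dict[str, bool]) -> bool:
    """A regression judge alarms on any failure; there is no threshold to tune."""
    return len(failed_cases(verdicts)) > 0


# Hypothetical run of a cancellation-policy regression judge.
run = {"booking-001": True, "booking-002": False, "booking-003": True}
if should_page(run):
    # The failing ids are the repro: they point straight at the traces to read.
    print("page on-call, failing cases:", failed_cases(run))
```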
This is also why regression judges are the easiest judges to write well. You have the failing trace in hand, you know exactly what wrong looks like, and falsifiability comes for free: the failure already happened, so nobody can argue it's hypothetical. If your team is new to evals and you already have production traffic, regression judges built from real errors are the highest-confidence place to start.
The one trap: a regression suite only protects against failures you've already seen. It raises the floor; it never raises the ceiling. Teams that ship only regression judges end up with a product that never gets worse and never gets better.
Binary verdicts, for both jobs
Whatever the goal, each judge should emit a binary verdict per case: this trace passed or it failed. Resist the 1-5 scale. Scores feel more sophisticated, but they smuggle ambiguity into both jobs.
For regression judges the case is open and shut. The judge exists to detect one specific, already-fixed failure: either the failure is present in the trace or it isn't. A 3 out of 5 on "did the cancellation-policy bug come back" is not a measurement, it's a shrug. And alarms need thresholds: "page someone when the score dips below 4.1" invites a quarterly debate about where 4.1 came from. "Page someone when a case fails" doesn't.
For capability judges the argument is subtler but just as real. You still get a number to hillclimb: it's the pass rate across cases, not an average score per case. The difference matters: a pass rate moving from 35% to 60% means an extra quarter of your cases now clear the bar, and you can verify it by reading the newly passing traces. An average score moving from 3.1 to 3.4 means... something got slightly more 3.4-ish. Nobody can pull a trace and prove the judge wrong, which means the metric quietly fails the falsifiability gate.
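To make the pass-rate argument concrete, here is a minimal sketch with made-up data: 20 cases, 7 passing before a change and 12 after. The helper names and case ids are assumptions for illustration. The point is that a pass-rate change always comes with a list of specific traces to read.

```python
def pass_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of cases whose binary verdict is a pass."""
    return sum(verdicts.values()) / len(verdicts)


def newly_passing(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case ids that failed before the change and pass after it."""
    return sorted(c for c, ok in after.items() if ok and not before.get(c, False))


# Hypothetical data: 20 cases, 7 passing before the change and 12 after.
case_ids = [f"case-{i:02d}" for i in range(20)]
before = {c: i < 7 for i, c in enumerate(case_ids)}
after = {c: i < 12 for i, c in enumerate(case_ids)}

print(f"{pass_rate(before):.0%} -> {pass_rate(after):.0%}")
print("read these traces:", newly_passing(before, after))
```

An average of 1-5 scores has no equivalent of `newly_passing`: there is no set of cases that "became 3.4."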
If a property genuinely seems to need a scale, that's usually several binary judges in a trench coat, the same trench coat from the Specific gate. "Response quality: 4/5" decomposes into "cited a source: pass," "answered the actual question: pass," "under the length limit: fail." Each piece is checkable. The composite never was.
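One way to picture the decomposition is a set of named binary checks run side by side. The checks below are deliberately crude stand-ins: the `[source:` marker convention and the 50-word limit are assumptions invented for this sketch, and a property like "answered the actual question" would need a real judge behind it, so it is left out.

```python
from typing import Callable

Check = Callable[[str], bool]

# Hypothetical stand-in checks, each answering one yes/no question about a response.
CHECKS: dict[str, Check] = {
    "cited a source": lambda response: "[source:" in response,
    "under the length limit": lambda response: len(response.split()) <= 50,
}


def run_checks(response: str) -> dict[str, bool]:
    """Return one binary verdict per named property instead of a single composite score."""
    return {name: check(response) for name, check in CHECKS.items()}


verdicts = run_checks("Refunds take five business days [source: refund-policy].")
for name, passed in verdicts.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Each entry can be tracked, labeled, and alerted on separately, which a single "4/5" cannot.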
The unlabeled dashboard problem
Here's how mixing the two goals wrecks a dashboard. Picture a suite of 30 judges with an average pass rate of 76%. Is that good? Unanswerable. Say 12 are regression judges that should read 100% and 18 are capability judges mid-climb. That 76% could be every floor holding while capabilities average 60%: nothing broken, progress on schedule. Or it could be three returned bugs dragging three regression judges down to 40% each (the other nine still at 100%) while the capability judges, now averaging 70%, paper over the damage. Run the arithmetic: both scenarios land on exactly 76%. Same number, opposite realities.
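The arithmetic can be checked in a few lines. The sketch below uses the numbers from the paragraph above; the function name and the grouping of judges into (count, pass rate) pairs are just a convenient representation chosen here.

```python
def blended_pass_rate(groups: list[tuple[int, int]]) -> float:
    """Average pass rate in percent over groups of (judge_count, pass_rate_percent)."""
    total_judges = sum(count for count, _ in groups)
    return sum(count * rate for count, rate in groups) / total_judges


# Scenario A: all 12 regression judges at 100%, 18 capability judges averaging 60%.
floors_holding = [(12, 100), (18, 60)]
# Scenario B: 9 regression judges at 100%, 3 at 40%, 18 capability judges averaging 70%.
bugs_returned = [(9, 100), (3, 40), (18, 70)]

print(blended_pass_rate(floors_holding))  # (1200 + 1080) / 30
print(blended_pass_rate(bugs_returned))   # (900 + 120 + 1260) / 30
```

Both sums come to 2280 over 30 judges, so the blended figure cannot distinguish the two scenarios.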
This is the failure mode from the hub: smart teams skip labeling which camp each judge belongs to, the two kinds end up interleaved on one dashboard, and soon nobody, including the people who built the judges, can answer "are we getting better?" without an archaeology session.
The fix costs almost nothing:
- Tag every judge with its goal, in the judge's name or metadata, at creation time. Not in a doc somewhere, on the judge itself.
- Split the dashboard. Regression judges get a panel where the only interesting state is "not 100%," wired to alerting. Capability judges get a panel read as trend lines, reviewed when you ship changes.
- Never average across the two. Any aggregate that blends a floor metric with a hill metric produces a number with no decision attached to it.
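The three steps above might look like this in code. It is a sketch under assumptions: the `Goal` enum, the `JudgeResult` shape, and the judge names are all hypothetical, and a real setup would hang the label on whatever metadata your eval tooling already stores.

```python
from dataclasses import dataclass
from enum import Enum


class Goal(Enum):
    CAPABILITY = "capability"
    REGRESSION = "regression"


@dataclass(frozen=True)
class JudgeResult:
    name: str
    goal: Goal          # the label lives on the judge, not in a doc
    pass_rate: float    # fraction of cases passing in the latest run


def split_panels(results: list[JudgeResult]) -> dict[Goal, list[JudgeResult]]:
    """Group results by goal so the two kinds are never shown or averaged together."""
    panels: dict[Goal, list[JudgeResult]] = {goal: [] for goal in Goal}
    for result in results:
        panels[result.goal].append(result)
    return panels


def regression_alerts(results: list[JudgeResult]) -> list[str]:
    """Names of regression judges in the only interesting state: not 100%."""
    return [r.name for r in results if r.goal is Goal.REGRESSION and r.pass_rate < 1.0]


results = [
    JudgeResult("cancellation-policy-flagged", Goal.REGRESSION, 0.96),
    JudgeResult("multi-table-join-correct", Goal.CAPABILITY, 0.60),
]
print("alert on:", regression_alerts(results))
print("trend panel:", [r.name for r in split_panels(results)[Goal.CAPABILITY]])
```

Note that nothing here computes a suite-wide average; the omission is deliberate.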
Once split, each panel answers one question crisply. The regression panel answers "did we break anything we'd fixed?" The capability panel answers "is the product getting better?" Together they cover the state of your product. Blended, they cover nothing.
When capability judges graduate
The two camps aren't permanent assignments. The healthiest judges in your suite will switch sides exactly once.
A capability judge that climbs from 35% to 98% and holds there has done its job: the capability landed. Continuing to hillclimb it is wasted effort at best and Goodhart bait at worst: squeezing out the last two points usually means overfitting your product to the judge's rubric rather than improving anything a user would notice.
But don't retire it. Graduate it. Move the judge to the regression panel, flip the expectation from "should climb" to "should hold near 100%," and wire a dip to alerting instead of to a roadmap. The multi-table-join judge that your SQL agent spent a quarter climbing becomes the tripwire that catches the model upgrade which silently breaks joins next year. Nothing about the judge changes: same rubric, same cases, same binary verdicts. What changes is the question it answers and who reacts when it moves.
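Graduation is a relabeling, which a short sketch makes visible. The `Judge` shape, the `graduate` function, and the `min_runs` window are assumptions for illustration; the 98% bar simply echoes the example above and is not a recommended constant.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Judge:
    name: str
    goal: str    # "capability" or "regression"
    rubric: str


def graduate(judge: Judge, pass_rates: list[float], bar: float = 0.98, min_runs: int = 3) -> Judge:
    """Relabel a capability judge as a regression judge once it has held at the bar.

    pass_rates is the judge's history, oldest first. Only the label changes;
    the rubric and name are carried over untouched.
    """
    recent = pass_rates[-min_runs:]
    held = len(recent) == min_runs and all(rate >= bar for rate in recent)
    if judge.goal == "capability" and held:
        return replace(judge, goal="regression")
    return judge


joins = Judge("multi-table-join-correct", "capability", "query executes and returns the correct rows")
graduated = graduate(joins, [0.35, 0.60, 0.98, 0.98, 0.99])
print(graduated.goal, "| rubric unchanged:", graduated.rubric == joins.rubric)
```

After the relabel, the same judge belongs on the regression panel, where a dip goes to alerting instead of the roadmap.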
This graduation path is also why the capability/regression label has to live on the judge and not in someone's head: labels that only exist in tribal knowledge don't survive the handoff. A judge built by the hillclimbing team in March gets read by the on-call engineer in November, and the on-call engineer needs to know that 96% is an alarm, not a pretty good score.
Then go build the next capability judge. The hill you just finished climbing was never the last one.
If you remember nothing else
- 01 Every judge is either a capability judge or a regression judge. Pick one before you write the rubric, never both.
- 02 Capability judges start low and climb; if one opens at 95%, you measured something you already had.
- 03 Regression judges are born from fixed production bugs, sit near 100%, and page someone when they dip.
- 04 Both goals want binary pass/fail per case. 1-5 scores smuggle in ambiguity neither job can afford.
- 05 An unlabeled suite that blends the two makes "are we getting better?" unanswerable. Tag every judge, split the dashboard.
- 06 Saturated capability judges do not retire. They graduate into regression judges.
Further reading