Section 02 · What do good evals look like?

FAST Evals

The four gates every eval should clear before it earns a spot in your suite.

Falsifiable

A judge's verdict has to be provably wrong. If two reasonable people can look at the same trace and argue about whether it failed, you don't have an eval yet; you have a vibe with a number attached.

"The response feels unhelpful" is not falsifiable. "The agent confirmed a reservation without checking the party size against the restaurant's maximum" is. You can pull the trace, point at the line, and nobody can argue with you.

The practical test: for any verdict your judge produces, can you construct the counter-example that would flip it? If a failing trace can't be shown to be wrong (or a passing trace shown to be right), the judge will drift toward whatever is cheapest to claim. This matters double for LLM-as-a-judge setups, where an unfalsifiable rubric invites the judge model to hallucinate justifications in either direction.

Actionable

When the judge fires, you should already know roughly where to look. An eval that tells you something is wrong without telling you what to change just relocates your confusion from the product to the dashboard.

The classic offender is the composite score. "Helpfulness dropped from 3.2 to 3.1" triggers a meeting, not a fix. Compare that to "the agent stopped confirming dietary restrictions before booking". That points at a specific prompt section, tool definition, or retrieval step. One of these gets fixed by Friday; the other gets a task force.

A useful habit: before shipping a judge, write down the sentence you'd put in the bug report when it fails. If you can't write that sentence, the judge isn't actionable yet. Decompose it until you can.

Specific

One judge, one property. "Quality" and "Relevance" sound like metrics but behave like fog: a drop could mean failures across a large swath of levers (tone, factuality, formatting, tool use, latency) and you can't tell which knob moved.

Specificity is what makes the other gates achievable. A judge scoped to "does the SQL the agent wrote actually execute?" is trivially falsifiable and instantly actionable. A judge scoped to "is this a good analyst?" is neither.

The smell test for specificity: when this judge's pass rate moves, is there exactly one category of explanation? If a regression could be blamed on three unrelated subsystems, you're looking at three judges wearing a trench coat.

Tractable

Every judge has a price: the cost to build it, the cost to run it on every trace, and the ongoing cost to keep it calibrated. Tractability asks whether the failure mode you're catching is worth that bill.

The mental model is severity times frequency. A failure that bricks a user's account once a week deserves a judge, even an expensive one. A cosmetic glitch that appears in 0.01% of traces probably doesn't: the engineering time is better spent elsewhere, and the judge becomes one more thing that pages someone at 2am.

Tractability is also where deterministic checks earn their keep. If a cheap code-based grader catches 90% of a failure mode, that's usually a better investment than an LLM judge catching 97% at 100x the cost per trace. You can always upgrade later, once the failure mode proves it deserves the spend.

When metrics fail the gates

Not every useful number passes all four gates, and that's fine, as long as you're honest about what each number is for.

Sentiment metrics are the canonical example. "Users seem frustrated this week" isn't immediately actionable and isn't especially specific. But it's still a signal worth tracking, because it tells you where to go digging. The mistake isn't building these metrics; it's treating them like judges: gating releases on them, or assigning them to engineers as if they were bug reports.

A clean way to keep this straight: FAST metrics gate and assign work. Everything else is monitoring, context that informs where you point your FAST evals next. If a metric that fails the gates starts driving decisions on its own, that's your cue to decompose it into judges that pass.

If you remember nothing else

  • 01 Falsifiable: every verdict must be provably wrong. If people can argue about a failure, it is not an eval yet.
  • 02 Actionable: a firing judge should point at the fix. Write the bug-report sentence before you build the judge.
  • 03 Specific: one judge, one property. Composite scores are several judges in a trench coat.
  • 04 Tractable: severity x frequency has to justify build + run + calibration cost.
  • 05 Metrics that fail the gates can still be useful, as monitoring, never as gates.