# The Hitchhiker's Guide to Evals
https://shipfastevals.com

A mildly-technical guide to AI evaluations by Ryan Hartman (https://theryanhartman.com).
This file contains the full site content in one document. Hub-and-spoke layout:
the landing page summarizes each section, and each section links to a deep dive.

## Who is this for?

I aim to provide a mildly-technical guide to understanding AI Evaluations (Evals) here. This guide is a synthesis of many excellent Eval focused resources and my own domain knowledge from building evals at Meta & LOOT. Mastering the information here will give you enough knowledge to make you dangerous, at least when it comes to assessing AI models and/or AI-based product quality.

My expectation is that the average person coming to this site is working on an AI-based product or is interested in implementing evals better than vibes. To that end, I won't spend a lot of time explaining why evals are important and will instead focus on what it takes to build evals.

Each section outlined in this main page is discussed in-depth on the linked sub-pages, so feel free to click into any area that interests you. One last note: All the content here is human-written, and I am notoriously bad at [mentally modeling](https://xkcd.com/2501/) what you will know, so I have tried my best to make the site as AI-accessible as possible in hopes that you pass the URL to your favorite chatbot for it can explain any areas I may have glossed over.

## TL;DR

- Good evals are hard but necessary.
- If you're at 100% on all your evals, you're probably doing it wrong.
- Look at your data.
- People are important.
- Your evals should be FAST: Falsifiable, Actionable, Specific, and Tractable.

## The Eval Loop

The day-to-day cycle: Define -> Design -> Calibrate -> Run -> Report -> Improve -> (repeat).
Foot-guns lurk at every stage.

While the loop is what you will practice day-to-day, this page is organized conceptually with each section building upon the last. If you want to jump ahead, feel free to click on a part of the loop that interests you.

Evals can be applied to a wide variety of characteristics of your model or agent. The common types of measured characteristics are:

1. Model Behavior
2. Reasoning
3. Agentic Behavior
4. Knowledge
5. Safety
6. CIA (Confidentiality, Integrity, Availability)

For the bulk of this site we will focus on agentic behavior since that is one of the most common use cases.

---

## 01. Evaluations (https://shipfastevals.com/what-to-measure)
What are you measuring?

Evaluating the quality of your AI-based product presents an entirely different set of challenges than measuring traditional software. Where traditional software is deterministic, i.e., users will always arrive at the same place if they click the buttons in the same order, AI & agent-based products are inherently black boxes. In this new world, inputs and outputs are unbounded by default, creating a massive surface area for unexpected user & agent behaviors.

The core goal of Evals is to provide a window into understanding your model/product.

The best way to do that is by identifying what "good" and "bad" looks like. Once you understand how your product is failing, it becomes much easier to fix. In practice, this is the difference between hearing feedback like "Your product sucks" versus knowing "The agent focused on hotel reservations fails when booking for parties of 4 or more". When you have that kind of information, you can fix the underlying issue then loop through the eval process again to make the product even better.

### Why you can't just write tests

Traditional software is deterministic: click the same buttons in the same order and you land in the same place, every time. That property is what makes testing work. Assert that input X produces output Y, run it in CI, ship with some confidence.

AI products break this contract twice. The inputs are unbounded: users will type things you never imagined, in languages you didn't plan for, with goals you didn't design around. And the outputs are unbounded too: the same prompt can produce a different response on every run, and "different" sometimes means "subtly wrong in a way you haven't seen before." There is no assertEquals for "the agent handled that reasonably."

**Evals** are how you get the testing contract back. An eval takes a real or realistic input, runs your system on it, and applies a **judge**, a piece of code or a model, that decides whether the output was acceptable. String enough of those together and you have something resembling a test suite for a system you can't step through with a debugger.

That's the whole pitch. Evals are the window into the black box. Without them, your quality signal is whatever users happen to complain about, which is late, noisy, and heavily weighted toward your angriest one percent.

### Define 'good' before you measure anything

The trap most teams fall into is starting with metrics. Someone stands up a dashboard showing "average helpfulness: 4.1 / 5" and it feels like progress. But a number without a definition of good attached is decoration. When it drops, nobody knows what broke; when it rises, nobody knows what to keep doing.

The useful unit is the **failure mode**: a specific, repeatable way your product goes wrong. "The hotel agent quotes the refundable-rate cancellation policy on non-refundable bookings." "The SQL agent joins on the wrong key whenever a table has two date columns." Each of those is concrete enough to reproduce, count, and hand to an engineer with a straight face.

This reframes the entire job. You are not computing a quality score; you are building a catalog of the ways your product fails, ranked by how much each one hurts. Progress means crossing entries off the catalog and adding judges that keep them from sneaking back in.

Failure modes come from reading traces. Sit down with real transcripts of your agent working, note every moment that makes you wince, and group the winces. Ten traces in, you'll have themes. Fifty traces in, you'll have a roadmap. No metric will ever hand you that; you have to go look.

### The six properties you can measure

Nearly everything worth knowing about a model or agent falls into one of six buckets:

- **Model behavior**: does the model conduct itself the way you asked: tone, format, language, instruction-following. Example eval: a support bot must reply in the user's language and under 150 words; a cheap judge checks both on every trace.
- **Reasoning**: can it work through a problem instead of pattern-matching to an answer. Example eval: given a refund request that matches two recent orders, does the agent ask which order before issuing anything?
- **Agentic behavior**: does the agent pick the right tools, in the right order, and recover when one fails. Example eval: the hotel agent must call the availability check before it ever calls confirm-booking.
- **Knowledge**: does it know, or correctly retrieve, the facts the task needs. Example eval: quiz a RAG bot with policy questions that have known answers and grade against ground truth.
- **Safety**: does it refuse what it should and stay inside its lane. Example eval: a health-insurance bot must route medication-dosage questions to a pharmacist rather than answering them.
- **CIA**: the security triad. Confidentiality: does it leak user A's data to user B? Integrity: can a prompt injection buried in a retrieved document make it take actions nobody asked for? Availability: does it spiral into tool-call loops under adversarial input? Example eval: seed retrieved pages with injection attempts and check whether the agent obeys them.

Most products need real coverage in three or four of these buckets, not all six. Knowing which buckets matter for yours is half of defining what good looks like.

### Offline and online evals

Evals split into two families based on when they run, and you need both.

**Offline evals** run before a change ships. You keep a fixed dataset of inputs (real traces you've collected, synthetic edge cases, regression cases from past incidents) and every candidate change gets run against it and scored by your judges. Swap the model, rewrite the system prompt, tweak a tool definition: the offline suite tells you whether things got better or worse while the stakes are still zero.

**Online evals** run on production traffic. Judges score live traces, or a sample of them, and alert you when a failure mode starts climbing. Production is where users do things no offline dataset anticipated, and online evals are how you catch a regression before your support queue does.

The two feed each other in a loop. Online evals surface failure modes you didn't predict; those traces get pulled into the offline dataset; the offline suite then guards against that failure forever. A team running only offline evals is confident and blind. A team running only online evals finds every regression after their users do. Offline tells you whether to ship, online tells you what to fix next. Neither answers both questions.

### Why this guide centers on agents

Of the six buckets, **agentic behavior** gets the most airtime in this guide, because that's what most teams are actually shipping. The interesting AI products right now aren't single completions. They're agents: systems that take a goal, make a plan, call tools, read the results, and keep going. The hotel-booking agent, the SQL agent, the support bot that can actually issue the refund instead of apologizing about it.

Agents are also the hardest thing on the list to evaluate. A chat completion gives you one input and one output to grade. An agent gives you a **trace**: a multi-step transcript of model calls, tool calls, and intermediate decisions, where things can go wrong at every step, and where an early wrong turn can still produce a confident, plausible-looking final answer. The agent that books the wrong hotel politely is a worse failure than the one that errors out, and a final-answer check alone won't catch it.

The good news is that the hard case contains all the easy ones. A single-turn classifier is just an agent with one step, so everything ahead (datasets, judges, calibration, reporting) transfers down. The rest of the spokes assume you're working with agent traces, because that's both the common case and the one where sloppy evals hurt the most.

Key takeaways:
- AI products are black boxes with unbounded inputs and outputs. Evals are the window you build into them.
- Define good and bad before picking metrics. A score without a definition of good is decoration.
- Failure modes are the unit of progress: specific, repeatable, countable ways your product goes wrong.
- Six buckets cover what you can measure: model behavior, reasoning, agentic behavior, knowledge, safety, and CIA.
- Offline evals catch the regressions you predicted before they ship; online evals catch the ones you didn't.
- Agentic behavior is the common case and the hard one. If you can evaluate a ten-step trace, one step is easy.

Further reading:
- Anthropic: Demystifying evals for AI agents: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

---

## 02. FAST Evals (https://shipfastevals.com/fast)
What do good evals look like?

But knowing generally what you want to measure doesn't really tell us how to measure it. FAST evals solve this by defining a rubric for what a good judge should look like:

- **Falsifiable** evals help us only look at problems that can be undoubtedly proven wrong.
- **Actionable** evals help us know what to change when the judge finds a failure mode.
- **Specific** evals limit the measurement to a single property. Generic evals like "Quality" & "Relevance" fail this check since they could mean failures across a large swath of levers.
- **Tractable** evals are economically viable judges where the severity or frequency of failures make a judge worth building.

Not all evals will pass all 4 gates all of the time. There are many edge cases where metrics must be created that are not immediately actionable (e.g., sentiment metrics) but are still useful for your overall understanding of your product.

### Falsifiable

A judge's verdict has to be provably wrong. If two reasonable people can look at the same trace and argue about whether it failed, you don't have an eval yet; you have a vibe with a number attached.

"The response feels unhelpful" is not falsifiable. "The agent confirmed a reservation without checking the party size against the restaurant's maximum" is. You can pull the trace, point at the line, and nobody can argue with you.

The practical test: for any verdict your judge produces, can you construct the counter-example that would flip it? If a failing trace can't be shown to be wrong (or a passing trace shown to be right), the judge will drift toward whatever is cheapest to claim. This matters double for LLM-as-a-judge setups, where an unfalsifiable rubric invites the judge model to hallucinate justifications in either direction.

### Actionable

When the judge fires, you should already know roughly where to look. An eval that tells you something is wrong without telling you what to change just relocates your confusion from the product to the dashboard.

The classic offender is the composite score. "Helpfulness dropped from 3.2 to 3.1" triggers a meeting, not a fix. Compare that to "the agent stopped confirming dietary restrictions before booking". That points at a specific prompt section, tool definition, or retrieval step. One of these gets fixed by Friday; the other gets a task force.

A useful habit: before shipping a judge, write down the sentence you'd put in the bug report when it fails. If you can't write that sentence, the judge isn't actionable yet. Decompose it until you can.

### Specific

One judge, one property. "Quality" and "Relevance" sound like metrics but behave like fog: a drop could mean failures across a large swath of levers (tone, factuality, formatting, tool use, latency) and you can't tell which knob moved.

Specificity is what makes the other gates achievable. A judge scoped to "does the SQL the agent wrote actually execute?" is trivially falsifiable and instantly actionable. A judge scoped to "is this a good analyst?" is neither.

The smell test for specificity: when this judge's pass rate moves, is there exactly one category of explanation? If a regression could be blamed on three unrelated subsystems, you're looking at three judges wearing a trench coat.

### Tractable

Every judge has a price: the cost to build it, the cost to run it on every trace, and the ongoing cost to keep it calibrated. Tractability asks whether the failure mode you're catching is worth that bill.

The mental model is severity times frequency. A failure that bricks a user's account once a week deserves a judge, even an expensive one. A cosmetic glitch that appears in 0.01% of traces probably doesn't: the engineering time is better spent elsewhere, and the judge becomes one more thing that pages someone at 2am.

Tractability is also where deterministic checks earn their keep. If a cheap code-based grader catches 90% of a failure mode, that's usually a better investment than an LLM judge catching 97% at 100x the cost per trace. You can always upgrade later, once the failure mode proves it deserves the spend.

### When metrics fail the gates

Not every useful number passes all four gates, and that's fine, as long as you're honest about what each number is for.

Sentiment metrics are the canonical example. "Users seem frustrated this week" isn't immediately actionable and isn't especially specific. But it's still a signal worth tracking, because it tells you *where to go digging*. The mistake isn't building these metrics; it's treating them like judges: gating releases on them, or assigning them to engineers as if they were bug reports.

A clean way to keep this straight: FAST metrics gate and assign work. Everything else is monitoring, context that informs where you point your FAST evals next. If a metric that fails the gates starts driving decisions on its own, that's your cue to decompose it into judges that pass.

Key takeaways:
- Falsifiable: every verdict must be provably wrong. If people can argue about a failure, it is not an eval yet.
- Actionable: a firing judge should point at the fix. Write the bug-report sentence before you build the judge.
- Specific: one judge, one property. Composite scores are several judges in a trench coat.
- Tractable: severity x frequency has to justify build + run + calibration cost.
- Metrics that fail the gates can still be useful, as monitoring, never as gates.

---

## 03. Observability (https://shipfastevals.com/observability)
Can you trace what happened?

Before we can measure anything at all, we must log it. Unlike traditional deterministic software, agentic systems are unbounded in their inputs and outputs. Across longer time-horizons, this unbounded-ness can nullify the benefits of traditional event-based logging. Steps within spans and traces directly affect the steps that come after, so establishing the causal chain amongst your events is crucial to understanding your product.

Judges placed at the right part of the chain can help capture the effects of any changes you make long before it would be surfaced by your users.

### Log it before you can measure it

Every eval in this guide assumes one thing that almost nobody sets up first: the run was captured. A judge reads a trace and renders a verdict, but if the trace was never logged, there is nothing to read. You can have the sharpest rubric in the world and it grades thin air.

Deterministic software lets you cheat here. If a function misbehaves, you re-run it with the same inputs and watch it misbehave again, because the same inputs always produce the same execution. Agents don't grant you that. The same prompt can take a different path on every run, the agent mutates state as it goes, and the model version under you can change next Tuesday. The run is not reproducible from its inputs, which means **the run itself is the artifact**. If you didn't record it as it happened, it's gone, and so is any chance of judging it.

This is why observability comes before measurement grains, before scoring, before any judge. It is the substrate the rest of the loop stands on. The teams that struggle hardest with evals are usually not the ones with bad judges; they're the ones who went to build a judge, opened their logs, and found the one field they needed was never written down.

### Capture the chain, not just the events

Traditional logging is event-based: discrete, independent records. "Request received." "Tool called." "Response sent." For request-response software that's enough, because each event stands alone and the next one doesn't much care what the last one did.

Agents break that assumption. Step N's output is step N+1's input. The model reads a tool result, decides what to do next, reads the next result, and on and on. A failure eight steps back doesn't announce itself; it surfaces as a confident, plausible, wrong final answer, and by then the event that caused it is buried in a pile of unrelated events. A flat event log can tell you *what* happened. It usually can't tell you *what caused what*, and causation is the whole question.

The fix is to log the structure, not just the moments. The grains spoke lays out the altitudes, sessions hold traces, traces hold spans, spans hold events; observability is what records the **edges** between them: which span spawned which, what each step consumed, what it produced, where it branched. Stitch those together and a trace stops being a timeline and becomes a tree you can walk from a bad outcome back to the decision that caused it. Across long, multi-turn sessions that walkability is the difference between "the agent got worse this week" and "the agent started trusting a stale retrieval result on turn three."

[Flowchart: Tracing the causal chain]

### Put the judge where the signal is

Once the chain is captured, you get to choose *where* on it a judge looks, and the choice decides how early you find out something broke.

Hang every judge off the final response and you only ever catch failures at the end, after the agent has already acted on the mistake. Put a judge on the span where the decision actually happens and you catch the failure at its source. Take a research agent that retrieves a poisoned document and then reasons confidently from it: a judge on the retrieval span flags the bad source the instant it lands, while a judge that only reads the final answer sees a fluent summary and shrugs. Same failure, two very different warning times.

That's the practical reading of the idea that judges placed at the right part of the chain catch the effects of a change long before users do. A regression doesn't have to ride all the way out to a user complaint; a span-level judge sitting next to where the behavior lives can trip the moment a deploy changes it. The constraint is the obvious one: you can only judge a step you logged as its own distinct span. Instrumentation granularity sets the ceiling on judge placement, so the decision of what to capture in this stage quietly bounds every judge you'll be able to build later.

### What to actually capture

"Log everything" is the right instinct and useless as instruction. Concretely, a trace you can eval needs:

- **The real inputs.** The full system prompt, the user message, and whatever context got injected, retrieved chunks, memory, tool schemas. Not a summary of them.
- **The raw outputs.** What the model actually emitted, including the parts the UI hides. The cleaned-up chat bubble is where failures go to look fine; judge the raw output, not the rendered one.
- **Every tool interaction.** Tool name, the exact arguments the agent passed, and the exact result it got back. Most agent failures are tool-call failures, and they're invisible if you only logged that a tool was "called."
- **The intermediate state.** Reasoning traces, scratchpad steps, plan revisions, the points where the agent changed its mind. This is where process-level judges live.
- **The stitching IDs.** A trace id, span ids, and parent pointers, so the chain from the previous section can actually be reconstructed instead of guessed at from timestamps.
- **The reproducibility metadata.** Model version (the exact snapshot, never "latest"), prompt version, temperature, and timestamps. The scoring spoke's "pin everything" rule starts here: a score you can't attribute to a known configuration is a rumor.

The test is simple: could you replay this run, or explain exactly why it failed, from your logs alone? If not, you have telemetry, not observability, and the gap is precisely the failures you didn't think to capture.

### Instrument before you need it

Observability has an asymmetry that punishes procrastination: you can add a judge to last month's traces, but you cannot add logging to them. The data either exists or it doesn't. By the time a failure mode is interesting enough to measure, the traces that would have characterized it have already streamed past ungrabbed.

So the cost of under-instrumenting isn't paid today; it's paid the day you go looking for a signal and discover the relevant field was never written. And it's paid worst on exactly the failures you most want, the surprising ones, because those are by definition the ones you didn't build a bespoke log line for in advance. Rich, structured capture from the start is what lets error analysis (the bottom-up half of building judges) find failures you never predicted.

This is also the substrate online evals run on. Catching a regression in production before your support queue does means judges reading live traces as they happen, and that only works if production is already emitting full, chain-linked traces. Wire the instrumentation in early, decide your sampling and retention deliberately (you rarely need to keep 100% of traffic forever, but you do need an honest sample of the hard cases), and treat the trace schema as part of the product. Everything downstream in this loop is only as good as what this stage bothered to write down.

Key takeaways:
- You can only judge what you logged. Observability is the precondition for every eval downstream, which is why it comes before grain, scoring, and judges.
- Event logs record what happened; the causal chain records what caused what. Agents need the chain because each step feeds the next.
- Place judges at the point in the chain where a failure is born, not just on the final output, to catch effects before users do.
- Capture the raw run: full inputs, tool arguments and results, intermediate state, stitching IDs, and version metadata. Not the rendered UI text.
- Instrument before you need it. You can't retrofit a trace you never captured, and the failures worth catching are the ones you didn't predict.

Further reading:
- OpenTelemetry: Traces (spans, context, and causal links): https://opentelemetry.io/docs/concepts/signals/traces/

---

## 04. Measurement Grains (https://shipfastevals.com/grains)
How does altitude affect our evals?

Understanding the level of granularity you want to measure is critical for determining how you are going to build your judge. There are 4 main levels of measurement:

1. **Sessions**: The entirety of a multi-turn conversation between a user and the agent
2. **Traces**: The transcript of a single user turn (also called a trajectory and may contain multiple agent steps)
3. **Spans**: Each step within a trace
4. **Events**: Each action taken within a Span

There is a value vs. actionability tradeoff as you get more refined in your measurement grain. The value of improving the pass rates of judges at coarser levels like Sessions and Traces is readily apparent to end users, yet these judges are typically hard to action on. On the other hand, more granular judges that only look at one step in the agent's trajectory will be highly actionable but may not be as useful by themselves.

There are also cost and complexity tradeoffs as you refine your measurement grain. While the most granular levels (Spans & Events) can be cheaply checked with deterministic evals (e.g., "does this sql execute?", "is this the correct tool call format"), the coarser levels often require costly LLM-as-a-judge classification to capture all the potential nuances.

### Four altitudes, one conversation

Take one support interaction and slice it four ways. A user opens a chat with your billing agent, asks why they were double charged, gets an explanation, asks for a refund, and gets one. Same interaction, four grains:

- **Session**: the entire multi-turn conversation, from "why was I charged twice?" to "refund issued, anything else?" A session-level question: did the user leave with their problem solved?
- **Trace**: everything triggered by one user turn, also called a **trajectory**. The turn "can I get a refund?" might kick off five agent steps: look up the account, check refund eligibility, query the payments database, call the refund API, draft the reply. All of that is one trace.
- **Span**: a single step within that trace. The database query is a span. The refund API call is a span. A span-level question: did the eligibility check use the right account ID?
- **Event**: an individual action inside a span. Within the database-query span, the SQL string the agent wrote is an event, and so is the parsed result. An event-level question: does this SQL even execute?

Nothing about the interaction changed between these four views, only the altitude you're measuring from. That choice matters because it determines everything downstream: what a "failure" means, what kind of grader can detect it, what it costs to run, and who can act on the result. Pick the grain before you write a single line of rubric.

### The value vs. actionability tradeoff

Coarse grains measure what users actually feel. If your session-level resolution rate climbs from 70% to 85%, customers notice, support tickets drop, and somebody's dashboard turns green. The catch: when a session-level judge fails, you've learned that *somewhere* in a forty-turn conversation, across twenty traces and a hundred spans, something went wrong. Good luck assigning that to an engineer.

Fine grains flip the tradeoff. "The agent passed a string where the refund API expects an integer" is a bug report you can fix before lunch. But each individual event-level failure is low-stakes on its own: the API call gets retried, the user never notices, and a 2% malformed-call rate might not be why your resolution rate is stuck at 70%.

So neither end of the spectrum is sufficient. Coarse judges tell you whether the product works; fine judges tell you what to change. The connective tissue is decomposition: when a session fails, you want trace- and span-level judges already in place so you can localize the failure instead of re-reading transcripts. A suite that only measures sessions produces meetings. A suite that only measures events produces a wall of green checkmarks on a product nobody likes using.

### The cost and complexity tradeoff

Grain choice also decides what kind of grader you can get away with, and what it costs to run on every trace.

| Grain | Example check | Typical grader | Cost per check |
| --- | --- | --- | --- |
| Event | Does this SQL execute? | Deterministic code | Effectively free |
| Span | Is the tool call schema-valid? | Deterministic code | Effectively free |
| Trace | Did the agent resolve the user's request? | LLM-as-a-judge | Model call(s) |
| Session | Did the conversation stay coherent across turns? | LLM-as-a-judge, long context | Expensive model calls |

At the event and span level, "correct" is usually checkable by a program: run the SQL, validate the JSON against the tool schema, regex the date format. These checks are fast, deterministic, and never need calibration: a wrong verdict is a bug in your checker, not a judgment call.

At the trace and session level, "correct" stops being mechanical. Whether the agent actually resolved the request depends on context, phrasing, and what the user meant, nuance that only an LLM judge can classify, and even then only after you've calibrated it against human labels. You're paying a model call (sometimes several, over long contexts) per evaluation, plus the ongoing cost of keeping the judge honest.

The practical consequence: deterministic span and event checks are cheap enough to run on 100% of production traffic, while session-level judges often run on a sample. Build accordingly.

### Which grain does the failure live at?

When you find a failure mode worth measuring, the first design question isn't "what's the rubric?". It's "what's the smallest grain where this failure is fully visible?"

Work it from the bottom up. Malformed tool call? That's visible in a single event. Build a deterministic event check and stop. Agent queried the wrong table? Visible in one span. Agent answered the question but ignored half of what the user asked? You need the whole trace, because no single span is wrong: the failure is in what's missing. Agent contradicted something it said ten turns ago, or kept re-asking for the order number? Only a session-level judge can see that, because the evidence is spread across turns.

Measuring above the failure's natural grain wastes money and dilutes the signal: a session judge that exists to catch SQL errors is an expensive, noisy way to do what a one-line check does perfectly. Measuring below it is worse: the judge literally cannot see the failure, and you get confident green checkmarks on broken behavior.

This is also why real suites end up with judges at several grains. Your failure modes don't all live at one altitude, so your judges can't either.

[Flowchart: Picking a measurement grain]

### How grains interact with the FAST gates

Grain choice and the FAST gates pull on each other, mostly in one direction: the coarser the grain, the harder the gates get.

A session-level "conversation quality" judge is the canonical multi-gate failure. It flunks **Specific** because quality could mean tone, accuracy, latency, or formatting, several judges in a trench coat, again. It flunks **Actionable** because a score drop points at nothing in particular. And it strains **Falsifiable**, because two reviewers can read the same conversation and disagree about whether it was "good." None of this means session-level measurement is hopeless; it means session-level judges have to be scoped to one property that genuinely lives at that altitude: "did the agent re-ask for information the user already provided?" is session-level *and* passes all four gates.

Fine grains have the opposite profile. Span and event checks are falsifiable and specific almost by construction: the SQL executed or it didn't. Their gate to watch is **Tractable** in reverse: checks this cheap are tempting to build by the dozen, and fifty event checks nobody triages is its own failure mode.

The pattern that works: a small number of carefully scoped coarse judges that track what users feel, decomposed into cheap fine-grained checks that tell you what to fix. The coarse layer finds problems; the fine layer assigns them.

Key takeaways:
- Four grains, one interaction: sessions hold traces, traces hold spans, spans hold events. Pick the altitude before writing the rubric.
- Coarse grains measure what users feel but resist action; fine grains are instantly fixable but individually low-stakes.
- Spans and events are usually checkable with free deterministic code; traces and sessions usually need a calibrated LLM judge.
- Measure at the smallest grain where the failure is fully visible: coarser is wasteful, finer is blind.
- A session-level "quality" judge fails Specific by default. Scope coarse judges to one property that genuinely lives at that altitude.
- Real suites run judges at several grains: coarse judges find problems, fine judges assign them.

---

## 05. Eval Goals (https://shipfastevals.com/goals)
What is the point of evals?

As [Ben Hylak](https://www.howtoeval.com/#goal) put it, each eval can "Benchmark-Maxx" or "Raise the floor" but not both at once. In more common language, this delineates between Capability Evals and Regression Evals.

- **Capability evals** are built to be hillclimbed. In other words, these types of evals are meant to identify what "good" outcomes look like and define the metric that can be used to measure progress towards that goal. These types of evals will start low and should increase over time.
- **Regression evals** are built to protect from known failure modes. These types of evals are typically built by identifying common errors in production then creating a judge to test for those errors after the underlying issue has been fixed. Regression evals should stay near 100%.

A common failure mode I have seen incredibly smart people make at Meta has been a failure to explicitly denote which camp each of their judges belongs to. When regression and capability judges get mixed up, it can be hard to make sense of the state of your product.

### Every judge has exactly one job

Before you write a single line of rubric, decide what the judge is for. Ben Hylak's framing: every eval can **Benchmark-Maxx** or **Raise the floor**, but never both at once. In plainer terms, every judge in your suite is either a **capability judge** or a **regression judge**.

Capability judges measure progress toward something your product can't reliably do yet. Regression judges guard things it already does, so a prompt tweak or model swap doesn't quietly break them. Same machinery (a rubric, a verdict per trace, a pass rate) but opposite shapes:

| | Capability | Regression |
| --- | --- | --- |
| Born from | A spec for what "good" looks like | A production failure you already fixed |
| Healthy score | Starts low, climbs over time | Pinned near 100% |
| Movement means | Progress, or a stalled bet | An alarm |
| Correct reaction | Keep hillclimbing | Page someone |

This isn't taxonomy for its own sake. The two kinds of judges answer different questions and demand different reactions when they move. A capability judge dropping five points is a Tuesday: you tried something and it didn't work. A regression judge dropping five points is an incident: a bug you already killed is back. If you can't tell at a glance which kind of judge just moved, you can't tell which reaction is correct.

### Capability judges: built to be hillclimbed

A capability judge starts with a definition of "good" and measures how far away you are. Take a SQL agent: "given a question requiring a multi-table join, the agent writes a query that executes and returns the correct rows." On day one that judge might pass 35% of cases. That's not a failing eval. That's the whole point. You now have a number to climb, and every prompt change, retrieval tweak, and model upgrade gets scored against it.

The defining property: capability judges are **falsifiable from a spec**. You don't need a single production trace to build one, because the definition of success comes from what the product is supposed to do, not from failures you've observed. This is why they're the only judges you can build pre-launch. They exist before users do.

Two health checks for a capability judge:

- **It should start low.** If a new capability judge opens at 95%, you measured something you already had. Either the rubric is too easy or you're celebrating the wrong milestone. Tighten it until there's a hill worth climbing.
- **It should move when you ship.** A capability judge that sits flat through six weeks of changes is telling you one of two things: your changes aren't touching that capability, or the judge isn't sensitive to the property you think it measures. Both are worth knowing; neither is a reason to ignore the number.

### Regression judges: the floor patrol

Regression judges run the same loop in reverse: they start from a failure, not a spec. The lifecycle looks like this: you spot an error in production traces (say, your hotel-booking agent confirming nonrefundable rooms without flagging the cancellation policy), you fix the underlying issue, and *then* you write a judge that checks for that exact failure on every run. The judge exists so the bug can never come back silently.

Because the bug is already fixed, a healthy regression judge sits at or near 100% from the day it ships. That makes its reading trivial: 100% means the floor is holding, anything less means a known failure mode has returned and someone should get paged. There's no judgment call, no trend analysis, no "let's watch it for a week." A dip is a bug report with the repro attached.

This is also why regression judges are the easiest judges to write well. You have the failing trace in hand, you know exactly what wrong looks like, and falsifiability comes for free: the failure already happened, so nobody can argue it's hypothetical. If your team is new to evals and you already have production traffic, regression judges built from real errors are the highest-confidence place to start.

The one trap: a regression suite only protects against failures you've already seen. It raises the floor; it never raises the ceiling. Teams that ship only regression judges end up with a product that never gets worse and never gets better.

### Binary verdicts, for both jobs

Whatever the goal, each judge should emit a **binary verdict per case**: this trace passed or it failed. Resist the 1-5 scale. Scores feel more sophisticated, but they smuggle ambiguity into both jobs.

For regression judges the case is open and shut. The judge exists to detect one specific, already-fixed failure: either the failure is present in the trace or it isn't. A 3 out of 5 on "did the cancellation-policy bug come back" is not a measurement, it's a shrug. And alarms need thresholds: "page someone when the score dips below 4.1" invites a quarterly debate about where 4.1 came from. "Page someone when a case fails" doesn't.

For capability judges the argument is subtler but just as real. You still get a number to hillclimb: it's the pass rate across cases, not an average score per case. The difference matters: a pass rate moving from 35% to 60% means an extra quarter of your cases now clear the bar, and you can verify it by reading the newly passing traces. An average score moving from 3.1 to 3.4 means... something got slightly more 3.4-ish. Nobody can pull a trace and prove the judge wrong, which means the metric quietly fails the falsifiability gate.

If a property genuinely seems to need a scale, that's usually several binary judges in a trench coat, the same trench coat from the Specific gate. "Response quality: 4/5" decomposes into "cited a source: pass," "answered the actual question: pass," "under the length limit: fail." Each piece is checkable. The composite never was.

### The unlabeled dashboard problem

Here's how mixing the two goals wrecks a dashboard. Picture a suite of 30 judges with an average pass rate of 76%. Is that good? Unanswerable. Say 12 are regression judges that should read 100% and 18 are capability judges mid-climb. That 76% could be every floor holding while capabilities average 60%, nothing broken, progress on schedule. Or it could be three returned bugs dragging three regression judges down to 40% while the capability judges, now averaging 70%, paper over the damage. Run the arithmetic: both scenarios land on exactly 76%. Same number, opposite realities.

This is the failure mode from the hub: smart teams skip labeling which camp each judge belongs to, the two kinds end up interleaved on one dashboard, and soon nobody, including the people who built the judges, can answer "are we getting better?" without an archaeology session.

The fix costs almost nothing:

- **Tag every judge** with its goal, in the judge's name or metadata, at creation time. Not in a doc somewhere, on the judge itself.
- **Split the dashboard.** Regression judges get a panel where the only interesting state is "not 100%," wired to alerting. Capability judges get a panel read as trend lines, reviewed when you ship changes.
- **Never average across the two.** Any aggregate that blends a floor metric with a hill metric produces a number with no decision attached to it.

Once split, each panel answers one question crisply. The regression panel answers "did we break anything we'd fixed?" The capability panel answers "is the product getting better?" Together they cover the state of your product. Blended, they cover nothing.

### When capability judges graduate

The two camps aren't permanent assignments. The healthiest judges in your suite will switch sides exactly once.

A capability judge that climbs from 35% to 98% and holds there has done its job: the capability landed. Continuing to hillclimb it is wasted effort at best and Goodhart bait at worst: squeezing out the last two points usually means overfitting your product to the judge's rubric rather than improving anything a user would notice.

But don't retire it. **Graduate it.** Move the judge to the regression panel, flip the expectation from "should climb" to "should hold near 100%," and wire a dip to alerting instead of to a roadmap. The multi-table-join judge that your SQL agent spent a quarter climbing becomes the tripwire that catches the model upgrade which silently breaks joins next year. Nothing about the judge changes: same rubric, same cases, same binary verdicts. What changes is the question it answers and who reacts when it moves.

This graduation path is also why the capability/regression label has to live on the judge and not in someone's head: labels that only exist in tribal knowledge don't survive the handoff. A judge built by the hillclimbing team in March gets read by the on-call engineer in November, and the on-call engineer needs to know that 96% is an alarm, not a pretty good score.

Then go build the next capability judge. The hill you just finished climbing was never the last one.

Key takeaways:
- Every judge is either a capability judge or a regression judge. Pick one before you write the rubric, never both.
- Capability judges start low and climb; if one opens at 95%, you measured something you already had.
- Regression judges are born from fixed production bugs, sit near 100%, and page someone when they dip.
- Both goals want binary pass/fail per case. 1-5 scores smuggle in ambiguity neither job can afford.
- An unlabeled suite that blends the two makes "are we getting better?" unanswerable. Tag every judge, split the dashboard.
- Saturated capability judges do not retire. They graduate into regression judges.

Further reading:
- Ben Hylak, "How to Eval": Goal: https://www.howtoeval.com/#goal

---

## 06. Scoring (https://shipfastevals.com/scoring)
How do I grade the outputs?

There are 3 ways to score the outputs of a session, trace, span, or event. In order of cost they are Deterministic (Code-based graders), Probabilistic (ML-based graders), and Human (carbon-based graders).

- **Deterministic graders** are fast, cheap, easy to debug, and reproducible. But they often are overly specific and work best at the most granular levels of measurement
- **Probabilistic graders** are more expensive than code-based graders. In return for the added cost, these graders are more flexible, more scalable, and can handle nuance. They typically require human calibration before implementation.
- **Human graders** are mostly the most accurate, however, they are limited in scalability and are the most expensive. Furthermore not all human graders have the right exposure/context to accurately measure differences in agent outputs

Most evaluation suites consist of some combination of all three types. Lean on deterministic graders whenever possible as these are the easiest to produce and check.

### The three grader families

Every verdict in your suite comes from one of three grader families, listed here in order of cost.

**Deterministic graders** are code. They come in two flavors: **heuristics** (regex matches, exact-match checks, execution checks like "does the SQL run", "does the JSON parse", "did the agent actually call the refund tool") and **statistics**, distribution checks that catch population-level drift: response lengths suddenly doubling, refusal rates creeping up, tool-call counts collapsing. Deterministic graders are fast, nearly free, debuggable line by line, and perfectly reproducible. Their weakness is literal-mindedness: they grade exactly what you wrote, which is rarely exactly what you meant.

**Probabilistic graders** are models grading models. The two main species are **trained classifiers**, binary (toxic or not, on-topic or not) and multi-class (which of your seven failure categories does this trace fall into), and **LLM-as-a-judge**, where you hand a model the trace plus a rubric and ask for a verdict. They handle nuance and scale to anything you can describe in a prompt, but they cost real money per trace and need human calibration before you can trust a number they emit.

**Human graders** are the gold standard and the bottleneck. Expensive, slow, and (the part people forget) not interchangeable. A crowdworker can grade tone; only someone with domain context can tell you whether the agent's tax answer is subtly wrong. The wrong human is just a very expensive probabilistic grader.

[Flowchart: Which grader does this judge need?]

### When each grader wins

Picking a grader is mostly a cost question: use the cheapest family that can actually see the property you care about.

| Family | Cost per trace | Wins when | Watch out for |
| --- | --- | --- | --- |
| Deterministic | Effectively free | The property is checkable in code: format, execution, exact values, tool calls | Overfitting to surface form; going stale when output formats change |
| Probabilistic | Cents | The property needs judgment but follows a rubric you can write down | Drift, bias toward verbose answers, uncalibrated confidence |
| Human | Dollars and days | Calibrating the other two, novel failure modes, final arbitration | The wrong humans, fatigue, disagreement between graders |

Lean deterministic whenever possible. A code-based check is the easiest grader to produce, the easiest to verify, and the only one that doesn't need a calibration loop of its own. If you can phrase a property as an assertion, do it. Save the LLM judge for properties code genuinely can't reach.

In practice, every real suite blends all three. Deterministic checks run on every trace in CI. LLM judges cover the judgment calls: tone, faithfulness, instruction-following. Humans calibrate the judges and spot-audit the results. And failure modes tend to migrate down the cost ladder over their lifetime: a human notices the problem, an LLM judge formalizes it, and once you understand the failure well enough, you compile it into fifty lines of Python and run it for free forever.

### Information asymmetry beats capability asymmetry

When an LLM judge underperforms, the default instinct is to swap in a bigger model. Most of the time that's the wrong lever.

**Judge quality comes from information asymmetry more than capability asymmetry.** A judge that sees the golden answer, the rubric, the full trace, and the tool-call ground truth will beat a smarter judge flying blind. Concretely: a same-class model judging with a reference answer in hand usually beats a frontier model judging reference-free, and costs a fraction as much. Checking a response against a known-good answer is a far easier task than deciding from scratch whether a response is good.

This is also why **reference-free judges** are the hardest family to calibrate. A judge with no golden answer, no execution result, no ground truth, just a rubric and vibes, has to be roughly as good at the task as the agent it's grading. At that point you haven't built an eval; you've built a second agent and declared it the referee.

The same principle shows up elsewhere in ML: on-policy distillation works because the teacher grades the student's own outputs token by token: dense feedback on exactly the behavior being trained, and grading is a much easier job than generating. Supervision quality tracks what the supervisor gets to see.

So before reaching for a bigger judge, inventory the privileged context you could hand the cheaper one: the golden answer from your dataset, the execution result, the actual tool responses, the user's eventual resolution. Each one converts an open-ended judgment into a comparison. And comparisons are what models grade well.

### Eval scores are samples, not truths

Run the same eval suite twice and you'll get two different numbers. Sampling temperature, nondeterministic inference backends, flaky judges, which examples you drew. All of it injects noise. The problem isn't the noise; it's reading a single pass rate as the truth instead of as one sample from a distribution.

Anthropic's "A statistical approach to model evaluations" lays out the fix, and it's ordinary statistics applied with unusual discipline. Treat each eval question as a draw from an underlying distribution, compute the standard error, and put **error bars** on every score you report. A dashboard that says 84% is hiding information; one that says 84% plus or minus 4 tells you what you can actually conclude.

When comparing two variants (model A vs model B, prompt v3 vs v4) use **paired comparisons**: run both on the same questions and analyze the per-question differences. The bulk of the variance, mostly question difficulty, is shared between variants, so pairing cancels it, and effects that look like noise in unpaired numbers become clearly significant.

The companion concept is the **minimum detectable effect (MDE)**: the smallest real improvement your eval can reliably distinguish from noise at your sample size. If your suite has 100 examples and the error bars span 8 points, a 2-point "win" is unreadable: your instrument cannot see effects that small. Compute the MDE before sizing your dataset, not after: decide the smallest regression you need to catch, then work backwards to how many examples that requires.

### Pass@K vs Pass^K

Two metrics, one keystroke apart, measuring opposite things.

**Pass@K** asks: out of K attempts, did at least one pass? It rises as K grows, and it measures the **capability ceiling**: can the system do this at all, given retries? It's the right lens for workflows with selection built in: code generation with reranking, best-of-n sampling, anything where a human picks the winner.

**Pass^K** asks: did all K attempts pass? It falls as K grows, and it measures **reliability**. The arithmetic is brutal. A step that passes 90% of the time gives you Pass@3 of 99.9%, and Pass^10 of about 35%. The same system is simultaneously "nearly perfect" and "fails most users," depending on which metric you read.

Agents in production need Pass^K thinking. A hotel-booking agent that's right 95% of the time per booking still wrecks one trip in twenty, and a user who books weekly should expect a failure inside six months. Multi-step workflows compound the same way internally: ten sequential steps at 95% each multiply out to roughly 60% end-to-end, before anyone retries anything.

Report both. Pass@K tells you whether the capability exists somewhere in the distribution, whether more prompting, tools, or selection could harvest it. Pass^K tells you whether to ship. Demos run on Pass@K; production runs on Pass^K, and confusing the two is how a flawless demo becomes a support-ticket queue.

### Pin everything

An eval you can't reproduce is a rumor with a decimal point. Before you trust any score delta, pin every input that could move the number:

- **Model versions.** Exact snapshot ids, never floating aliases. Providers swap what "latest" points to, and your three-point regression turns out to be their upgrade.
- **Sampling parameters.** Temperature, top-p, and seeds where the API offers them. A judge at temperature 1.0 is a different judge every run.
- **Prompts.** Agent prompts and judge prompts, version-controlled together. A one-line rubric tweak shifts pass rates as much as a model swap.
- **Datasets.** Version your golden set. Quietly appending ten hard examples moves every historical number and nobody remembers why.
- **Environments.** Sandbox images, API fixtures, mocked tool responses, anything the agent touches during the run.

The judge deserves extra paranoia. Upgrade the judge model and every historical score silently changes meaning, because the grader moved while the work stayed still. When you do upgrade, rerun the new judge over frozen historical traces and re-baseline before comparing anything across the boundary.

The test is simple: rerun last month's eval and get last month's number, within error bars. If you can't, you can no longer distinguish model regressions from harness drift, and your trend charts are archaeology rather than measurement.

Key takeaways:
- Three grader families in cost order: code, models, humans. Use the cheapest one that can see the property.
- Lean deterministic wherever a property can be phrased as an assertion; real suites blend all three.
- Judge quality is information asymmetry, not capability asymmetry. Hand the judge the answer key before buying a bigger model.
- Eval scores are samples, not truths. Error bars and paired comparisons before celebrating, and know your minimum detectable effect.
- Pass@K measures whether it can; Pass^K measures whether it will. Production agents live and die on Pass^K.
- Pin model versions, temperature, prompts, and datasets. An eval you cannot reproduce is a rumor.

Further reading:
- Anthropic: A statistical approach to model evaluations: https://www.anthropic.com/research/statistical-approach-to-model-evals
- Thinking Machines: on-policy distillation (Kevin Lu): https://x.com/thinkymachines/status/1982856272023302322

---

## 07. Building Judges (https://shipfastevals.com/building-judges)
Which judges deserve to exist?

There are two trains of thought when it comes to identifying relevant judges: Top-Down and Bottom-Up opportunity identification. Each can be used throughout the eval creation process, but they vary in effectiveness. Bottom-Up (error analysis) is best used when you already have production traffic to read; Top-Down (spec-driven) is best used when you are building ahead of the data, before anything has shown you where it breaks.

Once you have identified the opportunities, you must decide the type of judge to build and the granularity that will give you the greatest leverage. This step combines the eval goal and measurement grain decisions from earlier: a goal x grain matrix shows the best positions for actionable judges, and plotting your whole suite on it can reveal gaps or weaknesses. Something I have seen teams do at Meta is build a very comprehensive suite, yet it is only comprehensive along one axis or another. Not all sections of the matrix need to be filled for every opportunity, and identifying which judges are necessary is more art than science, but using FAST as our rubric helps filter out the weak candidates.

### Two ways to find a judge

Every judge candidate comes from one of two places: your data or your spec.

**Bottom-up** identification starts from real traces. You read what your product actually did, notice the ways it failed, group those failures, and propose judges for the groups that matter. The judge exists because the failure exists. You have the receipts.

**Top-down** identification starts from intent. You read the PRD, the system prompt, the tool definitions, the policy doc, and ask: what is this product supposed to do, and what would it look like to do it wrong? The judge exists because the requirement exists, even if nothing has violated it yet.

Neither direction is sufficient on its own. Pure bottom-up means your eval suite is a museum of last week's incidents; you're always one step behind whatever production surprises you next. Pure top-down means you're testing the brochure instead of the product, beautifully covering requirements while users hit failure modes nobody wrote down. Bottom-up keeps you honest. Top-down keeps you ahead.

In practice the two feed each other. Error analysis surfaces a failure cluster, which makes you realize the spec was silent on something, which makes you write the requirement down, which generates a top-down judge for the next feature before it ever ships. The direction a candidate came from doesn't decide whether it deserves to exist; that is a separate filter we'll come to last. But it usually shapes what kind of judge it becomes, and that is the next thing to pin down.

### Bottom-up: error analysis

The bottom-up process has a name, **error analysis**, and an evangelist. Hamel Husain has spent years beating one drum: look at your data. Not dashboards about your data. The actual traces.

The mechanics are unglamorous. Pull a sample of traces. Fifty to a hundred is a fine start. Read each one and write a short, open-ended note about anything wrong: "agent promised a refund it can't issue", "re-asked for the order number it already had", "quoted a price without checking availability". No taxonomy yet; just notes.

Then cluster. Similar notes become named failure modes, and suddenly you have counts. This is where intuition gets corrected: the failure everyone remembers from the angry customer email is often rare, while some quiet failure (the support bot silently dropping half of multi-part questions) turns out to be in a fifth of your traces. Counting beats vibes.

Only the clusters that matter become judge candidates. A cluster with three occurrences and no real consequence doesn't graduate. A cluster that's frequent, severe, or both becomes a real candidate, one that still has to earn its slot before it ships.

The most common eval failure isn't a badly built judge. It's a suite built without this step at all. Teams write judges for the failures they imagine instead of the ones they have, then wonder why the suite is green while users churn. Reading traces is the cheapest, highest-leverage activity in all of evals, and it stays that way forever: error analysis isn't a setup phase, it's a habit.

### Top-down: judges from the spec

Top-down judges are derived from what the product is supposed to do, and they're how you eval something with no production data: a new product, a new feature, a new tool the agent just grew.

The raw material is anything that encodes intent: the PRD, the system prompt, tool definitions, escalation policies, compliance rules. Walk through each commitment and turn it into a falsifiable check. A hotel-booking agent whose spec says "always confirm dates and party size before booking" and "never quote a price without checking live availability" has just handed you two judge candidates, and the product hasn't served a single user yet.

Top-down judges have a distinct character. They're **falsifiable from the spec**: you don't need an observed failure to define what failure looks like, because the requirement itself defines it. That makes them your early-warning system: they catch the violation the first time it happens, instead of the fifty-first time when error analysis finally surfaces the cluster.

Their weakness is the mirror image. Because they're born from imagination rather than evidence, some of them will guard against failures that never materialize. That's not free: every judge costs build time, run cost, and calibration attention. This is why variance is one of the gates a candidate has to clear, and why the question doesn't stop there: a top-down judge that has never once failed in three months of real traffic is a candidate for retirement or a tighter rubric, not a trophy.

So run both motions: spec-driven judges before launch, error analysis forever after. Every candidate from either direction still has to earn its place, but first it needs a label: what kind of judge is it?

### Regression judges and capability judges

Every judge worth building lands in one of two classes.

A **regression judge** prevents an identified, fixed error from coming back. It's almost always bottom-up: error analysis found the failure, somebody fixed it, and the judge now stands guard. The support bot used to promise refunds outside policy; you patched the prompt; the regression judge makes sure no future prompt edit, model swap, or retrieval change quietly reintroduces it. Its natural state is passing. That's not the useless kind of low variance: you can name the exact historical trace that would flip it, so there's a real failure it's still watching for.

A **capability judge** asserts something the product should be able to do, falsifiable from the spec. It's almost always top-down: the booking agent must confirm party size, the SQL agent must produce queries that execute against the real schema. Capability judges are where headroom lives. They're the ones you hillclimb against.

Both classes render **binary verdicts**: pass or fail, no 1-to-5 scales (the goals spoke makes the full case). Binary verdicts force falsifiability: a 3 out of 5 can't be proven wrong, but a fail on "did the agent check availability before quoting" can be settled by reading the trace. They also make aggregates mean something. An average score of 3.7 hides everything; an 84% pass rate means 16% of traces failed in a specific way you can pull up and read, one by one.

The classification isn't bookkeeping. It tells you how to interpret each judge's pass rate, and it's one axis of the matrix that shows you the shape of your whole suite.

### The Grain x Goal matrix

You weigh judge candidates one at a time, but suites fail as a whole. The tool for seeing the whole is a grid: **measurement grain** on one axis, **goal** on the other.

Grain is the unit a judge renders its verdict on, the altitude decision from the grains spoke:

| Grain | What it covers | Example judge |
| --- | --- | --- |
| Session | A whole multi-turn conversation | Did the user's issue get resolved by the end? |
| Trace | One full run for a single user turn | Did the agent's answer cite a retrieved document? |
| Span | One step inside a trace | Was the tool call schema-valid? |
| Event | One action inside a span | Does the SQL the agent wrote execute? |

Goal is the classification from the previous section: capability or regression.

Now plot every judge in your suite on this grid. Two pathologies jump out immediately.

**Gaps.** An empty cell is uncovered surface. No regression judges at the span level means a tool integration can silently break and the first you'll hear of it is a vague dip in coarser-grained scores, days later, with no pointer to the cause.

**Crowding.** Five session-level capability judges all sitting near 100% is ceremony, not measurement. They overlap, they've stopped varying, and they're burning run cost to tell you what you already know. Crowded cells are where you graduate saturated capability judges to regression duty, retire the redundant ones, and sharpen what's left into something with headroom.

Reviewing the matrix quarterly is cheap (it's a spreadsheet) and it turns "do we have good evals?" from a feeling into a picture.

The matrix tells you which cells deserve coverage. It doesn't tell you whether a particular candidate is worth building, and most candidates aren't. That is the final filter, and it is deliberately brutal.

[Flowchart: Positioning every judge in your suite]

### The judge-worthiness gauntlet

A candidate judge, wherever it came from, must pass four gates, in order. Most candidates should die here. That's the point: every judge you ship is a permanent line item of run cost and calibration attention, and the suite gets less trustworthy with every judge that doesn't earn its slot.

**Gate 1: Severity x Frequency.** Is this failure worth the cost of a judge at all? A failure that double-charges customers once a week clears the bar instantly. A formatting quirk in 0.01% of traces does not. If the product of how bad and how often doesn't justify build cost plus per-trace run cost plus upkeep, stop here.

**Gate 2: Variance.** Could this judge ever fail? If you can't describe a realistic trace that would flip the verdict, the judge measures nothing, a smoke detector with no battery. Note the test is *could* flip, not *does*: a judge that mostly passes can still clear this gate, but one that could never fail cannot.

**Gate 3: Actionability.** When it fires, can anyone do anything with the output? A verdict should point toward a prompt section, a tool definition, a retrieval step, somewhere. If the failure report would just say "be better", stop.

**Gate 4: Deterministic check.** Before you reach for an LLM judge, ask whether code can do the job. Did the SQL execute? Is the JSON valid? Did the agent call the refund tool without an approval flag? String matching, schema validation, and tool-call inspection are nearly free, perfectly reproducible, and never need calibration. The LLM judge is the fallback for genuinely fuzzy properties, not the default.

**The survivors are your suite.** Everything that makes it through already has a home: a goal and a grain, a single cell on the matrix from the last section. What's left is to actually build the thing, wiring the judge to your traces and your harness, which is the next spoke.

[Flowchart: Which judge should you build, or should you build one at all?]

Key takeaways:
- Judges come from two places: your traces (bottom-up error analysis) or your spec (top-down). You need both.
- Read your data. Clustered, counted failures beat the failures you imagine every time.
- Every candidate runs the gauntlet in order: severity x frequency, variance, actionability, deterministic-check-first.
- If no realistic trace could flip a judge's verdict, it measures nothing. That is what the variance gate kills.
- Prefer code-based checks; an LLM judge is the fallback for genuinely fuzzy properties, not the default.
- Plot every judge on the grain x goal matrix: gaps are uncovered surface, crowding is dead weight.

Further reading:
- Hamel Husain, "Frequently Asked Questions (And Answers) About AI Evals": https://hamel.dev/blog/posts/evals-faq/

---

## 08. Eval Design (https://shipfastevals.com/eval-design)
How do I actually build the judge?

Building the judge is as simple as synthesizing all the decisions you have made up until this point (Measurement grain, Eval goal, and Grader Type) then identifying how to wire the data from your traces to your evaluation harness.

There are a few more decisions you need to make before this runs automatically. Input design, scoring design, and harness x environment design are all prerequisites to building a good judge. Without these decisions in place, you will never know if you are measuring the agent, the model, the harness, or some messy combination of all three. In some cases you may end up with a weird variation of Wittgenstein's ruler, where you end up measuring the judge instead of the agent entirely.

Knowing where each of your evals stand is important as well. Eval cards capturing relevant information like the failure/success mode you are looking to measure, motivating traces, measurement unit, and pass criteria are useful for quickly re-orienting yourself when you come back to each judge as your eval suite expands.

### What exactly gets judged?

Every eval starts with the same question: what object does the judge actually look at? One span: a single tool call, one generated SQL query? A whole trace: the agent's complete attempt at a task? A session spanning several tasks? A diff: the change an agent made to a codebase? This is the **measurement unit**: the grain decision from the grains spoke, made concrete.

Getting it wrong is the most common way evals fail silently. You write down "the agent books hotels correctly," then build a judge that only reads the final chat message, and miss that the agent booked the right hotel for the wrong dates, because the confirmation message looked fine. Property and unit have to match: "writes executable SQL" is a span-level property, so judge the span. "Answered the analytics question" is trace-level. "Resolved the support issue without the user re-contacting" is session-level, and no amount of squinting at a single trace will measure it.

The unit determines everything downstream: what your input cases look like, what ground truth means, and what the grader is allowed to see. That's why it's the first box in the pipeline. Change it later and you're not adjusting the eval, you're rebuilding it.

[Flowchart: From property to shipped judge]

### Where the cases come from

Cases come from three places, and a healthy dataset usually mixes all three.

- **Real traces** mined from production. Maximum realism, zero authoring cost, but biased toward what users already do, which means biased toward the happy path. Users quietly route around the things your agent is bad at, so the failures you most need to test are underrepresented in prod data.
- **Hand-written cases.** Expensive per case, but the only way to guarantee a specific scenario exists: the sold-out hotel, the ambiguous date ("next Friday" said on a Thursday), the user who changes their mind mid-booking.
- **Synthetic variations.** Take a real or hand-written case and perturb it: shift the dates, change the party size, swap the city for one with no availability. Cheap multiplication of coverage, as long as you spot-check that the perturbations stay realistic.

The design principle that matters more than the source: every case should stress the property under test, and ideally nothing else. If the eval is "agent checks availability before booking," every case needs an availability wrinkle. Cases that also stress tone, currency conversion, and a flaky retrieval step don't test more; they make failures unattributable, which kills actionability.

And budget deliberately for edge cases. Sample uniformly from production and you'll get 95% happy path, and the eval will confidently report that your agent handles the easy stuff. You already knew that.

### Ground truth: golden answers, rubrics, or neither

Ground truth is whatever the grader compares the output against. Three strategies, in descending order of preference.

**Golden answers**: the exact expected result. The correct query output, the specific booking record, the right refund amount. Use these whenever the task has one verifiable right answer; nothing is cheaper to grade against or harder to argue with.

**Rubrics**: written criteria for when many outputs are acceptable. A support reply can be phrased a hundred ways, but it must state the refund deadline, must not promise compensation, and must not invent policy. The rubric pins the must-haves and leaves the phrasing free.

**Reference-free**: the judge grades the output with nothing to compare it to. This is a last resort, and the scoring spoke explains why: a good judge needs an information advantage over the model being graded. A reference-free judge knows nothing the policy model didn't, so it grades what it can see: fluency, confidence, formatting. That's how you end up with a judge that loves wrong answers delivered with conviction.

Two questions teams skip: who writes the ground truth, and who keeps it alive. The answer to the first is the person who would file the bug, a domain expert, not whoever had a free afternoon. The answer to the second is nobody, unless you make it someone: golden answers rot every time the product changes (new cancellation policy, new tool, new schema). Version the ground truth with the dataset and put a name next to it.

### Designing the grader

The scoring spoke covers which grader to pick: code, model, or human. This is about setting up whichever one you picked so it actually works.

**Deterministic graders**: make the check sharp and narrow. One assertion per check: the SQL executes, the JSON parses against the schema, the booking row exists with the right dates. The temptation is to pile assertions into one grader while you're in there. Resist it: ten assertions in one check is a composite score with extra steps, and when it fails you're back to grepping logs to find out why.

**Human graders**: brief them like new hires, not like Mechanical Turkers. That means a written rubric, three or four worked examples of passes and fails with the reasoning spelled out, and an explicit "unsure" option. Without the escape hatch, ambiguous cases get coin-flipped and your agreement numbers look worse than your rubric deserves.

**LLM-as-a-judge**: three rules.

1. The rubric lives in the prompt. If the criteria are in your head and the prompt says "rate the quality," the judge invents its own criteria, differently on every run.
2. Give the judge **privileged context**: the golden answer, the tool logs, the policy doc. A judge that sees only what the model saw is grading blind.
3. Force a binary verdict with a required reason. Pass or fail, plus why. Scales from 1 to 10 feel more informative but mostly add noise, and the written reason is what makes a verdict falsifiable when you audit the judge later.

### Outcome or process?

There are two things you can grade: where the agent ended up, or how it got there.

**Outcome-based grading** judges the final state. Did the correct booking land in the system, with the right dates and the right room? Did the query return the right rows? How the agent got there is its own business.

**Process-based grading** judges the path. Did the agent check availability before booking? Did it confirm the cancellation policy with the user before charging the card?

| | Outcome | Process |
| --- | --- | --- |
| Judges | the final state | the steps taken |
| Survives | prompt rewrites, model swaps, new strategies | very little: it pins the current implementation |
| Catches | failures that show up in the result | failures the result hides |
| A failure means | something broke, somewhere | this specific step broke |

Outcome grading is more robust: the agent can reorganize its entire approach and the eval still measures the right thing. Process grading is more actionable: when it fires, you know exactly which step to fix. The cost is brittleness: pin the path too tightly and the eval starts punishing legitimate improvements. And once step-compliance becomes the metric you hillclimb, Goodhart shows up: the agent gets optimized to perform the prescribed steps whether or not they still serve the outcome.

Default to outcome grading wherever the final state is checkable. Add process checks for the things outcomes can't see: safety-critical steps (did it confirm before charging the card?) and lucky guesses: an agent that skips the availability check and gets lucky passes every outcome eval right up until the day it doesn't.

### Environment vs harness

Two pieces of machinery get conflated constantly, and the conflation produces evals that lie.

The **environment** is the world the agent acts in: the tools it can call, the sandbox it runs in, the data it operates on: the fake hotel inventory, the seeded database, the mock payment API. The **harness** is everything wrapped around that world: the code that spins up an episode, injects the task, enforces timeouts, collects the trace, and hands it to the grader.

The environment should be as realistic as you can afford: same tool definitions as production, data with the same mess in it. The harness should be boring: deterministic, fast, and invisible in the results.

Conflate them and you get evals that test the harness instead of the agent. The sandbox is flaky, the seed data has no hotels in the requested city, a tool times out, and the run gets recorded as the agent failing. Now your pass rate measures infrastructure weather. Anthropic's agent-eval guide flags the same trap: infrastructure flakiness produces correlated failures that look like agent regressions, and a 0% pass rate usually means a broken task, not an incapable agent. Track environment and harness problems separately from real failures: a task the world made impossible tells you nothing about the agent.

The practical test before reading any numbers: could a perfect agent pass every case? If not, fix the environment first.

### Pressure-test it, then write the card

A judge you haven't attacked is a judge you're trusting on vibes. Before any eval gates a release, **pressure-test** it:

- **Feed it known-good cases**: traces a domain expert already blessed. Every false failure here is calibration debt you'll pay later in ignored alerts.
- **Feed it known-bad cases**: real failures from production, plus hand-built ones. A judge that has never seen a true failure has an unknown catch rate, which is the same as no catch rate.
- **Try to fool it.** A long, confident, well-formatted answer that's wrong. A terse correct one. A response that quotes the rubric back at the judge. If any of these flips the verdict, you've found what the judge actually measures.
- **Hunt for shortcut features.** Correlate verdicts with surface features: length, politeness, presence of a summary section. If verbosity predicts passing, you've built a verbosity detector with a quality-shaped name.

Then write the **eval card**, a one-pager that ships with the eval, the suite's equivalent of a model card:

- What property it measures, and at what grain
- Which goal it ties back to
- Grader type and where the ground truth came from
- Dataset provenance and size
- Known blind spots: everything the pressure test exposed
- An owner, with a name

The card exists for the person six months from now staring at a pass-rate dip, deciding whether to trust the number or the model. Without it, that decision is archaeology. With it, it's a two-minute read.

[Flowchart: A filled-in eval card]

Key takeaways:
- Pick the measurement unit first. Property and grain have to match, and everything downstream depends on it.
- Every input case should stress the property under test. Production traces alone over-represent the happy path.
- Golden answers beat rubrics beat reference-free. A judge with no information advantage grades on style.
- LLM judges get the rubric in the prompt, privileged context, and a forced binary verdict with a reason.
- Default to outcome grading for robustness; add process checks for what the final state cannot show.
- Attack every judge before trusting it, then ship it with an eval card: property, grain, grader, blind spots, owner.

Further reading:
- Anthropic: Demystifying evals for AI agents: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

---

## 09. Judge Calibration (https://shipfastevals.com/calibration)
Is this judge actually measuring what we want it to?

This is essentially the question of internal validity from experimentation reframed for Agents. Trusting that your judges are actually measuring what you need can help when determining whether to prioritize fixing your evals versus pivoting on a product decision. Documenting and communicating this kind of calibration is especially useful when you inevitably find a low scoring regression judge. In cases where the judge is not trusted, the conversation typically devolves into a finger-pointing, loudest-voice competition where the agent builders believe the judge is wrong and the eval builders believe the product is wrong. Played incorrectly, this can burn trust in the evals process completely.

What separates token burning from useful judges typically comes down to how much effort you are willing to put into calibrating your judges. Pressure-testing your created evals should be considered a hard and fast rule for any judge. This can be as simple as manually looking through a few traces or as complex as hiring an independent team to score hundreds of sessions. Documenting the level of rigor and scores can help communicate the level of trust in the judge much more precisely than a binary yes/no. In practice, the comprehensiveness of calibration should scale with the severity of the opportunity (e.g., a flagship capability should likely have more intense judges/pressure testing than a deterministic length checker).

### Why calibrate at all

Your judge is an LLM grading another LLM. Nothing about that arrangement guarantees the grades mean anything. The judge will happily produce confident verdicts with consistent formatting and plausible reasoning, none of which tells you whether it's catching the failures you actually care about. An **uncalibrated judge** is a random-ish number generator with good vibes.

**Calibration** is the fix, and the idea is simple: take traces that trusted humans have already labeled, run the judge on the same traces, and measure how often they agree. The judge is a model, so you evaluate it like one, against ground truth.

Skip this step and you get a familiar failure pattern. The team ships a hallucination judge, the dashboard goes green, everyone relaxes, and support tickets about made-up refund policies keep arriving. The judge was measuring *something* (maybe "does the response cite a source", maybe "does it sound confident"), just not the thing on the label. Now you have two problems: the original failure mode, plus a dashboard actively telling you it doesn't exist.

This spoke is the third stage of the loop for a reason. You defined what to measure, you designed a judge to measure it. Calibration is where you find out whether the judge you built is the judge you designed. Until then, every pass rate it produces is a rumor.

### The input side: spot checks and labeling passes

Calibration needs human labels, and there are two ways to get them: one cheap and continuous, one structured and occasional.

The cheap one is **spot checking**: regularly pull a sample of traces and read them next to the judge's verdicts. Twenty or thirty a week is plenty. You're doing error analysis on the judge itself. Every disagreement between your read and the judge's verdict is a calibration data point, and the pattern of disagreements usually points straight at the broken part of the rubric. Most judge bugs are found this way, by a human going "wait, that's not a hallucination." There is no substitute, and there is nothing cheaper.

The structured one is a **labeling pass**, and the big labs run these as standing programs. You don't need their headcount; you need the shape:

- A **written rubric**, ideally the same one your judge prompt is built from, so humans and judge are grading the same thing.
- **Trained labelers** who've worked through example traces together before labeling solo. "Trained" can mean two engineers and an hour of arguing over ten traces.
- **Overlap**: have multiple people label the same subset so you can check inter-rater agreement (more on that below).
- **Disagreement resolution**: when labelers disagree, fix the rubric so the next person wouldn't.

The output is a labeled set you trust. That set is the ground truth everything in the next two sections gets computed against.

### The output side: the confusion matrix family

Once you have labeled traces, judge calibration becomes an ordinary classification problem, and the **confusion matrix** metrics apply directly.

Concrete example: a hallucination judge, run on 100 traces your team labeled by hand. Humans found 40 hallucinations and 60 clean traces. The judge flags 38 traces: 32 are real hallucinations (true positives), 6 are clean traces wrongly flagged (false positives). It misses 8 real hallucinations (false negatives) and correctly passes 54 clean traces (true negatives).

| Metric | The question it answers | Here |
| --- | --- | --- |
| **Accuracy** | How often is the judge right overall? | 86% |
| **Precision** | When the judge fires, how often is it right? | 32 of 38 = 84% |
| **Sensitivity (TPR)** | Of the real failures, how many does it catch? | 32 of 40 = 80% |
| **Specificity (TNR)** | Of the clean traces, how many does it correctly pass? | 54 of 60 = 90% |

Which number you optimize depends on what the judge is for. If it gates releases, sensitivity is king. Every missed failure ships to users. If it files bug reports to engineers, precision is king. False alarms burn trust fast, and a judge that cries wolf gets muted. Most judges need a floor on both, but you should be able to say which one you'd sacrifice first.

One number conspicuously absent from the priority list: accuracy. The next section is about why.

### Base rate neglect, or why accuracy lies

That example had a 40% failure rate, which made the arithmetic friendly. Production isn't friendly. Real failure rates for a specific failure mode are often 1-5%, and at those rates, headline accuracy becomes actively misleading. This is **base rate neglect**, and it bites nearly every team the first time.

Walk the numbers. Your product produces 1,000 traces; 2% contain the failure: 20 bad traces, 980 clean. Your judge is genuinely good: 95% sensitivity, 95% specificity. Roughly 95% accurate. Sounds shippable.

Now watch it run. It catches 19 of the 20 real failures, great. But it also wrongly flags 5% of the 980 clean traces: 49 false alarms. So the judge fires 68 times, and it's right 19 of them. **Precision: 28%.** A "95% accurate" judge that's wrong nearly three times out of four when it fires. Your engineers triage the first dozen flags, find mostly noise, and quietly stop reading the channel. Two weeks later the judge catches something real and nobody looks.

What to do about it:

- **Report precision at the operating point**, not accuracy. "When this fires, it's right X% of the time" is the number people actually experience.
- **Push specificity hard.** At a 2% base rate, each point of specificity removes about 10 false alarms per 1,000 traces; each point of sensitivity recovers 0.2 real failures. The leverage is lopsided.
- **Match the workflow to the precision.** A 28%-precision judge is a fine triage filter feeding human review. It is not an auto-filer of bug tickets.

### Agreement metrics: Cohen's kappa

Raw percent agreement has the same disease as accuracy: it's inflated by chance, badly so when labels are skewed. At a 2% failure rate, a "judge" that stamps PASS on everything agrees with humans 98% of the time while catching zero failures. **Cohen's kappa** corrects for this: it measures agreement beyond what you'd expect from chance given how skewed the labels are, where 0 means no better than chance and 1 means perfect. That all-PASS judge scores exactly 0.

Rough working bands, with the usual caveat that the cutoffs are conventions, not laws:

- **Below 0.4**: the judge is closer to noise than signal. Don't automate anything with it.
- **0.4 to 0.8**: real but unreliable agreement. Read the disagreements and fix the rubric or prompt before trusting it.
- **0.8 and up**: the judge agrees with your humans about as well as careful humans agree with each other. This is the zone where automating the measurement is defensible.

And that comparison point matters more than the bands: **measure human-human kappa first**. Have two trained labelers grade the same traces independently and compute kappa between *them*. If your humans only hit 0.5 against each other, the judge has no target. No prompt engineering will make a model agree with a ground truth that doesn't hold still. That's a rubric problem, or a sign the property isn't falsifiable yet, and it sends you back to the FAST gates rather than back to the judge prompt.

Human-human agreement is your judge's ceiling. A judge whose kappa against humans approaches the humans' kappa against each other is done; past that you're tuning noise.

### The judge calibration loop

Calibration isn't a ceremony you perform once at launch. It's a loop:

1. **Label** a sample of fresh traces with humans, against the written rubric.
2. **Measure agreement**: kappa against the human labels, plus precision and sensitivity at your operating point.
3. **Read the disagreements.** Every judge-versus-human mismatch is a clue. Usually the fix is in the rubric or the judge prompt: an ambiguous criterion, a missing edge case, a few-shot example teaching the wrong lesson.
4. **Fix and re-measure**, on a *fresh* labeled sample, not the one you just tuned against. Tuning and measuring on the same traces is how you overfit a judge and ship the overconfidence.
5. When agreement clears your bar, **automate**, and keep a standing weekly spot-check sample so drift gets caught.

Then recalibrate. On a schedule, sure, monthly, or every N-thousand traces. But the mandatory triggers are events, not dates:

- The **judge's model** changes (provider upgrade, model swap, even a quiet API revision).
- The **judge's prompt or rubric** changes, including "minor wording cleanups".
- The **product shifts under the judge**: new feature, new user segment, an upgrade to the application's own model. New distribution of traces means the old calibration sample no longer represents what the judge sees.

A judge calibrated in March against a product that swapped models in May is measuring last quarter's product. The dashboard will keep updating daily either way, which is exactly the problem.

[Flowchart: The judge calibration loop]

Key takeaways:
- An uncalibrated judge produces confident verdicts that mean nothing. Compare it to human labels before trusting a single pass rate.
- Reading traces next to judge verdicts is the cheapest calibration there is. Do it weekly, forever.
- Sensitivity is the share of real failures caught; precision is how often a firing judge is right. Know which one you would sacrifice first.
- At a 2% failure rate, a 95%-accurate judge is wrong most of the times it fires. Precision at the operating point beats headline accuracy.
- If humans cannot agree with each other, the judge has no target. Measure human-human kappa before judge-human kappa.
- Recalibrate on a schedule, and always when the judge model, the judge prompt, or the trace distribution changes.

Further reading:
- Hamel Husain, "Frequently Asked Questions (And Answers) About AI Evals": https://hamel.dev/blog/posts/evals-faq/

---

## 10. Eval Infrastructure (https://shipfastevals.com/infrastructure)
How do I do this at scale?

Most evals start as a script run locally on someone's machine. Designing the proper eval infrastructure is highly contingent on the type of product that you are building. A simple chat app with input-output could theoretically stay as a simple script forever without much drag from not building a full suite. As your product and your eval harness get more complex, this becomes much less tenable.

The primary questions to keep in mind when building out your infrastructure are related to parity and information asymmetry:

- **Parity**: Does my evaluation harness accurately reflect the agent harness as it currently stands today?
- **Information Asymmetry**: Does my infrastructure insert knowledge that will make it a better judge of the output than simply re-running the agent again?

To borrow from OOP, your evaluation suite should consist of multiple instances of Class Eval which should take in [Judge(s), Dataset(s), Agent Harness] and should output Eval Cards. Creating a reproducible framework makes it easier to maintain and standardizes the judge building methodology. Without this, each judge adds non-trivial maintenance overhead.

In addition to offline infrastructure, it is often useful to maintain online evals. These catch regressions earlier and can be influential in identifying the KPI impacts that judges may be able to serve as proxies for in the future.

If you have the resources, rolling your own eval infrastructure will likely be more useful in the long run. With full control over the entire process, you can customize it to your exact needs without any baggage. This is much more doable with high-powered long-running coding agents like Mythos/Fable that can be (mostly) trusted to keep things secure and up-to-date. That being said, there are many excellent frameworks out there that can serve as great starting points.

### Class Eval: standardize the interface

Treat your suite as many instances of **Class Eval**, not a pile of one-off scripts. Every instance is constructed from the same three arguments: a **judge** (or several), a **dataset**, and an **agent harness**. Run it, and it reports in one standard shape: pass rates, failures, and links to the traces behind each verdict.

The point of the framing is the interface. Your first eval is expensive no matter what: you build dataset loading, episode execution, verdict collection, and reporting from scratch. The real question is whether your tenth eval is also expensive. If every eval is bespoke (its own runner script, its own output format, its own way of finding traces) the marginal cost never drops, and your suite stops growing exactly when your product needs it to grow fastest.

Standardize the constructor and everything downstream gets cheap. A new eval becomes a config change: write the judge, point it at a dataset, reuse the harness. Datasets become swappable. The same judge can run against your golden set nightly and against sampled production traffic continuously. And because every eval reports in the same shape, results are comparable across the suite: a regression in the SQL checker reads the same as a regression in the tone judge.

This is the same bet the RL community is making with standardized environments: Prime Intellect's Environments Hub treats evals and training environments as one interface precisely so they compose. You don't need their infrastructure, but you do want their discipline.

### Golden and synthetic datasets

Every instance of Class Eval needs a dataset, and datasets come in two trust tiers.

**Golden datasets** are small, hand-verified, and expensive per row. Every case has been checked by a human who can defend it. This is the bar for correctness: golden sets calibrate your judges, gate your releases, and settle arguments. Because human verification is slow, they stay small (tens to low hundreds of cases) and that's fine. Their job is to be right, not big.

**Synthetic datasets** are how you scale past what humans can hand-verify, and they come in two flavors with very different risk profiles:

- **Creation** generates cases from scratch: prompt a model to write 200 hotel-booking requests with conflicting constraints. Fast and unlimited, but riskier: the generator invents users who don't exist, asking questions nobody asks, in phrasing nobody uses.
- **Expansion** mutates real traces into variants: take an actual booking request and change the dates, swap the city, add a typo, tighten a constraint. It stays anchored to the real distribution, which makes it the safer default.

Either way, synthetic data needs spot-checked human review on every batch. Skip it and the dataset quietly drifts off-distribution, and from then on your pass rates measure performance on an imaginary product. The numbers will look stable. They'll just be about nothing.

### The offline and online eval stacks

Eval infrastructure splits into two stacks that answer different questions.

The **offline stack** runs before users see anything: episodes executed against datasets, inside sandboxes, on demand or on a schedule. It answers "did this change make the agent better on the cases we know about?" It's controlled and repeatable, the same input produces a comparable run tomorrow, which is exactly what you need for hillclimbing and for catching regressions in CI.

The **online stack** runs where your product runs: signal mined from live traffic, real users, and real consequences. It answers "is the product actually working?" It's noisy and slow but unfakeable: no synthetic dataset argues back the way a user abandoning a session does.

Neither stack is sufficient alone. Offline-only teams overfit to their datasets and ship confidently into failure modes nobody wrote a case for. Online-only teams use their users as the QA department and find out about regressions from churn curves. The two are supposed to feed each other: every interesting production failure becomes an offline dataset case, and every offline judge worth its keep eventually gets attached to sampled live traffic. The next two sections walk through each stack's parts.

[Flowchart: The offline and online eval stacks]

### Inside the offline stack

The offline stack is everything that lets you run episodes before users see them. Four pieces matter.

**Scaffolding.** Your **agent harness** is the code that runs your agent in production: prompts, tools, the loop. Your **eval harness** is the code that runs episodes against a dataset and collects verdicts. Keep them separate, but make the eval harness invoke the real agent harness rather than re-implementing it. The re-implementation foot-gun is everywhere: an eval runner with its own copy of the system prompt, frozen from three weeks ago, quietly measuring an agent you no longer ship. When the harnesses drift, every number your suite produces is about the wrong agent.

**State and sandboxes.** Agents mutate things: they write rows, send emails, file tickets. Every episode needs a fresh, hermetic world: a seeded database, mocked or sandboxed external services, a throwaway filesystem. Without isolation, episode 14 fails because episode 13 left a reservation in the table, and you burn an afternoon debugging a bug that doesn't exist.

**Distribution.** Episodes are independent, so fan them out in parallel. A suite that takes four hours gets run weekly; a suite that takes ten minutes gets run on every change. Parallelism is the cheapest eval-quality investment you'll make.

**Scheduling.** Two cadences cover most teams: a nightly full suite over everything, and a per-PR smoke suite: a fast, high-signal subset that gates merges. The smoke suite catches the regression before it lands; the nightly run catches whatever the smoke suite missed.

### Mining production for signal

Offline evals tell you about the cases you thought of. Production tells you about everyone else. Four sources, roughly in ascending order of volume:

**A/B testing** is the ground-truth eval. Ship the change to a slice of traffic and measure real outcomes: bookings completed, tickets resolved, sessions retained. It's expensive, slow, and needs enough traffic to reach significance, which is exactly why you can't use it for everything. But when offline numbers and A/B results disagree, the A/B is right.

**Semantic signals** are users grading your agent in plain text: "that's wrong", "no, I said Tuesday", "let me talk to a human". Mine your message logs for these. They're free labels on real failures, written by the people your evals exist to satisfy. Users are sometimes wrong about the facts, but they're never wrong about being frustrated.

**Action signals** beat sentiment. A regenerate click, a session abandoned three steps into a booking, an answer copied and then immediately rewritten. Behavior is honest in a way feedback forms aren't. The users most worth hearing from quietly leave without typing anything.

**Product observability** closes the loop: attach eval verdicts to live traffic. Run judges (sampled, see the economics below) over production traces, so that when something fails it arrives in your tracing tool already labeled, sitting next to the exact trace that produced it. Tools like Raindrop are built around this pattern: production traffic treated as a continuously judged dataset.

### Eval economics

Judges cost money per trace, and production traces arrive by the million. A judge at $0.002 per trace sounds free until you multiply: five judges across a million daily traces is $10,000 a day. Tractability doesn't stop at build time: the same T in FAST that decided whether a judge was worth building now decides whether you can afford to keep running it.

Three levers keep the bill sane:

- **Sample asymmetrically.** Judge 100% of suspected failures (they're rare, and they're the point) and sample successes at a few percent, enough to track the pass rate without paying for every confirmation that things are fine.
- **Gate with deterministic checks.** Run cheap code-based graders first: did the SQL execute, did the response contain a confirmation number, did the agent actually call the refund tool. Only traces that clear the cheap tier and still look ambiguous earn an LLM judge's attention.
- **Right-size the judge model.** Most judges don't need a frontier model. A small model with a tight rubric, calibrated against your golden set, often matches the big model's agreement rate at a tenth of the cost.

Across all three levers the goal is the same: spend judge budget where a verdict can change a decision, and nowhere else. A judge you can't afford to run is a judge you don't have.

Key takeaways:
- Every eval is an instance of Class Eval: judge + dataset + agent harness in, one standard report out. Get the interface right and the next eval gets cheaper.
- Golden data sets the bar; synthetic data scales it. Spot-check every synthetic batch or it quietly drifts off-distribution.
- Keep the eval harness separate from the agent harness, and make it call the real one. Drift between them means measuring an agent you do not ship.
- Agents mutate state, so every episode gets a fresh, hermetic world. Leftover state turns real signal into phantom bugs.
- In production, behavior beats sentiment: regenerate clicks and abandoned sessions are eval signal, not just product metrics.
- Judge cost times traffic is real money. Gate with deterministic checks, judge all failures, sample successes.

Further reading:
- Prime Intellect: Environments Hub: https://www.primeintellect.ai/blog/environments
- Ben Hylak / Raindrop: How to Eval: https://www.howtoeval.com/

---

## 11. Communicating Results (https://shipfastevals.com/communicating-results)
How do we communicate what we have learned?

Much like traditional Data Analytics roles, finding the information is only half of the battle. The results of your Evals must be communicated at varying levels of nuance and length to stakeholders all over your organization. To simplify this, make use of dashboards and integrate alerting as early as possible. There is nothing worse than running into something failing in production that should have been caught by evals days ahead of time.

One failure mode that I have seen is a tendency for non-technical stakeholders to "reinvent evals from first principles". They tend to see product metrics as the golden truth and evals as an inferior substitute. Unfortunately for LLM-based products, doing enough root cause analysis will require them to push for more and more complex metrics until the only solution becomes building LLM judges. In these cases it would have been simpler and more useful to start with the Eval framework we have established here.

### The deliverable is a decision

You did everything right. Judges built, calibrated against human labels, running on production traces, pass rates landing in a database. And then the numbers sit in a dashboard with four monthly visitors, and the product stays exactly as broken as it was. An eval suite nobody reads is expensive decoration.

The fix is a reframe: the deliverable is not a number, it is a **decision someone can make from the number**. Every metric you publish should come with an implied reader and an implied reaction. A regression judge dips, and the on-call engineer investigates. A capability judge saturates, and the team graduates it and picks the next hill. A new failure mode shows up in the error analysis, and someone writes a judge for it. If you can't name who reacts to a metric and what they do when it moves, you haven't built communication. You've built telemetry, and telemetry without a reader is a write-only database.

This is the test to run on every readout in this spoke. Not "is this number accurate?": that was the job of calibration. The question here is "did this number cause anyone to do anything?" The rest of this page covers the four channels where that handoff happens: dashboards, CI gates, monitoring of the evals themselves, and the humans on the other end.

### Two dashboards, never one

The goals spoke split every judge into **capability** or **regression**. Your dashboards have to honor that split, because the two kinds of numbers demand opposite readings.

A **regression dashboard** should be boring. Every panel near 100%, all green, and wired to alert on dips. Nobody should be studying it; the dashboard studies itself and pages someone when a line moves. If your regression dashboard is interesting, you have an incident.

A **capability dashboard** is the opposite: trend lines climbing toward goals, meant to be read weekly, by humans, looking for slope. Flat lines are the news here: a capability judge that hasn't moved in six weeks means your changes aren't touching that capability.

And never blend the two into one composite health score. A regression judge dropping from 99 to 94 (an incident) and a capability judge climbing from 64 to 69 (a good week) average out to a flat line that says nothing happened.

Two more requirements for either dashboard:

- **Slice it.** A single top-line pass rate hides everything useful. Slice by judge (which property moved), by grain (did sessions degrade while spans held), and by segment (your support bot can be flat overall while the enterprise segment quietly tanks and free-tier improves). The top-line number answers "are we okay?"; the slices answer "where do I look?"
- **Show error bars.** With a few hundred cases behind a judge, a 1-point move is usually noise. Plot the confidence interval, not just the point estimate, so a wiggle inside the band doesn't get read as signal. A team that celebrates +1 one week and panics at -1 the next is reading the same noise twice and reacting both times.

### Evals in CI: gates, not suggestions

Your regression suite is the unit-test analogue for model behavior, and it should run like one: on every change that can alter what the product does. That list is longer than people expect: prompt edits, tool definition changes, model version upgrades, retrieval tweaks, temperature changes. A prompt edit feels like a copy change. It is a code change to the most load-bearing code you have, and it merges ungated at your peril.

Running every judge on every PR is usually too slow and too expensive, so tier it:

- **Smoke suite, per PR**: your highest-severity regression judges on a small case set. Cheap graders where possible, minutes not hours. This is the merge gate.
- **Full suite, nightly**: everything, all judges, full datasets, the expensive LLM-judged cases. Failures here open tickets and land on the regression dashboard by morning.

The non-negotiable part: the per-PR gate **blocks**. A warning gate trains people to ignore it: the first yellow banner gets investigated, the tenth gets scrolled past, and by the twentieth your suite is background noise that costs money. If a regression judge fails, the merge stops, exactly like a failing unit test.

When the gate fires, there are only two honest outcomes. Either it caught a real regression, and you fix the change. The gate just paid for itself. Or the judge's verdict is wrong, and you fix the judge: update the rubric or case, recalibrate, and note why. What you never do is toggle the gate off to ship. Every bypass teaches the team the gate is optional, and an optional gate is a warning with extra steps.

[Flowchart: When the merge gate fires]

### Eval observability: who watches the judges?

Your evals are production software, and they fail like production software. A suite with no monitoring of its own degrades silently: the dashboard keeps rendering, the numbers keep updating, and what they describe drifts further from reality. Three meters to keep on the suite itself:

**Latency.** Slow judges back up the pipeline. If judging can't keep pace with trace volume, traces queue, and then they get dropped, often silently, and often the longest, gnarliest traces first, which are exactly the ones most likely to contain failures.

**Error rates.** Judges crash. APIs time out. The judge model returns something your parser can't read. Every errored trace silently shrinks your sample, and the dashboard doesn't show the shrinkage. It shows a pass rate over whatever survived. A judge that errors on 30% of traces and passes 95% of the rest is not reporting 95%. It's reporting 95% of a biased remainder, because the traces that crash a judge (long ones, malformed tool outputs, weird encodings) correlate heavily with the traces that fail it.

**Sampling rates.** Almost nobody judges 100% of production traffic; cost forces a sample. Fine, but know the fraction, and know whether it's biased. If you judge 5% of traces and the sampler skips long sessions to control spend, your numbers describe a product your heaviest users never see.

The common thread: every one of these failures makes the dashboard look healthier, not sicker. Dropped traces, errored judges, and skewed samples all remove hard cases from the denominator. Green is not the same as good.

### The human layer

The last hop is the one where most suites die: the handoff from dashboard to human. Three habits keep it intact.

**One definition per metric, shared.** If the PM, the engineer, and the on-call each carry a private definition of "resolution rate" (solved in one session? user didn't return for a week? judge said resolved?), they will argue fluently about a number that means three different things. Every metric name should link to exactly one written definition: the rubric, the grain, the dataset. When someone asks "what does this measure?", the answer is a link, not a meeting.

**Label capability vs regression everywhere the number travels.** Not just on the dashboard: in the Slack alert, the weekly email, the launch review slide. The same 96% is an alarm on a regression panel and a strong quarter on a capability panel, and a reader who can't see the label will pick the flattering interpretation.

**Ship the eval card with the number.** An eval card is a short standing document per judge: what it measures, at what grain, the dataset behind it (size, source, last refresh), calibration stats against human labels, known blind spots, and an owner. The habit that matters: whenever a number leaves the dashboard (exec email, launch doc, board slide) the card goes with it. A number traveling without its card invites the reader to imagine what it means, and they will imagine wrong, usually in whichever direction the meeting needed.

This is the report stage of the loop for a reason. The readout is where measurement either becomes work on the product or becomes a PDF.

Key takeaways:
- The deliverable is a decision, not a number: every metric needs a named reader and an implied reaction.
- Regression dashboards should be boring and alarmed; capability dashboards should trend. Never average them together.
- No error bars, no interpretation: a 1-point move on a few hundred cases is usually noise.
- CI gates block, never warn. A warning gate trains everyone to scroll past it; a bypassed gate is a warning with extra steps.
- Monitor the evals themselves: judge latency, error rates, and sampling bias all make dashboards greener while making them mean less.
- One shared definition per metric, capability-vs-regression labels everywhere, and the eval card travels with the number.

---

## 12. Iterations (https://shipfastevals.com/iteration)
How do we use evals to improve our product?

Connecting everything back to the product is possibly the most important piece within this guide. Ideally, this should create a loop where feedback from the evals creates opportunities to improve the product which creates opportunities to get more feedback from evals and so on. This can look like experimentation, recursive self-improvement, or something as simple as automatic prompt automation.

The real magic happens when this loop gets tighter and faster. The goal here is to make meaningful improvements based on your eval outcomes every week if not every day. Deeply integrating evals into your product development lifecycle can help accelerate this loop even faster.

### The steering wheel, not the report card

It's easy to build an eval suite and then use it like a report card: run it weekly, screenshot the dashboard, feel good (or bad), change nothing. That's the most expensive way to own evals: you pay the full cost of building and calibrating judges and collect almost none of the value.

The suite's real job is to answer one question, over and over: **of these two versions of the agent, which one should exist?** An absolute score is nearly meaningless on its own. "87% pass rate" tells you nothing about whether your new prompt is an improvement; "87% versus the baseline's 82%, on the same cases, outside the error bars" tells you exactly what to do next.

This reframing changes how you treat every number. A score isn't a grade to be proud of; it's one side of a comparison waiting for its other half. If a judge's output never changes a decision about what ships, the judge is decoration. And it changes what "done" means: an eval suite isn't finished when the dashboard is green; it's finished when it's the default way your team settles arguments about what to change.

Every spoke before this one was about making the measurements trustworthy. This one is about the payoff: pointing those measurements at candidate changes and letting them drive.

### The offline experimentation loop

The core loop is dull on purpose:

1. **Propose a change.** A reworded system prompt, a new tool definition, a different model, tighter retrieval chunking, anything you suspect might help.
2. **Run the suite** on the candidate and the baseline, same cases, same judges.
3. **Compare with error bars.** A 2-point bump on 50 cases is one flipped case, noise. Both runs share the same cases, so compare paired: put the interval on the per-case difference, not on each score separately. If zero sits inside that interval, you don't have a result yet.
4. **Ship or discard.** Either way, you learned something. Log it.

What this buys you is the upgrade from opinion to evidence. "I think the new prompt handles refunds better" starts a debate; "the refund-policy judge went from 71% to 90%, well clear of the noise" ends one. Judges are how prompt engineering stops being vibes.

Two habits keep the loop honest. First, change one thing at a time when you can. If you swap the model and rewrite the prompt together, a regression has two suspects. Second, watch how often you peek. Every time you check the suite, tweak, and check again, you leak a little information about the test set into your changes. Do it fifty times and you've quietly overfit to your eval cases: the suite says you improved, production says otherwise. Keep a held-out slice you only touch before shipping. The foot-guns spoke covers how badly this goes when ignored.

[Flowchart: The eval-driven improvement loop]

### Closing the loop: judges as optimization signal

Once judges are programmatic, a human doesn't have to be the one proposing changes. Judge scores are a signal any optimizer can climb.

The simplest version is **automatic prompt optimization (APO)**: an outer loop generates prompt variants, your suite scores them, the best survive. GEPA-style evolutionary search takes this further (mutate prompts, evaluate, select, repeat) with your eval as the literal fitness function. These methods routinely find prompts no human would write, because they're searching a space humans get bored in after four attempts.

One level up are **auto-research loops**: an agent reads failing traces, hypothesizes a fix (a prompt edit, a tool description tweak, a new few-shot example), applies it, and lets the suite render a verdict. Run that overnight and you wake up to a ranked list of candidate improvements with scores attached. It's the experimentation loop from the previous section with the human moved from operator to reviewer.

The governing rule: **the stronger your judges, the more autonomy you can hand the loop.** With sloppy judges, automated optimization is a machine for manufacturing regressions that look like wins. And the pressure is asymmetric: a human iterating on prompts stumbles into a judge's blind spot occasionally; an optimizer running thousands of evaluations will find every blind spot and move in. This is Goodhart's law with a search algorithm behind it. Reward hacking isn't a hypothetical failure mode of frontier labs; it's what happens to your refund-policy judge on iteration 600.

### Evals are environments

Look at what a good agent eval already contains: a task definition, a sandboxed place for the agent to act (a mock booking API, a scratch database, a containerized filesystem), and a **verifier** that checks the outcome. Now look at what an RL environment needs: a task, a place to act, and a reward signal. Same parts. A well-built eval is most of an RL environment with the training loop left off.

The labs have stopped treating these as separate artifacts. Prime Intellect's verifiers framing makes it explicit: an environment is a task plus a verifiable reward, and the same object serves evaluation, RL training, and synthetic data generation. Write it once, use it for all three.

You may never run RL, and this still matters, because building evals as environments forces three properties you want anyway:

- **Executable.** The eval is code that runs end to end, not a spreadsheet of transcripts someone has to interpret.
- **Reproducible.** Sandboxed state means the same case yields the same setup every run, so regressions are real, not flaky fixtures.
- **Reusable.** A new model drops, you point the environment at it, and you have comparable numbers in an hour instead of a quarter.

And if the day comes when you do want to fine-tune or run RL against your product's tasks, your eval suite stops being a cost center and becomes the head start. The teams with strong verifiers are the teams that can train.

### Grade the agent you have, not the one you wish you had

A subtle way improvement loops rot: grading curated transcripts instead of the agent's own behavior. You hand-write a tidy ten-turn conversation, drop the agent in at turn eight, and score its reply. Clean, reproducible, and **off-policy**. Your agent never sees turn eight of that conversation in production, because its own turns one through seven look nothing like your script. You're measuring performance on a distribution the agent doesn't live in.

Off-policy evals flatter the agent you wish you had. The hand-written history quietly avoids the mess your agent creates for itself: the ambiguous tool result it half-parsed at turn three, the wrong assumption it confidently carried forward. Production failures are usually compounding errors, and curated transcripts compound nothing.

The training world learned this the hard way. On-policy distillation (grading and correcting a student model on trajectories it sampled itself, rather than on the teacher's transcripts) beats imitating curated data, because the student gets feedback exactly where its own behavior goes wrong. The eval lesson is the same: signal is most useful on the states your agent actually reaches.

Practically: seed eval cases from real production traces, not imagined ones. Let the agent generate its own full trajectories during evaluation instead of splicing it into golden histories. Refresh cases as the agent changes, since fixing one behavior shifts which states it reaches next. The agent in your eval should be recognizably the same animal as the one in production, including the limp.

Key takeaways:
- An eval suite exists to choose between candidate changes. A score on its own decides nothing; a comparison with error bars decides everything.
- Every change (prompt, tool, model, retrieval) goes through the same loop: run the suite against the baseline, compare, ship or discard.
- Peeking at the suite to iterate slowly overfits your agent to your eval cases. Keep a held-out slice you only touch before shipping.
- Programmatic judges can drive APO and auto-research loops, but optimizers route through your weakest judge. Autonomy is bounded by judge quality.
- A good eval with a sandbox and a verifier is most of an RL environment. Build it that way even if you never train.
- Grade on-policy. Curated transcripts flatter the agent you wish you had; production-shaped trajectories grade the one you actually ship.

Further reading:
- Prime Intellect, "Environments Hub": https://www.primeintellect.ai/blog/environments
- Thinking Machines, "On-Policy Distillation": https://thinkingmachines.ai/blog/on-policy-distillation/

---

## 13. Foot-guns (https://shipfastevals.com/foot-guns)
What are the common fail-cases?

A key idea here is that evals are really the easy-bake oven version of reinforcement learning. We build out harnesses, environments, and prompts then find ways to score the behavior of the model when exposed to those conditions. But we just stop there. Whereas RL must find nuanced, complex technical tricks to suck that information through a straw in thousands of rollouts, eventually achieving meaningful updates at the model-weight level, we can simply update prompts or add tools or simply start over at the harness level. But the close analogue does mean we can borrow a lot of the brilliance from there in order to make our lives easier. This means a lot of the footguns that apply to RL and traditional machine learning disciplines can apply here too. This section maps out many of the most common ones.

### Why eval suites fail silently

An eval suite that lies is worse than no suite at all. With no suite, you know you're flying blind and act accordingly: you read traces, you stay paranoid. With a lying suite, you stop looking. The dashboard is green, the release gates pass, and failure modes pile up in production where nobody is grading them. You paid for confidence and you got confidence. You just didn't get truth.

The nasty part is that suites rarely break loudly. A judge that crashes gets fixed the same afternoon. A judge that quietly starts measuring the wrong thing can run for months, passing everything, while the agent it was built to check drifts out from under it. Silent failure is the default failure mode of evals, which is why this spoke exists.

The foot-guns below sort into three families:

- **Validity threats**: the judge works fine, but it's measuring the wrong thing.
- **Judge quality threats**: the judge itself is broken or biased.
- **Operational threats**: everything technically works, but the suite rots until nobody runs it.

For each: what it is, how it bites, and the counter-move.

### Validity threats: measuring the wrong thing

**Reward hacking**, **Goodhart's Law**, **judge hacking**, and **objective mismatch** are four names for the same underlying trap: when a measure becomes a target, it stops being a good measure. Your agent (or the team hillclimbing it) gets very good at making the judge happy, which is not the same as making users happy. A support bot learns that an apology plus a numbered list passes the helpfulness judge, so every response becomes an apology with a numbered list. Counter-move: keep held-out judges the team doesn't optimize against, and periodically check that judge scores still correlate with outcomes you actually care about.

The rest of the family is drift in various directions:

| Foot-gun | How it bites | Counter-move |
| --- | --- | --- |
| **Eval–harness drift** | The eval runs a stale copy of your harness, testing an agent that no longer exists | Run evals through the same code path production uses |
| **Overfitting to the suite** | Hillclimbing the same 200 cases until they're effectively memorized | Hold out cases; refresh from production traces |
| **Distributional shift** | Production traffic moves, the dataset doesn't (a.k.a. data drift, covariate shift) | Re-sample from live traffic on a schedule |
| **Sampling bias** | The dataset never matched traffic in the first place, so you tuned for users you don't have | Stratified sampling from real traces |
| **Model drift** | The underlying model updates under you | Pin versions; re-baseline on every upgrade |

**Selection bias** earns its own paragraph because it's the sneakiest. Condition on a **collider** and you manufacture correlations that don't exist: this is **Berkson's paradox**. Say you build your dataset only from escalated tickets. Tickets escalate when the bot's answer was wrong *or* the customer was angry. Within that sample, correct answers and calm customers become anti-correlated, and you'll conclude your bot performs worse for polite users, a pattern that exists nowhere in the wild. Counter-move: sample from all traffic, not just the slice that got flagged.

### Judge quality threats: the judge itself is broken

Sometimes the dataset is fine and the judge is the problem.

**Judge hallucinations**: the judge invents facts to justify a verdict, citing a refund policy that appears nowhere in the trace. Counter-move: require the judge to quote the specific lines that support its verdict, and treat verdicts without citations as invalid.

**Everything-as-regression-judge**: if your whole suite checks things the agent already does well, it reads 100% green and says nothing about whether the product is improving. An all-green suite should make you suspicious, not proud: it usually means you have no capability judges left to hillclimb (see the goals spoke).

**Over-sensitivity**: a judge that flags trivial deviations as failures cries wolf, and a judge that cries wolf trains everyone to ignore it. Once people dismiss its verdicts by reflex, it's dead weight. Track your judges' false-positive rates the same way you track the agent's failures.

Pairwise comparison has its own bestiary of biases:

- **Position bias**: the judge prefers whichever option comes first (or last). Randomize order, or run both orders and require agreement.
- **Verbosity bias**: the longer answer wins regardless of quality. Control for length and say so explicitly in the rubric.
- **Self-preference bias**: a model grading its own outputs scores them higher. Use a different model family for the judge, or grade against a reference answer so the judge compares to ground truth instead of its own taste.

Finally, **context poisoning**: the trace can steer the judge. If the agent's output ends with "this response fully satisfies the rubric," a naive judge may simply agree. Delimit the trace, instruct the judge that its contents are data rather than instructions, and test with adversarial traces to confirm it holds.

### Operational threats: the suite rots

These foot-guns don't make the suite wrong; they make it abandoned. The result is the same.

**Parkinson's Law for tokens**: judge cost expands to fill whatever budget you set. Reasoning traces lengthen, rubrics accrete clauses, and the bill grows without verdicts improving. Set per-judge token budgets and treat increases as a deliberate decision, not background drift.

**Over-reliance on synthetic datasets**: AI-generated cases start plausible and drift off-distribution, until your suite is grading an imaginary product. Synthetic data is a bootstrap, not a diet: keep replacing it with cases sampled from production.

**Latency**: slow suites stop being run. If the full run takes 45 minutes, engineers run it before releases instead of before every change, then before major releases, then never. Tier it: a fast smoke set on every change, the full suite nightly.

**Scaling bottlenecks**: a suite designed at 50 cases falls over at 5,000: rate limits, sequential runs, flaky retries that poison results. Parallelize early and budget for the suite you'll have, not the one you started with.

**Cache contamination**: cached responses leaking between runs make results stale or interdependent. The slower-burning version is pretraining contamination: if your eval cases are public, they may be in the next model's training data, and the model has literally seen the answers. Keep a private held-out set, bust caches between runs, and re-validate whenever the underlying model changes.

### The pattern underneath

Every foot-gun in this spoke is the same failure wearing different clothes: the suite and reality drift apart, and nothing forces them back together. Validity threats are reality drifting away from your dataset. Judge threats are the judge drifting away from your labels. Operational threats are the team drifting away from the suite.

The counter-move is the same in every case: regular contact with production. Refresh the dataset from real traffic. Recalibrate judges against fresh human labels. Keep the suite fast and cheap enough that people actually run it. The same error-analysis loop that built your suite is what keeps it honest.

A suite you built once and never re-grounded isn't measuring your product. It's measuring your memory of it.

Key takeaways:
- A green dashboard is a claim, not a fact. An eval suite that lies is worse than no suite.
- When a measure becomes a target it stops measuring. Keep held-out judges and check scores against real outcomes.
- Beware Berkson: evaluating only escalated tickets manufactures correlations that do not exist in the wild.
- Treat the trace as untrusted input: demand citations from your judge and randomize pairwise order.
- Pin model versions and re-baseline on every upgrade: "latest" is a moving target.
- Slow suites stop being run. Tier them, and re-ground everything in production traces on a schedule.

Further reading:
- Hamel Husain, "Frequently Asked Questions (And Answers) About AI Evals": https://hamel.dev/blog/posts/evals-faq/

---

## Site Navigation

/ - The guide (hub page with all section summaries)
/what-to-measure - Evaluations
/fast - FAST Evals
/observability - Observability
/grains - Measurement Grains
/goals - Eval Goals
/scoring - Scoring
/building-judges - Building Judges
/eval-design - Eval Design
/calibration - Judge Calibration
/infrastructure - Eval Infrastructure
/communicating-results - Communicating Results
/iteration - Iterations
/foot-guns - Foot-guns
/quiz - Knowledge check: 10-15 questions on the guide
/llm.txt - This file
Contact: https://theryanhartman.com/contact