A worked example
Anatomy of a Trace
The rest of the guide is mostly principle. This page is one concrete run: a research agent answers a question, every step it takes gets logged as raw telemetry, and then two judges grade the finished run and disagree about it.
The run, captured then judged ↓
Act 1 · capture · live
You can refund an annual plan within 30 days of purchase.
It's the same run shown two ways. The left is what the user sees; the right is what actually got logged. Notice that the failure is invisible on the left and only shows up on the right.
the run becomes the artifact → graded offline
Act 2 · grade · offline
Grounded answer
“Does the final answer cite a retrieved source for its key claim?”
Whether an answer counts as 'grounded' is a fuzzy call, so this one is an LLM reading the whole trace. That costs a model call, and the verdict only means something once you've calibrated it against human labels.
The answer does pin its 30-day claim to a retrieved chunk, so by this rubric it's grounded and passes. The catch is that grounded isn't the same as correct: the judge reads the whole trace but has no way to know that chunk #3 is a source it shouldn't trust.
↳ reads the whole trace
Trusted-source guard
“Is every source the agent relied on the official-docs allowlist?”
No model here — it's just code checking the cited source against an allowlist. That makes it basically free, perfectly repeatable, and there's nothing to calibrate.
Chunk #3 is a community wiki, which isn't on the allowlist, so this fails. We only added the check after the wiki poisoned an answer once before — that history is what makes it a regression judge — and it trips on the exact step where the bad source got picked.
↳ reads the chunk #3 selection event
The takeaway
Both judges read the exact same run, and they disagreed. The capability judge passed it — the answer really did cite a source, it just couldn't tell the source was junk. The regression judge failed it, because it was sitting right where the bad source got picked. In practice, a green checkmark isn't the same as a correct answer, and the judge closest to where things actually broke is usually the one that catches it.
The correct answer was 14 days — per the official Billing Policy (chunk #1).