A worked example

Anatomy of a Trace

The rest of the guide is mostly principle. This page is one concrete run: a research agent answers a question, every step it takes gets logged as raw telemetry, and then two judges grade the finished run and disagree about it.

The run, captured then judged ↓

Act 1 · capture · live

Research agent
What's the refund window on annual plans?
Planning research…
Searching the knowledge base…
Reading sources…
Writing the answer…

You can refund an annual plan within 30 days of purchase.

Source: refund policy
session sess_4f1a
trace trace_9c2e
plan span
· decompose → "find official refund policy for annual billing"
search(knowledge_base) span
· query → "refund window annual plan"
top_k = 5
· 5 chunks returned
#1 billing-policy · #2 terms · #3 community-wiki · #4 changelog · #5 faq
read · rank span
· read top-ranked chunks
rank → select chunk #3 as primary source
community-wiki · says "30-day refunds"
synthesize span
· draft answer from chunk #3
"…within 30 days…"

It's the same run shown two ways. The left is what the user sees; the right is what actually got logged. Notice that the failure is invisible on the left and only shows up on the right.

the run becomes the artifact → graded offline

Act 2 · grade · offline

session sess_4f1a
trace trace_9c2e
plan span
· decompose → "find official refund policy for annual billing"
search(knowledge_base) span
· query → "refund window annual plan"
top_k = 5
· 5 chunks returned
#1 billing-policy · #2 terms · #3 community-wiki · #4 changelog · #5 faq
read · rank span
· read top-ranked chunks
rank → select chunk #3 as primary source
community-wiki · says "30-day refunds"
synthesize span
· draft answer from chunk #3
"…within 30 days…"
LLM-as-judge trace capability

Grounded answer

“Does the final answer cite a retrieved source for its key claim?”

Whether an answer counts as 'grounded' is a fuzzy call, so this one is an LLM reading the whole trace. That costs a model call, and the verdict only means something once you've calibrated it against human labels.

Pass

The answer does pin its 30-day claim to a retrieved chunk, so by this rubric it's grounded and passes. The catch is that grounded isn't the same as correct: the judge reads the whole trace but has no way to know that chunk #3 is a source it shouldn't trust.

↳ reads the whole trace

Deterministic event regression

Trusted-source guard

“Is every source the agent relied on the official-docs allowlist?”

No model here — it's just code checking the cited source against an allowlist. That makes it basically free, perfectly repeatable, and there's nothing to calibrate.

Fail

Chunk #3 is a community wiki, which isn't on the allowlist, so this fails. We only added the check after the wiki poisoned an answer once before — that history is what makes it a regression judge — and it trips on the exact step where the bad source got picked.

↳ reads the chunk #3 selection event

The takeaway

Both judges read the exact same run, and they disagreed. The capability judge passed it — the answer really did cite a source, it just couldn't tell the source was junk. The regression judge failed it, because it was sitting right where the bad source got picked. In practice, a green checkmark isn't the same as a correct answer, and the judge closest to where things actually broke is usually the one that catches it.

The correct answer was 14 days — per the official Billing Policy (chunk #1).