Offline Evals in Action

Anatomy of a Trace

This walkthrough follows a single agent run to show how offline evals behave in a production system. Stitching the run's events and spans into one trace gives a causal view of how the answer was produced.

The run, captured then judged ↓

Act 1 · capture · live

Research agent

What's the refund window on annual plans?

Planning research…

Searching the knowledge base…

Reading sources…

Writing the answer…

You can refund an annual plan within 30 days of purchase.

Source: refund policy

session sess_4f1a

trace trace_9c2e

▸ plan span

· decompose → "find official refund policy for annual billing"

▸ search(knowledge_base) span

· query → "refund window annual plan"

top_k = 5

· 5 chunks returned

#1 billing-policy · #2 terms · #3 community-wiki · #4 changelog · #5 faq

▸ read · rank span

· read top-ranked chunks

⚠ rank → select chunk #3 as primary source

community-wiki · says "30-day refunds"

▸ synthesize span

· draft answer from chunk #3

"…within 30 days…"

On the left is what the user sees; on the right is the raw telemetry, with every event and span logged under the same trace ID. The answer reads fine on the left; the bad source behind it shows up only in the chain on the right.
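As a rough illustration of how a trace like this might be stored, here is a minimal sketch in Python. The class names (`Event`, `Span`, `Trace`), the event names, and the attribute keys are all assumptions made for this example; the guide does not specify a schema, and real tracing libraries define their own.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str    # hypothetical event name, e.g. "query"
    attrs: dict  # free-form payload recorded with the event


@dataclass
class Span:
    name: str  # e.g. "plan", "search", "read_rank", "synthesize"
    events: list = field(default_factory=list)


@dataclass
class Trace:
    session_id: str
    trace_id: str
    spans: list = field(default_factory=list)

    def find_events(self, name):
        """Return every event with the given name, in logged order."""
        return [e for s in self.spans for e in s.events if e.name == name]


# The run shown above, rebuilt as plain data.
trace = Trace(
    session_id="sess_4f1a",
    trace_id="trace_9c2e",
    spans=[
        Span("plan", [Event("decompose", {"goal": "find official refund policy for annual billing"})]),
        Span("search", [
            Event("query", {"text": "refund window annual plan", "top_k": 5}),
            Event("chunks_returned", {"sources": ["billing-policy", "terms", "community-wiki", "changelog", "faq"]}),
        ]),
        Span("read_rank", [Event("select_primary_source", {"rank": 3, "source": "community-wiki"})]),
        Span("synthesize", [Event("draft_answer", {"from_rank": 3})]),
    ],
)

selection = trace.find_events("select_primary_source")[0]
print(selection.attrs["source"])  # community-wiki
```

Because every span hangs off the same trace ID, a grader can pull out just the event it cares about instead of re-reading the final answer.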

run finished → graded offline

Act 2 · grade · offline

LLM-as-judge · trace · capability

Grounded answer

“Does the final answer cite a retrieved source for its key claim?”

Whether an answer counts as 'grounded' is a fuzzy call, so this grader is an LLM that reads the whole trace. That costs a model call, and the verdict only means something once you've calibrated it against human labels.

Pass

The answer does use information from its tool call, so it passes. However, the agent didn't check where that information came from, and this judge doesn't ask. A hard-coded check catches that kind of error more consistently.

↳ reads the whole trace
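Calibrating a judge against human labels can be as simple as comparing verdicts on the same set of traces. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function name and the six example labels are made up for illustration; they are not results from this guide.

```python
def agreement_and_kappa(judge, human):
    """Compare boolean pass/fail verdicts from an LLM judge and a human.

    Returns (observed_agreement, cohens_kappa). Kappa is NaN when both
    raters give one constant label, since chance agreement is then 1.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n  # share of traces the judge passed
    p_human = sum(human) / n  # share of traces the human passed
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical verdicts on six traces (True = pass).
judge_labels = [True, True, True, False, True, False]
human_labels = [True, False, True, False, True, True]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"{observed:.2f} {kappa:.2f}")  # 0.67 0.25
```

In this toy data the judge agrees with the human on four of six traces, but kappa is only 0.25, which suggests the judge prompt needs more work before its verdicts are trusted.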

Deterministic · event · regression

Trusted-source guard

“Is every source the agent relied on in the official-docs allowlist?”

This is a code-based grader, which makes it cheap and fast relative to LLM judges, and it returns the same verdict on every run.

Fail

Chunk #3 comes from a community wiki, which is not on the allowlist. A check like this is typically added after error analysis shows that this kind of context poisoning keeps recurring.

↳ reads the chunk #3 selection event
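A guard like this can be a few lines of ordinary code. The following is a minimal sketch, not the guide's actual implementation: the function name, the event shape (a dict with a `source` key), and the allowlist contents are assumptions. The guide only tells us that the community wiki is not on the list.

```python
# Assumed allowlist contents; the guide only says the community wiki is not on it.
OFFICIAL_DOCS_ALLOWLIST = frozenset({"billing-policy", "terms"})


def trusted_source_guard(selection_events, allowlist=OFFICIAL_DOCS_ALLOWLIST):
    """Fail if any source the agent relied on is missing from the allowlist.

    selection_events: list of dicts with a "source" key (hypothetical shape).
    """
    relied_on = {event["source"] for event in selection_events}
    offending = sorted(relied_on - allowlist)
    return {"passed": not offending, "offending_sources": offending}


# The chunk #3 selection event from the trace, as a plain dict.
events = [{"name": "select_primary_source", "rank": 3, "source": "community-wiki"}]
verdict = trusted_source_guard(events)
print(verdict)  # {'passed': False, 'offending_sources': ['community-wiki']}
```

Returning the offending sources alongside the verdict makes a failure explain itself, which helps when triaging a large batch of graded runs.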

Your real eval suite will carry many graders covering a wide variety of failure modes. Being able to dig into why a specific run failed tightens the feedback loop and helps you improve your agent.

The correct answer was 14 days (per the official Billing Policy).

Where this lives in the guide