Section 03 · Can you trace what happened?

Observability

You can't judge what you never logged. Capture the causal chain first.

Log it before you can measure it

Every eval in this guide assumes one thing that almost nobody sets up first: the run was captured. A judge reads a trace and renders a verdict, but if the trace was never logged, there is nothing to read. You can have the sharpest rubric in the world and it grades thin air.

Deterministic software lets you cheat here. If a function misbehaves, you re-run it with the same inputs and watch it misbehave again, because the same inputs always produce the same execution. Agents don't grant you that. The same prompt can take a different path on every run, the agent mutates state as it goes, and the model version under you can change next Tuesday. The run is not reproducible from its inputs, which means the run itself is the artifact. If you didn't record it as it happened, it's gone, and so is any chance of judging it.

This is why observability comes before measurement grains, before scoring, before any judge. It is the substrate the rest of the loop stands on. The teams that struggle hardest with evals are usually not the ones with bad judges; they're the ones who went to build a judge, opened their logs, and found the one field they needed was never written down.

Capture the chain, not just the events

Traditional logging is event-based: discrete, independent records. "Request received." "Tool called." "Response sent." For request-response software that's enough, because each event stands alone and the next one doesn't much care what the last one did.

Agents break that assumption. Step N's output is step N+1's input. The model reads a tool result, decides what to do next, reads the next result, and on and on. A failure eight steps back doesn't announce itself; it surfaces as a confident, plausible, wrong final answer, and by then the event that caused it is buried in a pile of unrelated events. A flat event log can tell you what happened. It usually can't tell you what caused what, and causation is the whole question.

The fix is to log the structure, not just the moments. The grains spoke lays out the altitudes: sessions hold traces, traces hold spans, spans hold events. Observability is what records the edges between them: which span spawned which, what each step consumed, what it produced, where it branched. Stitch those together and a trace stops being a timeline and becomes a tree you can walk from a bad outcome back to the decision that caused it. Across long, multi-turn sessions that walkability is the difference between "the agent got worse this week" and "the agent started trusting a stale retrieval result on turn three."

Figure (flowchart coming soon): Tracing the causal chain. Events nest into spans, spans into traces, traces into sessions, linked by what each step consumed and produced, so a bad outcome can be walked back to its cause.
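
To make the "tree you can walk" concrete, here is a minimal sketch in Python. The Span fields, the consumed list, and the walk_back helper are all hypothetical names chosen for illustration, not a schema this guide prescribes. The only assumption is that each span records which earlier spans' outputs it read.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    # Ids of earlier spans whose output this step read (hypothetical field).
    consumed: list = field(default_factory=list)


def walk_back(spans, bad_span_id):
    """Follow 'consumed' edges from a bad span back to everything upstream of it."""
    order, seen, stack = [], set(), [bad_span_id]
    while stack:
        span_id = stack.pop()
        if span_id in seen:
            continue
        seen.add(span_id)
        order.append(span_id)
        stack.extend(spans[span_id].consumed)
    return order


# A toy trace: the final answer consumed the reasoning step,
# which consumed the retrieval step. The weather call is unrelated.
SPANS = {s.span_id: s for s in [
    Span("turn", None, "agent turn"),
    Span("retrieve", "turn", "retrieval"),
    Span("weather", "turn", "unrelated tool call"),
    Span("reason", "turn", "reasoning", consumed=["retrieve"]),
    Span("answer", "turn", "final response", consumed=["reason"]),
]}

lineage = walk_back(SPANS, "answer")
print(lineage)  # ['answer', 'reason', 'retrieve']
```

The unrelated tool call never shows up in the lineage. A flat event log sorted by timestamp would have interleaved it with the steps that mattered.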

Put the judge where the signal is

Once the chain is captured, you get to choose where on it a judge looks, and the choice decides how early you find out something broke.

Hang every judge off the final response and you only ever catch failures at the end, after the agent has already acted on the mistake. Put a judge on the span where the decision actually happens and you catch the failure at its source. Take a research agent that retrieves a poisoned document and then reasons confidently from it: a judge on the retrieval span flags the bad source the instant it lands, while a judge that only reads the final answer sees a fluent summary and shrugs. Same failure, two very different warning times.

That's the practical reading of the idea that judges placed at the right point in the chain catch the effects of a change long before users do. A regression doesn't have to ride all the way out to a user complaint; a span-level judge sitting next to where the behavior lives can trip the moment a deploy changes it. The constraint is the obvious one: you can only judge a step you logged as its own distinct span. Instrumentation granularity sets the ceiling on judge placement, so the decision of what to capture in this stage quietly bounds every judge you'll be able to build later.
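
The research-agent example can be sketched as judges keyed by span kind. Everything here is an assumption made for illustration: the span dictionary shape, the kind labels, the allowlist rule, and the judge functions themselves. A real judge would likely be a model call, not a one-line check.

```python
# Hypothetical allowlist; a real source check would be far richer.
TRUSTED_HOSTS = {"docs.example.org"}


def retrieval_source_judge(span):
    """Pass only if every retrieved document comes from a trusted host."""
    return all(doc["host"] in TRUSTED_HOSTS for doc in span["output"])


def final_answer_judge(span):
    """Toy end-of-chain check: the answer exists and is non-empty."""
    return bool(span["output"].strip())


# Judges hang off the kind of span they know how to read.
JUDGES_BY_KIND = {
    "retrieval": [retrieval_source_judge],
    "final_response": [final_answer_judge],
}


def run_judges(trace):
    """Return (span_id, judge_name) for each failed check, in trace order."""
    failures = []
    for span in trace:
        for judge in JUDGES_BY_KIND.get(span["kind"], []):
            if not judge(span):
                failures.append((span["span_id"], judge.__name__))
    return failures


TRACE = [
    {"span_id": "s1", "kind": "retrieval",
     "output": [{"host": "unvetted.example.net", "text": "planted claim"}]},
    {"span_id": "s2", "kind": "reasoning",
     "output": "The source states the planted claim, so it must hold."},
    {"span_id": "s3", "kind": "final_response",
     "output": "A fluent, confident summary of the planted claim."},
]

failures = run_judges(TRACE)
print(failures)  # [('s1', 'retrieval_source_judge')]
```

The final-answer judge passes this trace. Only the judge attached to the retrieval span fails, and it could only be attached because retrieval was logged as its own span.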

What to actually capture

"Log everything" is the right instinct and useless as an instruction. Concretely, a trace you can eval needs:

  • The real inputs. The full system prompt, the user message, and whatever context got injected: retrieved chunks, memory, tool schemas. Not a summary of them.
  • The raw outputs. What the model actually emitted, including the parts the UI hides. The cleaned-up chat bubble is where failures go to look fine; judge the raw output, not the rendered one.
  • Every tool interaction. Tool name, the exact arguments the agent passed, and the exact result it got back. Many agent failures are tool-call failures, and they're invisible if you only logged that a tool was "called."
  • The intermediate state. Reasoning traces, scratchpad steps, plan revisions, the points where the agent changed its mind. This is where process-level judges live.
  • The stitching IDs. A trace id, span ids, and parent pointers, so the chain from the previous section can actually be reconstructed instead of guessed at from timestamps.
  • The reproducibility metadata. Model version (the exact snapshot, never "latest"), prompt version, temperature, and timestamps. The scoring spoke's "pin everything" rule starts here: a score you can't attribute to a known configuration is a rumor.

The test is simple: could you replay this run, or explain exactly why it failed, from your logs alone? If not, you have telemetry, not observability, and the gap is precisely the failures you didn't think to capture.
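
One way to turn the capture list into something checkable is a gap report run against each logged record. This is a sketch under assumptions: the field names below are hypothetical labels for the items in the list, and a real schema would follow whatever your tracing layer emits.

```python
# Hypothetical field names, one per item in the capture list above.
REQUIRED_FIELDS = [
    "trace_id", "span_id", "parent_id",                    # stitching ids
    "system_prompt", "user_message", "injected_context",   # real inputs
    "raw_output",                                          # what the model emitted
    "tool_calls",                                          # every tool interaction
    "intermediate_state",                                  # reasoning, plan revisions
    "model_version", "prompt_version", "temperature", "timestamp",
]


def capture_gaps(record):
    """List what a logged record lacks before anyone tries to judge it."""
    gaps = [f"missing: {name}" for name in REQUIRED_FIELDS if name not in record]
    if record.get("model_version") == "latest":
        gaps.append("unpinned: model_version")
    for call in record.get("tool_calls", []):
        for key in ("name", "arguments", "result"):
            if key not in call:
                gaps.append(f"tool call missing: {key}")
    return gaps


# A thin record of the kind event-style logging tends to produce.
thin = {
    "trace_id": "t-1",
    "span_id": "s-1",
    "user_message": "Where is my order?",
    "raw_output": "Your order shipped yesterday.",
    "tool_calls": [{"name": "lookup_order"}],  # logged that it was called, nothing more
    "model_version": "latest",
}

for gap in capture_gaps(thin):
    print(gap)
```

A root span still carries a parent_id key, set to None, so the check looks for the key, not for a truthy value.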

Instrument before you need it

Observability has an asymmetry that punishes procrastination: you can add a judge to last month's traces, but you cannot add logging to them. The data either exists or it doesn't. By the time a failure mode is interesting enough to measure, the traces that would have characterized it have already streamed past ungrabbed.

So the cost of under-instrumenting isn't paid today; it's paid the day you go looking for a signal and discover the relevant field was never written. And it's paid worst on exactly the failures you most want to catch, the surprising ones, because those are by definition the ones you didn't build a bespoke log line for in advance. Rich, structured capture from the start is what lets error analysis (the bottom-up half of building judges) find failures you never predicted.
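
The asymmetry shows up as soon as you try to run a new judge over stored traces. In this sketch the stored span shape, the tool_arguments field, and both function names are hypothetical. The judge is new; the old spans are whatever they were on the day they were written.

```python
def backfill(stored_spans, judge, needs):
    """Apply a new judge to old spans; report the ones it cannot read."""
    verdicts, unjudgeable = {}, []
    for span in stored_spans:
        if all(name in span for name in needs):
            verdicts[span["span_id"]] = judge(span)
        else:
            # The field was never written, and no judge can recover it now.
            unjudgeable.append(span["span_id"])
    return verdicts, unjudgeable


def nonempty_query_judge(span):
    """New judge, written after the fact: did the agent search for something?"""
    return bool(span["tool_arguments"].get("query", "").strip())


STORED = [
    # Logged before tool arguments were captured.
    {"span_id": "old-1", "tool_name": "search"},
    # Logged after.
    {"span_id": "new-1", "tool_name": "search",
     "tool_arguments": {"query": "refund policy"}},
    {"span_id": "new-2", "tool_name": "search",
     "tool_arguments": {"query": "  "}},
]

verdicts, unjudgeable = backfill(STORED, nonempty_query_judge, needs=["tool_arguments"])
print(verdicts)     # {'new-1': True, 'new-2': False}
print(unjudgeable)  # ['old-1']
```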

This is also the substrate online evals run on. Catching a regression in production before your support queue does means judges reading live traces as they happen, and that only works if production is already emitting full, chain-linked traces. Wire the instrumentation in early, decide your sampling and retention deliberately (you rarely need to keep 100% of traffic forever, but you do need an honest sample of the hard cases), and treat the trace schema as part of the product. Everything downstream in this loop is only as good as what this stage bothered to write down.
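
A deliberate sampling policy can be as small as one function. The sketch below assumes two hypothetical signals, had_error and judge_flagged, plus an illustrative 5% base rate. It always keeps the hard cases and keeps a deterministic slice of everything else, so the same trace id gets the same decision on every host.

```python
import hashlib


def keep_trace(trace_id, *, had_error, judge_flagged, base_rate=0.05):
    """Decide whether to retain a trace in full."""
    # Hard cases are always kept; these are the ones error analysis needs.
    if had_error or judge_flagged:
        return True
    # Hash the trace id into [0, 1) so the decision is stable across processes.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


print(keep_trace("trace-123", had_error=True, judge_flagged=False))
print(keep_trace("trace-123", had_error=False, judge_flagged=False))
```

Hashing the id instead of calling a random number generator is a design choice: every span of a trace lands on the same side of the cut, so sampled traces stay complete.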

If you remember nothing else

  • 01 You can only judge what you logged. Observability is the precondition for every eval downstream, which is why it comes before grain, scoring, and judges.
  • 02 Event logs record what happened; the causal chain records what caused what. Agents need the chain because each step feeds the next.
  • 03 Place judges at the point in the chain where a failure is born, not just on the final output, to catch effects before users do.
  • 04 Capture the raw run: full inputs, tool arguments and results, intermediate state, stitching IDs, and version metadata. Not the rendered UI text.
  • 05 Instrument before you need it. You can't retrofit a trace you never captured, and the failures worth catching are the ones you didn't predict.