Section 04 · How does altitude affect our evals?
Measurement Grains
Sessions, traces, spans, events. Pick the altitude before you build the judge.
Four altitudes, one conversation
Take one support interaction and slice it four ways. A user opens a chat with your billing agent, asks why they were double charged, gets an explanation, asks for a refund, and gets one. Same interaction, four grains:
- Session: the entire multi-turn conversation, from "why was I charged twice?" to "refund issued, anything else?" A session-level question: did the user leave with their problem solved?
- Trace: everything triggered by one user turn, also called a trajectory. The turn "can I get a refund?" might kick off five agent steps: look up the account, check refund eligibility, query the payments database, call the refund API, draft the reply. All of that is one trace.
- Span: a single step within that trace. The database query is a span. The refund API call is a span. A span-level question: did the eligibility check use the right account ID?
- Event: an individual action inside a span. Within the database-query span, the SQL string the agent wrote is an event, and so is the parsed result. An event-level question: does this SQL even execute?
Nothing about the interaction changed between these four views, only the altitude you're measuring from. That choice matters because it determines everything downstream: what a "failure" means, what kind of grader can detect it, what it costs to run, and who can act on the result. Pick the grain before you write a single line of rubric.
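To make the nesting concrete, here is a minimal sketch of the four grains as plain data classes. The class and field names are illustrative assumptions, not the schema of any particular tracing library, and the span names are hypothetical labels for the five steps of the refund turn described above.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str      # hypothetical label, e.g. "sql_string" or "parsed_result"
    payload: str


@dataclass
class Span:
    name: str      # hypothetical label for one agent step
    events: list[Event] = field(default_factory=list)


@dataclass
class Trace:
    user_turn: str  # the single user message that triggered this trace
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    traces: list[Trace] = field(default_factory=list)


def count_grains(session: Session) -> dict[str, int]:
    """Count how many units exist at each altitude of one session."""
    spans = [s for t in session.traces for s in t.spans]
    return {
        "sessions": 1,
        "traces": len(session.traces),
        "spans": len(spans),
        "events": sum(len(s.events) for s in spans),
    }


refund_trace = Trace(
    user_turn="can I get a refund?",
    spans=[
        Span("look_up_account"),
        Span("check_refund_eligibility"),
        Span("query_payments_db", [
            Event("sql_string", "SELECT * FROM charges WHERE account_id = 42"),
            Event("parsed_result", "2 rows"),
        ]),
        Span("call_refund_api"),
        Span("draft_reply"),
    ],
)
session = Session(traces=[Trace("why was I charged twice?"), refund_trace])
print(count_grains(session))
# {'sessions': 1, 'traces': 2, 'spans': 5, 'events': 2}
```

The same object answers all four questions; a judge just picks which level of the tree it receives as input.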
The value vs. actionability tradeoff
Coarse grains measure what users actually feel. If your session-level resolution rate climbs from 70% to 85%, customers notice, support tickets drop, and somebody's dashboard turns green. The catch: when a session-level judge fails, you've learned that somewhere in a forty-turn conversation, across twenty traces and a hundred spans, something went wrong. Good luck assigning that to an engineer.
Fine grains flip the tradeoff. "The agent passed a string where the refund API expects an integer" is a bug report you can fix before lunch. But each individual event-level failure is low-stakes on its own: the API call gets retried, the user never notices, and a 2% malformed-call rate might not be why your resolution rate is stuck at 70%.
So neither end of the spectrum is sufficient. Coarse judges tell you whether the product works; fine judges tell you what to change. The connective tissue is decomposition: when a session fails, you want trace- and span-level judges already in place so you can localize the failure instead of re-reading transcripts. A suite that only measures sessions produces meetings. A suite that only measures events produces a wall of green checkmarks on a product nobody likes using.
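One way to picture decomposition: if fine-grained check results are already stored alongside each session, localizing a failed session is a filter, not a transcript review. The sketch below assumes a hypothetical `CheckResult` record and made-up check names; it is not tied to any specific eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    trace_index: int   # which user turn the span belongs to
    span_name: str     # hypothetical span label
    check_name: str    # hypothetical check label
    passed: bool


def localize(session_passed: bool, results: list[CheckResult]) -> list[str]:
    """For a failed session, list the fine-grained checks that also failed."""
    if session_passed:
        return []
    return [
        f"trace {r.trace_index} / {r.span_name}: {r.check_name}"
        for r in results
        if not r.passed
    ]


results = [
    CheckResult(0, "query_payments_db", "sql_executes", True),
    CheckResult(1, "check_refund_eligibility", "uses_session_account_id", False),
    CheckResult(1, "call_refund_api", "args_match_schema", True),
]
print(localize(False, results))
# ['trace 1 / check_refund_eligibility: uses_session_account_id']
```

An empty list for a failed session is itself informative: it suggests the failure lives at a grain that no existing fine check covers.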
The cost and complexity tradeoff
Grain choice also decides what kind of grader you can get away with, and what it costs to run on every trace.
| Grain | Example check | Typical grader | Cost per check |
|---|---|---|---|
| Event | Does this SQL execute? | Deterministic code | Effectively free |
| Span | Did the eligibility check use the right account ID? | Deterministic code | Effectively free |
| Trace | Did the agent resolve the user's request? | LLM-as-a-judge | Model call(s) |
| Session | Did the conversation stay coherent across turns? | LLM-as-a-judge, long context | Expensive model calls |
At the event and span level, "correct" is usually checkable by a program: run the SQL, validate the JSON against the tool schema, regex the date format. These checks are fast, deterministic, and never need calibration: a wrong verdict is a bug in your checker, not a judgment call.
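The three checks named above can each be sketched in a few lines of standard-library Python. Several things here are assumptions for illustration: the `charges` schema, the argument names, SQLite standing in for the production database (dialects differ, so a real check would target the real engine), and a hand-rolled type check standing in for a full JSON Schema validator.

```python
import re
import sqlite3

# Hypothetical schema, used only to give the SQL something to run against.
SCHEMA = "CREATE TABLE charges (id INTEGER, account_id INTEGER, amount_cents INTEGER);"


def sql_executes(sql: str, schema_ddl: str = SCHEMA) -> bool:
    """Does the SQL run against an empty in-memory copy of the schema?"""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def args_match_schema(args: dict, expected_types: dict[str, type]) -> bool:
    """Are exactly the expected arguments present, each with the expected type?"""
    return set(args) == set(expected_types) and all(
        type(args[name]) is expected for name, expected in expected_types.items()
    )


ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(text: str) -> bool:
    """Does the text look like YYYY-MM-DD? (Format only, not calendar validity.)"""
    return ISO_DATE.fullmatch(text) is not None


# Hypothetical argument spec for a refund tool.
REFUND_ARGS = {"charge_id": int, "amount_cents": int}

print(sql_executes("SELECT amount_cents FROM charges WHERE account_id = 42"))  # True
print(sql_executes("SELEC amount_cents FROM charges"))                         # False
print(args_match_schema({"charge_id": "1001", "amount_cents": 4999}, REFUND_ARGS))  # False
print(is_iso_date("2024-03-01"))                                               # True
```

The second `args_match_schema` call is the string-where-an-integer-was-expected bug from earlier in this section, caught without a model call.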
At the trace and session level, "correct" stops being mechanical. Whether the agent actually resolved the request depends on context, phrasing, and what the user meant. Deterministic code can't classify that nuance; at scale it takes an LLM judge, and even then only after you've calibrated it against human labels. You're paying a model call (sometimes several, over long contexts) per evaluation, plus the ongoing cost of keeping the judge honest.
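Calibration here means comparing the judge's verdicts with human labels on the same items. A minimal sketch, assuming binary pass/fail verdicts and using raw agreement plus Cohen's kappa (one common agreement statistic; the text does not prescribe a specific one). The label lists are made-up illustration data.

```python
import math


def agreement_stats(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Raw agreement and Cohen's kappa between judge verdicts and human labels."""
    n = len(human)
    if n == 0 or len(judge) != n:
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Agreement expected by chance if both raters labeled independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if math.isclose(expected, 1.0):
        kappa = float("nan")  # undefined when both raters use a single label
    else:
        kappa = (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


# Made-up labels for ten traces: True means "request resolved".
judge_labels = [True, True, False, False, True, False, True, True, False, True]
human_labels = [True, True, False, True, True, False, True, False, False, True]
stats = agreement_stats(judge_labels, human_labels)
print(round(stats["agreement"], 2), round(stats["kappa"], 2))
# 0.8 0.58
```

Kappa is reported next to raw agreement because a judge that always says "pass" can score high agreement on a mostly-passing dataset while carrying no signal.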
The practical consequence: deterministic span and event checks are cheap enough to run on 100% of production traffic, while session-level judges often run on a sample. Build accordingly.
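A sketch of what "run on a sample" can look like in practice: hash the session ID into a bucket so the same session is always in or out of the sample, regardless of which worker sees it. The 10% rate and the ID format are assumptions for illustration.

```python
import hashlib


def in_sample(session_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of sessions by hashing the ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


SESSION_JUDGE_RATE = 0.10  # hypothetical budget: judge about 10% of sessions

session_ids = [f"session-{i}" for i in range(10_000)]
judged = [sid for sid in session_ids if in_sample(sid, SESSION_JUDGE_RATE)]
print(len(judged) / len(session_ids))  # close to 0.10
```

Deterministic span and event checks would run outside this gate on every session; only the expensive judge call sits behind `in_sample`.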
Which grain does the failure live at?
When you find a failure mode worth measuring, the first design question isn't "what's the rubric?" It's "what's the smallest grain where this failure is fully visible?"
Work it from the bottom up. Malformed tool call? That's visible in a single event. Build a deterministic event check and stop. Agent queried the wrong table? Visible in one span. Agent answered the question but ignored half of what the user asked? You need the whole trace, because no single span is wrong: the failure is in what's missing. Agent contradicted something it said ten turns ago, or kept re-asking for the order number? Only a session-level judge can see that, because the evidence is spread across turns.
Measuring above the failure's natural grain wastes money and dilutes the signal: a session judge that exists to catch SQL errors is an expensive, noisy way to do what a one-line check does perfectly. Measuring below it is worse: the judge literally cannot see the failure, and you get confident green checkmarks on broken behavior.
This is also why real suites end up with judges at several grains. Your failure modes don't all live at one altitude, so your judges can't either.
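The bottom-up walk above can be written as a small decision function. This is one reading of the text, not an official rule: the three yes/no questions and their names are assumptions, ordered so the function returns the smallest grain at which all the evidence is in view.

```python
from enum import IntEnum


class Grain(IntEnum):
    EVENT = 1
    SPAN = 2
    TRACE = 3
    SESSION = 4


def smallest_visible_grain(
    *,
    needs_other_turns: bool,   # evidence is spread across user turns
    needs_other_spans: bool,   # evidence spans several steps, or is an omission
    needs_whole_step: bool,    # one action is not enough; the step's context matters
) -> Grain:
    """Return the smallest grain at which the failure is fully visible."""
    if needs_other_turns:
        return Grain.SESSION
    if needs_other_spans:
        return Grain.TRACE
    if needs_whole_step:
        return Grain.SPAN
    return Grain.EVENT


# The four examples from the text, encoded under the assumptions above.
malformed_call = smallest_visible_grain(
    needs_other_turns=False, needs_other_spans=False, needs_whole_step=False)
wrong_table = smallest_visible_grain(
    needs_other_turns=False, needs_other_spans=False, needs_whole_step=True)
ignored_half = smallest_visible_grain(
    needs_other_turns=False, needs_other_spans=True, needs_whole_step=True)
contradiction = smallest_visible_grain(
    needs_other_turns=True, needs_other_spans=True, needs_whole_step=True)
print(malformed_call.name, wrong_table.name, ignored_half.name, contradiction.name)
# EVENT SPAN TRACE SESSION
```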
Figure: Picking a measurement grain (flowchart coming soon)
How grains interact with the FAST gates
Grain choice and the FAST gates pull on each other, mostly in one direction: the coarser the grain, the harder the gates get.
A session-level "conversation quality" judge is the canonical multi-gate failure. It flunks Specific because quality could mean tone, accuracy, latency, or formatting: several judges in a trench coat, again. It flunks Actionable because a score drop points at nothing in particular. And it strains Falsifiable, because two reviewers can read the same conversation and disagree about whether it was "good." None of this means session-level measurement is hopeless. It means session-level judges have to be scoped to one property that genuinely lives at that altitude: "did the agent re-ask for information the user already provided?" is session-level and passes all four gates.
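A scoped session judge can be as small as one binary question over the transcript. The sketch below assumes a hypothetical `call_model` function that takes a prompt string and returns the model's text; the prompt wording and the YES/NO protocol are illustrative choices, and a real judge would still need calibration against human labels.

```python
from typing import Callable

REASK_PROMPT = """You are grading one property of a support conversation.
Question: did the agent ask for information the user had already provided
earlier in the conversation?
Answer with exactly one word: YES or NO.

Conversation:
{transcript}
"""


def format_transcript(turns: list[tuple[str, str]]) -> str:
    """Render (role, text) pairs as one 'role: text' line per turn."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


def reasked_known_info(
    turns: list[tuple[str, str]],
    call_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> bool:
    """Ask the judge one yes/no question and parse the verdict strictly."""
    reply = call_model(REASK_PROMPT.format(transcript=format_transcript(turns)))
    verdict = reply.strip().upper()
    if verdict not in {"YES", "NO"}:
        raise ValueError(f"unparseable verdict: {reply!r}")
    return verdict == "YES"


turns = [
    ("user", "My order number is 8841 and I was charged twice."),
    ("agent", "Sorry about that. What is your order number?"),
]
# Stub standing in for a real model call, so the sketch runs offline.
print(reasked_known_info(turns, call_model=lambda prompt: "yes\n"))  # True
```

Rejecting anything other than YES or NO keeps the judge falsifiable: a rambling reply surfaces as an error instead of being silently coerced into a verdict.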
Fine grains have the opposite profile. Span and event checks are falsifiable and specific almost by construction: the SQL executed or it didn't. Their gate to watch is Tractable in reverse: checks this cheap are tempting to build by the dozen, and fifty event checks nobody triages is its own failure mode.
The pattern that works: a small number of carefully scoped coarse judges that track what users feel, decomposed into cheap fine-grained checks that tell you what to fix. The coarse layer finds problems; the fine layer assigns them.
If you remember nothing else
- 01 Four grains, one interaction: sessions hold traces, traces hold spans, spans hold events. Pick the altitude before writing the rubric.
- 02 Coarse grains measure what users feel but resist action; fine grains are instantly fixable but individually low-stakes.
- 03 Spans and events are usually checkable with free deterministic code; traces and sessions usually need a calibrated LLM judge.
- 04 Measure at the smallest grain where the failure is fully visible: coarser is wasteful, finer is blind.
- 05 A session-level "quality" judge fails Specific by default. Scope coarse judges to one property that genuinely lives at that altitude.
- 06 Real suites run judges at several grains: coarse judges find problems, fine judges assign them.