Measurement Grains · FAST Evals

The four grains

Judges typically evaluate an agent's actions at one of four layers of aggregation, or grains. These layers are:

Sessions: The entirety of a multi-turn conversation between a user and the agent. A typical session consists of multiple user and agent turns.
Traces: The transcript of a single user turn, also called a trajectory. A trace may contain multiple agent steps, and typically covers everything from the moment a user submits a request until the agent returns its response.
Spans: Each step within a trace. A typical span could be a single tool call within a series of tool calls in the trace.
Events: Each action taken within a span. For example, the string emitted by the agent when calling a tool is an event within that span.

These nest within each other like so:

Nesting diagram: a session contains traces, each trace contains spans, and each span contains events, shown on one research-agent run.

Session (a multi-turn conversation): sess_4f1a
    Trace (one user turn, a trajectory): trace_9c2e, "What's the refund window on annual plans?"
        Span (one step within the trace): search(knowledge_base)
            Events: query → "refund window annual plan"; 5 chunks returned
        Span (the next step): read · rank
            Events: read top-ranked chunks; select source chunk
        ⋯ plan and synthesize spans, each with their own events
    Trace (the next user turn): its own spans and events

Each grain contains all of the smaller grains.
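One way to picture the nesting in code is as four small container types. The sketch below is only an illustration: the class and field names (`Session`, `Trace`, `Span`, `Event`, `payload`, `iter_events`) are assumptions made for this example, not a schema defined by any particular tracing library. The sample values are taken from the diagram above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Event:
    """One action taken within a span (hypothetical shape)."""
    payload: str


@dataclass
class Span:
    """One step within a trace, such as a single tool call."""
    name: str
    events: List[Event] = field(default_factory=list)


@dataclass
class Trace:
    """Everything from one user request to the agent's response."""
    trace_id: str
    user_message: str
    spans: List[Span] = field(default_factory=list)


@dataclass
class Session:
    """A whole multi-turn conversation."""
    session_id: str
    traces: List[Trace] = field(default_factory=list)

    def iter_events(self) -> Iterator[Event]:
        """Walk every event in the session, in order."""
        for trace in self.traces:
            for span in trace.spans:
                yield from span.events


# The run from the diagram, written out with the illustrative types.
session = Session(
    session_id="sess_4f1a",
    traces=[
        Trace(
            trace_id="trace_9c2e",
            user_message="What's the refund window on annual plans?",
            spans=[
                Span("search(knowledge_base)", [
                    Event('query → "refund window annual plan"'),
                    Event("5 chunks returned"),
                ]),
                Span("read · rank", [
                    Event("read top-ranked chunks"),
                    Event("select source chunk"),
                ]),
            ],
        )
    ],
)
```

Because each grain holds the smaller ones, a judge written for any grain can be handed the matching object and still reach everything beneath it.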

The choice of grain greatly depends on the type of problem you are dealing with. Answer the question "Where does that failure mode actually live?" and you will find the right grain to measure.

As a rule of thumb, use the smallest grain that still captures the problem you encountered: start at the event level and work your way up until the whole problem is visible. Flags from a finer-grained judge point closer to the step that went wrong, which makes them more actionable. You can also build additional judges at other grains to map the problem's full surface area.
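The "start small and work up" rule can be sketched as a tiny helper. It assumes you have already located the pieces of evidence for one failure and can describe each by a hypothetical `(trace index, span index, event index)` tuple; the function name and this representation are illustrative, not part of any tool.

```python
from typing import List, Tuple

# (trace index, span index, event index) of one piece of evidence.
# This tuple layout is an assumption made for the example.
Location = Tuple[int, int, int]


def smallest_covering_grain(evidence: List[Location]) -> str:
    """Return the finest grain that contains every piece of evidence."""
    if not evidence:
        raise ValueError("need at least one evidence location")

    traces = {t for t, _, _ in evidence}
    spans = {(t, s) for t, s, _ in evidence}
    events = set(evidence)

    # Work upward from the event level until everything fits in one unit.
    if len(events) == 1:
        return "event"
    if len(spans) == 1:
        return "span"
    if len(traces) == 1:
        return "trace"
    return "session"
```

For example, a failure whose evidence sits in two different spans of the same trace yields `"trace"`, so a span-level judge alone would not see the whole problem.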

The Value vs. Actionability Tradeoff

A key characteristic to be aware of when choosing the grain for your judge is how valuable the judge's results will be. Judges at coarser grains are higher value because they directly measure things the user cares about. However, in complex systems, their results may not be directly actionable, since they are so far removed from the individual steps that make up a response.

Moving to finer grains makes judge results more actionable, but at the cost of knowing less about how much a flagged issue actually affects the user.

Cost & Complexity Tradeoff

Another characteristic to consider when choosing the grain of your judge is the typical cost and complexity of that grain. Coarser-grained judges tend to cost more to run (judging a session requires sending far more context than judging a span) and are substantially more complex to wire up. However, their results are much simpler to interpret in terms of things the end user cares about. There is no one right way to determine your mix of grains, but it is worth keeping this tradeoff in mind when optimizing your budget.
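To get a feel for how context grows with the grain, here is a rough sketch that sums payload sizes at each level. It assumes a session is represented as nested lists of event strings and uses character count as a crude stand-in for tokens; both choices, and the function names, are assumptions made for illustration only.

```python
from typing import List

# session -> traces -> spans -> event payload strings (illustrative layout).
SpanEvents = List[str]
TraceSpans = List[SpanEvents]
SessionTraces = List[TraceSpans]


def span_context_chars(span: SpanEvents) -> int:
    """Characters a judge would read to grade one span."""
    return sum(len(event) for event in span)


def trace_context_chars(trace: TraceSpans) -> int:
    """Characters a judge would read to grade one trace."""
    return sum(span_context_chars(span) for span in trace)


def session_context_chars(session: SessionTraces) -> int:
    """Characters a judge would read to grade the whole session."""
    return sum(trace_context_chars(trace) for trace in session)
```

Since every grain contains the smaller ones, the session total can never be smaller than that of any trace or span inside it.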

Remember that anything that requires a model call also requires human calibration, which adds coordination costs to the mix.

Here is a rough heuristic for the cost of different grains:

Grain	Example check	Typical grader	Cost
Event	Does this SQL execute?	Deterministic code	Near zero
Span	Is the tool call format valid?	Deterministic code	Near zero
Trace	Did the agent resolve the request?	LLM-as-a-judge	A model call
Session	Did the conversation stay coherent across turns?	LLM-as-a-judge, long context	Several model calls
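The two deterministic rows of the table can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the SQL check runs against a throwaway in-memory SQLite database rather than your real one, and the tool-call format (a JSON object with a string `name` and an object `arguments`) is hypothetical. Substitute your agent's actual dialect and schema.

```python
import json
import sqlite3


def sql_executes(sql: str, schema: str = "") -> bool:
    """Event-level check: does this SQL run against a scratch database?"""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)  # set up any tables the query expects
        conn.execute(sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def tool_call_format_valid(raw: str) -> bool:
    """Span-level check: is the tool call well-formed JSON of the assumed shape?"""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )
```

Neither check involves a model call, which is why graders like these sit in the near-zero cost rows.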