Observability · FAST Evals

Imagine your agent operating in a 2D plane. Deterministic software follows the same path to its output every time you give it the same input. An agent can generate a new path every time it is called, even with the same input.

For your evaluation suite to be successful, you need to log as much of your agent's surface area as possible.

Deterministic path vs. three agent runs — All 4 runs resulted in a visible output. However, the paths that the agent took to get there varied greatly from run to run.

Capture the entire causal chain

When building out your evals, it is best to start by improving what you can record, rather than deciding how you will judge it. Like many of the other value-creating parts of evals, putting the right logs in the right place is more art than science. At a minimum, you should attempt to capture the inputs, the outputs, and any metadata that you think materially affects your agent's behavior or its outputs.

Log Everything

Logging everything is exceptionally useful as an axiom and exceptionally useless in practice. Concretely, you should picture your agent and its behavior as an onion.

Each layer builds upon the last and offers more resolution at the cost of increased complexity.

At the outermost layer are your session logs: What are all the inputs that the agent began its life with? What were the user messages, the system prompt, and the tools available to the agent? What was the final output? What was the raw output? Did it deliver any artifacts as well? Were there follow-up interactions from the user?
With each successive layer, dig deeper into the granularity of your agent's behavior: What were the tools called? How were they called? Did any fail? How so? Were there intermediate states? What were the reasoning traces?
At its core, look into the metadata: What model designation were you using? What API? What version of the harness? What were all of your configurable parameters? What information would you need to reproduce the agent's trace as best as you can?
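
One way to picture the three layers is as a nested record. The following is a minimal sketch, not a prescribed schema: the class names (SessionLog, ToolCall, RunMetadata), every field name, and the sample values are hypothetical and chosen only to mirror the questions above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """Middle layer: one tool invocation and how it went (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]
    result: Any = None
    error: Optional[str] = None  # None means the call did not fail


@dataclass
class RunMetadata:
    """Core layer: what you would need to attempt to reproduce the trace."""
    model: str
    api: str
    harness_version: str
    parameters: dict[str, Any] = field(default_factory=dict)


@dataclass
class SessionLog:
    """Outer layer: what the agent started with and what it delivered."""
    system_prompt: str
    user_messages: list[str]
    available_tools: list[str]
    final_output: str
    raw_output: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    reasoning_traces: list[str] = field(default_factory=list)
    metadata: Optional[RunMetadata] = None


# Placeholder values for illustration only.
log = SessionLog(
    system_prompt="You are a helpful assistant.",
    user_messages=["Summarize the attached report."],
    available_tools=["read_file", "search"],
    final_output="Here is the summary.",
    raw_output='{"text": "Here is the summary."}',
    tool_calls=[
        ToolCall(name="read_file", arguments={"path": "report.txt"}, result="contents"),
        ToolCall(name="search", arguments={"query": "report"}, error="timeout"),
    ],
    reasoning_traces=["Read the file first, then summarize."],
    metadata=RunMetadata(
        model="example-model",
        api="example-api",
        harness_version="0.1.0",
        parameters={"temperature": 0.7},
    ),
)

# asdict() recurses into the nested dataclasses, so the whole onion
# serializes to one JSON document.
record = asdict(log)
print(json.dumps(record, indent=2))
```

Keeping the layers as separate types lets you start with only the outer layer and add the inner ones as your logging matures.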

Stitching all of this together allows you to paint a better picture of what actually happened when your agent was called. Typically this is done with a "session_id" that has one or more "trace_id"s underneath it.
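
As a rough illustration of that stitching, the sketch below tags each logged event with both identifiers and then regroups a flat event stream into sessions and traces. It assumes that one session can contain several traces and that each trace is a list of events. The helper names (new_id, make_event, group_by_session) and the event fields other than "session_id" and "trace_id" are hypothetical.

```python
import uuid
from collections import defaultdict
from typing import Any


def new_id() -> str:
    """Return a random hex identifier (uuid4 is one common choice)."""
    return uuid.uuid4().hex


def make_event(session_id: str, trace_id: str, kind: str, payload: Any) -> dict[str, Any]:
    """Build one log event that carries both identifiers."""
    return {
        "session_id": session_id,
        "trace_id": trace_id,
        "kind": kind,
        "payload": payload,
    }


def group_by_session(events: list[dict[str, Any]]) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Regroup a flat event list as session_id -> trace_id -> events, keeping order."""
    sessions: dict[str, dict[str, list[dict[str, Any]]]] = defaultdict(lambda: defaultdict(list))
    for event in events:
        sessions[event["session_id"]][event["trace_id"]].append(event)
    return sessions


# One session with two traces, for example two separate agent calls
# made during the same user conversation.
session_id = new_id()
first_trace, second_trace = new_id(), new_id()

events = [
    make_event(session_id, first_trace, "user_message", "Summarize the report."),
    make_event(session_id, first_trace, "tool_call", {"name": "read_file"}),
    make_event(session_id, first_trace, "final_output", "Here is the summary."),
    make_event(session_id, second_trace, "user_message", "Make it shorter."),
    make_event(session_id, second_trace, "final_output", "Shorter summary."),
]

sessions = group_by_session(events)
for trace_id, trace_events in sessions[session_id].items():
    print(trace_id, [event["kind"] for event in trace_events])
```

Because every event carries both identifiers, events can be written to a flat log in any order and regrouped later.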