Section 10 · How do I do this at scale?

Eval Infrastructure

Evals are larger than the sum of their parts

Standardize Evals

Building out evals requires a basic understanding of the levers that can affect your agent. To actually measure what you want to measure, everything else in your environment must be held constant. That is easy when you are running a simple eval script and hard when you are working with a complex Agent-Eval Harness-Environment interface. Simplifying the way your team interacts with the suite reduces this complexity and lowers the chance that unintended variations in the eval suite go undetected.

To borrow from OOP and experimental design principles, every instance of your offline evals should be of class Eval and should take in your Judge (or judges), your Agent (the version of the harness you wish to test), and a Dataset as key properties. Everything else should remain consistent across your instances.
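A minimal sketch of that idea in Python. Every name here (Example, Eval, run, the toy agent and judge) is hypothetical and only illustrates the shape: the three properties are the only things allowed to vary between instances, and judges are assumed to return a score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Example:
    input: str
    reference: Optional[str] = None  # golden output, when one exists


# Hypothetical interfaces: an agent maps an input to an output,
# a judge maps (example, output) to a score in [0, 1].
Agent = Callable[[str], str]
Judge = Callable[[Example, str], float]


@dataclass(frozen=True)
class Eval:
    """One offline eval instance. Only these three properties vary."""
    agent: Agent
    judges: Mapping[str, Judge]
    dataset: Sequence[Example]

    def run(self) -> dict:
        if not self.dataset:
            raise ValueError("dataset is empty")
        # Run the agent once per example, then score each output with every judge.
        outputs = [self.agent(ex.input) for ex in self.dataset]
        return {
            name: sum(judge(ex, out) for ex, out in zip(self.dataset, outputs))
            / len(self.dataset)
            for name, judge in self.judges.items()
        }


# Toy usage: a lookup-table "agent" and an exact-match judge.
dataset = [Example("2+2", "4"), Example("capital of France", "Paris")]
toy_agent = lambda text: {"2+2": "4"}.get(text, "unknown")
exact_match = lambda ex, out: float(out == ex.reference)

scores = Eval(agent=toy_agent, judges={"exact_match": exact_match}, dataset=dataset).run()
```

Freezing the dataclass is a small design choice that makes it harder to swap a judge or dataset halfway through an experiment without creating a new, visible Eval instance.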

As you think about online and offline infrastructure, you should take time to think about the types of experimentation that can be run on either one. Online experimentation typically follows standard A/B testing formats where some proportion of your user base is exposed to the treatment group (changes to your agent) while the rest of your user base experiences the old version of the agent. Offline experimentation allows you to take control of more aspects of the experiment, but comes at the cost of differing from how people are using your agent in the real world. There are hybrids such as exposing modified versions of your agent to "shadow traffic" where you pass the same inputs as your production traffic to a hidden version of your agent, but those are typically more advanced alternatives.
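For the online A/B case, one common way to split users is deterministic hashing, so a given user sees the same arm on every request. The sketch below assumes string user IDs; the function name assign_arm and the experiment key are made up for illustration.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, treatment_share: float) -> str:
    """Deterministically bucket a user into 'treatment' or 'control'."""
    # Hash the experiment name together with the user ID so that different
    # experiments split the user base independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Expose roughly 10% of users to the modified agent (hypothetical experiment name).
arm = assign_arm("user-42", "new-planner-prompt", treatment_share=0.10)
```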

Datasets

There are many ways to harvest or create datasets for your evaluation suite. At its simplest, your production data should give you ample material for the initial datasets in your harness.

Initial datasets should be pulled directly from production or test runs of your agent.
"Golden Set" datasets are composed of curated input-output pairs that correspond to the behavior you want to see in your product. Use these for capabilities judges.
When you move to online evals, you will return to production datasets. These typically don't have a reference output, so you must rely on well-designed rubrics or other types of graders that don't require human intervention.

Building out golden paths or input-output pairs that you know are correct or incorrect can help you calibrate your judges as you build up confidence in their rubrics, though this is time-consuming. Eventually your eval suite and agent will mature, and you can begin to experiment with other dataset harvesting methods or even synthetic data creation.
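As a rough illustration of calibrating a judge against pairs you have already labeled, the sketch below compares a judge's pass/fail verdicts to human labels and reports agreement, precision, and recall. The function name and the sample verdicts are hypothetical, and a binary pass/fail judge is an assumption.

```python
from typing import Dict, List


def calibration_report(judge_verdicts: List[bool], human_labels: List[bool]) -> Dict[str, float]:
    """Compare judge pass/fail verdicts against human labels on a golden set."""
    if len(judge_verdicts) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(judge_verdicts, human_labels))
    tp = sum(j and h for j, h in pairs)          # judge passes, human passes
    fp = sum(j and not h for j, h in pairs)      # judge passes, human fails
    fn = sum(not j and h for j, h in pairs)      # judge fails, human passes
    tn = sum(not j and not h for j, h in pairs)  # both fail
    return {
        "agreement": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up verdicts on five golden examples.
report = calibration_report(
    judge_verdicts=[True, True, False, False, True],
    human_labels=[True, False, False, True, True],
)
```

Tracking these numbers each time you edit a rubric gives you a concrete signal for when a judge is trustworthy enough to run without a human in the loop.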

Image: Bernie Sanders with the caption "I am once again asking for you to look at the data". Taking a page from Husain, your best bet for identifying what should be in your data is to look at the data.

Offline Infra

Eval infrastructure is typically split into two stacks: offline, where your evals run at predetermined times inside sandboxes and against pre-constructed datasets, and online, where your evals run against production traffic. The two stacks ideally work in harmony: the regressions your online stack catches become the dataset you hill-climb against, and your offline experiments determine which agent modifications get promoted to production.

Your eval harness setup is the key to building good offline infrastructure for your evals. The eval harness (as described in Eval Design) handles the orchestration of offline evals at scale. A good eval harness should handle rate limiting, environment configs, automatic retries, and result/report generation. The primary goal of offline evaluation is to probe the agent's behavior in a controlled setting ahead of any impact on production.
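To make one of those harness responsibilities concrete, here is a small sketch of automatic retries with exponential backoff. The helper name run_with_retries and its parameters are hypothetical; a real harness would likely retry only on specific transient errors (timeouts, rate-limit responses) rather than on every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call task(), retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure in the report
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Backing off between attempts also doubles as a crude form of rate limiting when the failure is a provider throttling you.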

Scaffolding your offline eval stack requires you to be aware of the difference between your agent harness, your eval harness, and your environment. The agent harness is all the code that makes the agent work (think about the prompts, the tools, and the loop or whatever mechanism you employ to get repeated calls to the model underneath). It changes frequently, so treat it as an input to your eval harness, not as part of the eval harness itself. Keeping the agent harness that you use in evals in sync with the one operating in production is a problem worth keeping in mind.

The state and sandbox that your agent operates in should stay clean from run to run. The state is everything that the agent can interact with and manipulate. It typically contains seeded databases, sandboxed external services, and any sort of mocked information that your agent would run into in production. Controlling the state of the environment will help prevent you from measuring changes in the sandbox as improvements or degradations in your agent.
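One simple way to keep state clean from run to run is to hand every run its own copy of the seed data. The sketch below assumes the state fits in an in-memory dictionary; SEED_STATE and fresh_environment are hypothetical names, and a real sandbox would more likely reset a database or container.

```python
import copy
from contextlib import contextmanager

# Hypothetical seed data standing in for a seeded database.
SEED_STATE = {"reservations": [{"id": 1, "guest": "Ada"}]}


@contextmanager
def fresh_environment(seed: dict):
    """Yield a private deep copy of the seed state and discard it afterwards."""
    state = copy.deepcopy(seed)  # the agent can mutate this freely
    try:
        yield state
    finally:
        state.clear()  # nothing leaks into the next run


# Each run starts from the same seed, whatever the previous run did.
with fresh_environment(SEED_STATE) as env:
    env["reservations"].append({"id": 2, "guest": "Grace"})
```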

Judge breakdown:

Capabilities judges are typically run in an offline setting with predetermined input-reference output pairs meant to simulate what an ideal high-capability agent should have done. These are most similar to many of the benchmarks that you see online, where the answers have already been pre-computed.
Keeping a stable of regression judges in your offline eval suite is exceptionally useful for catching problems before they hit production. One method for this is running your modified agent through your eval with one capability judge, your typical stable of regression judges, and a curated input dataset meant to represent typical user behavior.
Regression judges should be calibrated in an offline setting before being deployed online.

Instead of offering guidance on how to schedule and when to batch or parallelize, I offer this: What will you do with the information that you gain from running your offline evals again? If your backlog is already full, then the answer is "not much", in which case you should leave more time between runs. If your plan is to use Eval-Driven Development, then you should be running evals as frequently as needed to measure your progress. Scheduling your evals does not have a one-size-fits-all approach, so take the time to think through how you will use the information.

Online Infra

While most product metrics are lagging indicators that can take days to weeks to reflect the impact of changes to your agent's behavior, online evals allow you to map out what your agent is doing as it is doing it. This can take you from investigations like "Why has retention dropped over the last month?" to actionable intel like "Our reservations subagent is failing". The primary goal of online evaluation is to understand the behavior of your agents on live production traffic. Some key use cases:

Catch regressions as they begin impacting your users instead of after
Monitor experimental versions of your agent
Guardrails that protect against accidental data leakage, prompt injections, or egregious hallucinations

Guardrails & Online Steering

There are two types of guardrails worth building into your agent: Blocking guardrails, which end the turn if they activate, and Steering guardrails, which subtly intervene to get your agent back on track.

Some examples of Blocking guardrails:

Input Guardrails
Inline Guardrails
Output Guardrails

Some examples of Steering guardrails:

Classifier Guardrails (sentiment, intent, etc)
Routing Guardrails
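The difference between the two types can be sketched in a few lines of Python. Everything here is hypothetical: the keyword checks stand in for real classifiers, and the "allow" / "block" / "steer" actions are just one way to model the distinction.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GuardrailResult:
    action: str                    # "allow", "block", or "steer"
    message: Optional[str] = None  # refusal text or steering note


Guardrail = Callable[[str], GuardrailResult]


def injection_guardrail(user_message: str) -> GuardrailResult:
    # Blocking input guardrail; the keyword match stands in for a real classifier.
    if "ignore previous instructions" in user_message.lower():
        return GuardrailResult("block", "Sorry, I can't help with that request.")
    return GuardrailResult("allow")


def sentiment_guardrail(user_message: str) -> GuardrailResult:
    # Steering classifier guardrail: nudges the agent instead of ending the turn.
    if any(word in user_message.lower() for word in ("frustrated", "angry")):
        return GuardrailResult("steer", "User seems upset: acknowledge it and offer a human handoff.")
    return GuardrailResult("allow")


def apply_guardrails(user_message: str, guardrails: List[Guardrail]) -> Tuple[bool, List[str]]:
    """Return (blocked, messages). A block ends the turn; steering notes accumulate."""
    steering_notes: List[str] = []
    for guardrail in guardrails:
        result = guardrail(user_message)
        if result.action == "block":
            return True, [result.message or ""]
        if result.action == "steer" and result.message:
            steering_notes.append(result.message)
    return False, steering_notes
```

When the turn is not blocked, the steering notes would typically be appended to the agent's context for that turn, which is what keeps the intervention subtle.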

Eval Economics

Most non-code-based judges cost money to run, and the costs scale quickly with context length and with the number of judges you run on every trace. While a Haiku judge at $1 per million tokens sounds cheap, an average cost of $0.20 per trace across 5 judges at 100K traces per day comes to $20,000 per day, which is a steep bill by the end of the month. Some tricks to keep costs down:

Sample: Don't run your judges on every trace or session that comes through. Take a subset of your total traffic and run the judges on that.
Shift to ad hoc or offline evals: Not all your judges need to be run all the time. Capabilities judges only need running when something changes that could affect your agent's capabilities.
Shift models: Not all evals need the frontier level of intelligence to do their job. As part of your calibrations, you will have experimented with different models. It is OK to sacrifice a point or two in precision if it saves you thousands of dollars per week. (Also check out open-source models; they are surprisingly good judges.)
Use cheaper alternatives: Not everything needs an LLM-as-a-judge. I know the new way is LLMs for everything but sometimes it is cheaper to just implement a code-based heuristic.
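To see how much sampling moves the bill, here is a back-of-the-envelope sketch using the figures above. It assumes the $0.20 per trace is the combined cost of all 5 judges and a 30-day month; the function names and the 5% sample rate are illustrative only.

```python
import random


def monthly_judge_cost(
    cost_per_trace: float,
    traces_per_day: int,
    sample_rate: float = 1.0,
    days: int = 30,
) -> float:
    """Estimated monthly spend when judging a fraction of daily traces."""
    return cost_per_trace * traces_per_day * sample_rate * days


def should_judge(rng: random.Random, sample_rate: float) -> bool:
    """Randomly pick roughly sample_rate of traces for judging."""
    return rng.random() < sample_rate


full_cost = monthly_judge_cost(0.20, 100_000)                       # about $600,000
sampled_cost = monthly_judge_cost(0.20, 100_000, sample_rate=0.05)  # about $30,000

# Decide per trace whether to send it to the judges at all.
rng = random.Random(0)
judged = sum(should_judge(rng, 0.05) for _ in range(10_000))
```

Uniform random sampling is the simplest option. If rare failure modes matter more to you than averages, you may want to oversample flagged or unusual traces instead.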

'Coding in 2022 vs 2027' meme: a one-line string check replaced by a call to the Anthropic SDK just to decide whether the first letter is capitalized.
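In the spirit of the meme, a code-based check like the one below costs nothing per trace. The function name is hypothetical, and ignoring leading whitespace is an assumption about what "first letter" should mean.

```python
def starts_with_capital(text: str) -> bool:
    """Code-based judge: is the first non-whitespace character uppercase?"""
    stripped = text.lstrip()
    return bool(stripped) and stripped[0].isupper()
```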