Section 10 · How do I do this at scale?

Eval Infrastructure

Datasets, sandboxes, and scheduling: turning one-off checks into a system.

Class Eval: standardize the interface

Treat your suite as many instances of Class Eval, not a pile of one-off scripts. Every instance is constructed from the same three arguments: a judge (or several), a dataset, and an agent harness. Run it, and it reports in one standard shape: pass rates, failures, and links to the traces behind each verdict.

The point of the framing is the interface. Your first eval is expensive no matter what: you build dataset loading, episode execution, verdict collection, and reporting from scratch. The real question is whether your tenth eval is also expensive. If every eval is bespoke (its own runner script, its own output format, its own way of finding traces), the marginal cost never drops, and your suite stops growing exactly when your product needs it to grow fastest.

Standardize the constructor and everything downstream gets cheap. A new eval becomes a config change: write the judge, point it at a dataset, reuse the harness. Datasets become swappable. The same judge can run against your golden set nightly and against sampled production traffic continuously. And because every eval reports in the same shape, results are comparable across the suite: a regression in the SQL checker reads the same as a regression in the tone judge.
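A minimal sketch of that interface in Python, assuming nothing about your actual stack: the names Eval, Report, Case, and Trace, and the fields case["id"] and trace["trace_id"], are all hypothetical. It only shows the shape: judges, a dataset, and a harness go in; pass rates and failures with trace links come out.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration: a case and a trace are plain dicts.
Case = dict
Trace = dict
Judge = Callable[[Case, Trace], bool]   # True means the episode passed
Harness = Callable[[Case], Trace]       # runs the real agent on one case


@dataclass
class Report:
    """The one standard shape every eval reports in."""
    pass_rates: dict[str, float]
    failures: list[dict] = field(default_factory=list)


@dataclass
class Eval:
    judges: dict[str, Judge]
    dataset: list[Case]
    harness: Harness

    def run(self) -> Report:
        passed = {name: 0 for name in self.judges}
        failures = []
        for case in self.dataset:
            trace = self.harness(case)  # one episode
            for name, judge in self.judges.items():
                if judge(case, trace):
                    passed[name] += 1
                else:
                    # Keep the trace id so every failed verdict links to its trace.
                    failures.append({
                        "judge": name,
                        "case_id": case["id"],
                        "trace_id": trace["trace_id"],
                    })
        total = len(self.dataset)
        rates = {name: (n / total if total else 0.0) for name, n in passed.items()}
        return Report(pass_rates=rates, failures=failures)


# Toy usage: a stand-in harness and judge, just to show the shape.
def toy_harness(case: Case) -> Trace:
    return {"trace_id": f"trace-{case['id']}", "output": case["input"].upper()}


def has_uppercase_output(case: Case, trace: Trace) -> bool:
    return trace["output"].isupper()


report = Eval(
    judges={"uppercase": has_uppercase_output},
    dataset=[{"id": "a", "input": "book a room"}, {"id": "b", "input": "123"}],
    harness=toy_harness,
).run()
```

Swapping the dataset or adding a judge changes only the constructor arguments; run() and Report stay the same, which is what keeps the tenth eval cheap.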

This is the same bet the RL community is making with standardized environments: Prime Intellect's Environments Hub treats evals and training environments as one interface precisely so they compose. You don't need their infrastructure, but you do want their discipline.

Golden and synthetic datasets

Every instance of Class Eval needs a dataset, and datasets come in two trust tiers.

Golden datasets are small, hand-verified, and expensive per row. Every case has been checked by a human who can defend it. This is the bar for correctness: golden sets calibrate your judges, gate your releases, and settle arguments. Because human verification is slow, they stay small (tens to low hundreds of cases), and that's fine. Their job is to be right, not big.

Synthetic datasets are how you scale past what humans can hand-verify, and they come in two flavors with very different risk profiles:

  • Creation generates cases from scratch: prompt a model to write 200 hotel-booking requests with conflicting constraints. Fast and unlimited, but riskier: the generator invents users who don't exist, asking questions nobody asks, in phrasing nobody uses.
  • Expansion mutates real traces into variants: take an actual booking request and change the dates, swap the city, add a typo, tighten a constraint. It stays anchored to the real distribution, which makes it the safer default.

Either way, synthetic data needs spot-checked human review on every batch. Skip it and the dataset quietly drifts off-distribution, and from then on your pass rates measure performance on an imaginary product. The numbers will look stable. They'll just be about nothing.
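Expansion plus a spot-check can be sketched in a few lines. This is an illustration under assumptions, not a recommended generator: the case fields (id, city, check_in), the city list, and the 10% review rate are all made up. The idea it shows is that every variant keeps a pointer to the real trace it came from, and every batch yields a sample for a human to read.

```python
import random
from datetime import date, timedelta

# Hypothetical mutation pool; in practice you would draw from real traffic.
CITIES = ["Lisbon", "Osaka", "Denver"]


def expand(seed_case: dict, n: int, rng: random.Random) -> list[dict]:
    """Mutate one real case into n variants: shift the date, swap the city."""
    base_date = date.fromisoformat(seed_case["check_in"])
    variants = []
    for i in range(n):
        shifted = base_date + timedelta(days=rng.randint(1, 60))
        variants.append({
            **seed_case,
            "id": f"{seed_case['id']}-v{i}",
            "city": rng.choice(CITIES),
            "check_in": shifted.isoformat(),
            "source": "expansion",
            "parent_id": seed_case["id"],  # anchor back to the real trace
        })
    return variants


def spot_check_sample(batch: list[dict], rate: float, rng: random.Random) -> list[dict]:
    """Pick the rows a human reviews before the batch joins the dataset."""
    k = max(1, round(len(batch) * rate))
    return rng.sample(batch, min(k, len(batch)))


rng = random.Random(0)  # fixed seed so the batch is reproducible
seed = {"id": "req-17", "city": "Paris", "check_in": "2025-03-01", "text": "2 nights, quiet room"}
batch = expand(seed, n=20, rng=rng)
for_review = spot_check_sample(batch, rate=0.10, rng=rng)
```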

The offline and online eval stacks

Eval infrastructure splits into two stacks that answer different questions.

The offline stack runs before users see anything: episodes executed against datasets, inside sandboxes, on demand or on a schedule. It answers "did this change make the agent better on the cases we know about?" It's controlled and repeatable: the same input produces a comparable run tomorrow, which is exactly what you need for hill climbing and for catching regressions in CI.

The online stack runs where your product runs: signal mined from live traffic, real users, and real consequences. It answers "is the product actually working?" It's noisy and slow but unfakeable: no synthetic dataset argues back the way a user abandoning a session does.

Neither stack is sufficient alone. Offline-only teams overfit to their datasets and ship confidently into failure modes nobody wrote a case for. Online-only teams use their users as the QA department and find out about regressions from churn curves. The two are supposed to feed each other: every interesting production failure becomes an offline dataset case, and every offline judge worth its keep eventually gets attached to sampled live traffic. The next two sections walk through each stack's parts.

Figure: The offline and online eval stacks. How signal flows between pre-ship eval runs and production traffic: failures feed datasets, judges feed observability.

Inside the offline stack

The offline stack is everything that lets you run episodes before users see them. Four pieces matter.

Scaffolding. Your agent harness is the code that runs your agent in production: prompts, tools, the loop. Your eval harness is the code that runs episodes against a dataset and collects verdicts. Keep them separate, but make the eval harness invoke the real agent harness rather than re-implementing it. The re-implementation foot-gun is everywhere: an eval runner with its own copy of the system prompt, frozen from three weeks ago, quietly measuring an agent you no longer ship. When the harnesses drift, every number your suite produces is about the wrong agent.

State and sandboxes. Agents mutate things: they write rows, send emails, file tickets. Every episode needs a fresh, hermetic world: a seeded database, mocked or sandboxed external services, a throwaway filesystem. Without isolation, episode 14 fails because episode 13 left a reservation in the table, and you burn an afternoon debugging a bug that doesn't exist.
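One way to get a fresh world per episode, sketched with Python's built-in sqlite3 module and an in-memory database. The schema, the seed row, and the run_episode stand-in are hypothetical; a real setup would also sandbox email, ticketing, and the filesystem.

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical seed state: every episode starts from exactly this.
SEED_SQL = """
CREATE TABLE reservations (id INTEGER PRIMARY KEY, guest TEXT, city TEXT);
INSERT INTO reservations (guest, city) VALUES ('seed-guest', 'Lisbon');
"""


@contextmanager
def fresh_world():
    """Yield a newly seeded in-memory database, discarded when the episode ends."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SEED_SQL)
        yield conn
    finally:
        conn.close()


def run_episode(guest: str) -> int:
    """Stand-in for an agent episode that writes a reservation row."""
    with fresh_world() as db:
        db.execute(
            "INSERT INTO reservations (guest, city) VALUES (?, ?)", (guest, "Osaka")
        )
        return db.execute("SELECT COUNT(*) FROM reservations").fetchone()[0]


# Each episode sees the seed row plus its own write, never a neighbor's.
count_13 = run_episode("episode-13")
count_14 = run_episode("episode-14")
```

Both counts come out as 2: episode 14 never sees the row episode 13 wrote, because the database it wrote to no longer exists.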

Distribution. Episodes are independent, so fan them out in parallel. A suite that takes four hours gets run weekly; a suite that takes ten minutes gets run on every change. Parallelism is the cheapest eval-quality investment you'll make.
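A minimal fan-out sketch using the standard library's ThreadPoolExecutor. It assumes episodes are I/O-bound (mostly waiting on model and tool calls), which is why threads are used here; run_suite, fake_episode, and the worker count are illustrative names and values.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_suite(cases: list, run_episode: Callable, max_workers: int = 8) -> list:
    """Run independent episodes in parallel; results come back in case order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_episode, cases))


def fake_episode(case: dict) -> dict:
    """Stand-in for a real episode: sleep to simulate waiting on a model call."""
    time.sleep(0.05)
    return {"case_id": case["id"], "passed": True}


cases = [{"id": i} for i in range(16)]
results = run_suite(cases, fake_episode, max_workers=8)
```

This only works if episodes really are independent, which is the isolation requirement from the previous point: shared state plus parallelism gives you flaky verdicts instead of fast ones.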

Scheduling. Two cadences cover most teams: a nightly full suite over everything, and a per-PR smoke suite, a fast, high-signal subset that gates merges. The smoke suite catches the regression before it lands; the nightly run catches whatever the smoke suite missed.
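The two cadences can share one dataset if cases carry tags. A small sketch, assuming a hypothetical "smoke" tag and cadence names "nightly" and "pr"; how the function gets called (cron, CI job) is left to your scheduler.

```python
def select_cases(cases: list[dict], cadence: str) -> list[dict]:
    """Pick which cases run for a given cadence."""
    if cadence == "nightly":
        return list(cases)  # the full suite, over everything
    if cadence == "pr":
        # The fast, high-signal subset that gates merges.
        return [c for c in cases if "smoke" in c.get("tags", ())]
    raise ValueError(f"unknown cadence: {cadence!r}")


suite = [
    {"id": "refund-basic", "tags": ["smoke"]},
    {"id": "refund-partial-multi-currency", "tags": []},
    {"id": "booking-conflict", "tags": ["smoke"]},
    {"id": "booking-long-tail"},
]
pr_cases = select_cases(suite, "pr")
nightly_cases = select_cases(suite, "nightly")
```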

Mining production for signal

Offline evals tell you about the cases you thought of. Production tells you about everyone else. Four sources, roughly in ascending order of volume:

A/B testing is the ground-truth eval. Ship the change to a slice of traffic and measure real outcomes: bookings completed, tickets resolved, sessions retained. It's expensive, slow, and needs enough traffic to reach significance, which is exactly why you can't use it for everything. But when offline numbers and A/B results disagree, the A/B is right.

Semantic signals are users grading your agent in plain text: "that's wrong", "no, I said Tuesday", "let me talk to a human". Mine your message logs for these. They're free labels on real failures, written by the people your evals exist to satisfy. Users are sometimes wrong about the facts, but they're never wrong about being frustrated.
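A first pass at mining these can be plain pattern matching. The sketch below uses Python's re module; the two patterns and their label names are illustrative, built from the example phrases above, and a real list would grow out of reading your own logs.

```python
import re

# Hypothetical starter patterns; extend them from what users actually type.
FRUSTRATION_PATTERNS = {
    "correction": re.compile(
        r"\b(that'?s (wrong|not right|incorrect)|no,? i said)\b", re.IGNORECASE
    ),
    "escalation": re.compile(
        r"\b(talk|speak) to (a|an) (human|agent|person)\b", re.IGNORECASE
    ),
}


def label_message(text: str) -> list[str]:
    """Return the names of every frustration pattern the message matches."""
    return [name for name, pattern in FRUSTRATION_PATTERNS.items() if pattern.search(text)]


messages = [
    "that's wrong",
    "No, I said Tuesday",
    "let me talk to a human",
    "Thanks, that works",
]
labels = [label_message(m) for m in messages]
```

Regexes miss paraphrases and sarcasm, so treat matches as a cheap way to surface traces for review rather than as a complete count of unhappy users.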

Action signals beat sentiment. A regenerate click, a session abandoned three steps into a booking, an answer copied and then immediately rewritten. Behavior is honest in a way feedback forms aren't. The users most worth hearing from quietly leave without typing anything.
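Action signals fall out of the event stream you likely already log. A sketch under assumptions: the event type names (regenerate, booking_started, booking_completed) are hypothetical, and a session is represented as a list of event dicts.

```python
def action_signals(events: list[dict]) -> set[str]:
    """Derive behavioral failure signals from one session's events."""
    types = [e["type"] for e in events]
    signals = set()
    if "regenerate" in types:
        signals.add("regenerate")
    # A booking that starts but never completes counts as abandoned.
    if "booking_started" in types and "booking_completed" not in types:
        signals.add("abandoned_booking")
    return signals


abandoned = action_signals([
    {"type": "booking_started"},
    {"type": "regenerate"},
    {"type": "session_end"},
])
healthy = action_signals([
    {"type": "booking_started"},
    {"type": "booking_completed"},
])
```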

Product observability closes the loop: attach eval verdicts to live traffic. Run judges (sampled, see the economics below) over production traces, so that when something fails it arrives in your tracing tool already labeled, sitting next to the exact trace that produced it. Tools like Raindrop are built around this pattern: production traffic treated as a continuously judged dataset.

Eval economics

Judges cost money per trace, and production traces arrive by the million. A judge at $0.002 per trace sounds free until you multiply: five judges across a million daily traces is $10,000 a day. Tractability doesn't stop at build time: the same T in FAST that decided whether a judge was worth building now decides whether you can afford to keep running it.
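The multiplication is worth writing down once so you can rerun it with your own numbers. The function name and the sample_rate parameter are illustrative; the first call uses the figures above, and the 5% sample rate in the second is a made-up value, not a recommendation.

```python
def daily_judge_cost(
    cost_per_trace: float,
    judges: int,
    traces_per_day: int,
    sample_rate: float = 1.0,
) -> float:
    """Dollars per day to run every judge over the sampled share of traffic."""
    return cost_per_trace * judges * traces_per_day * sample_rate


# The example from the text: $0.002 x 5 judges x 1,000,000 traces.
full_cost = daily_judge_cost(0.002, judges=5, traces_per_day=1_000_000)

# Same setup, judging an illustrative 5% of traffic.
sampled_cost = daily_judge_cost(0.002, judges=5, traces_per_day=1_000_000, sample_rate=0.05)

print(f"full: ${full_cost:,.2f}/day, sampled: ${sampled_cost:,.2f}/day")
```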

Three levers keep the bill sane:

  • Sample asymmetrically. Judge 100% of suspected failures (they're rare, and they're the point) and sample successes at a few percent, enough to track the pass rate without paying for every confirmation that things are fine.
  • Gate with deterministic checks. Run cheap code-based graders first: did the SQL execute, did the response contain a confirmation number, did the agent actually call the refund tool. Only traces that clear the cheap tier and still look ambiguous earn an LLM judge's attention.
  • Right-size the judge model. Most judges don't need a frontier model. A small model with a tight rubric, calibrated against your golden set, often matches the big model's agreement rate at a tenth of the cost.
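The first two levers can be sketched as a small routing layer in front of the judge. Everything named here is an assumption for illustration: the suspected_failure flag (set upstream, for example from semantic or action signals), the 3% success sample rate, the is_ambiguous predicate, and the llm_judge callable. Right-sizing the judge model is a calibration exercise against your golden set and is not shown.

```python
import random
from dataclasses import dataclass
from typing import Callable

Trace = dict


@dataclass
class Verdict:
    passed: bool
    tier: str  # which tier decided: a deterministic check or the LLM judge


def should_judge(trace: Trace, rng: random.Random, success_rate: float = 0.03) -> bool:
    """Asymmetric sampling: every suspected failure, a small share of the rest."""
    if trace.get("suspected_failure", False):
        return True
    return rng.random() < success_rate


def grade(
    trace: Trace,
    checks: dict[str, Callable[[Trace], bool]],
    is_ambiguous: Callable[[Trace], bool],
    llm_judge: Callable[[Trace], bool],
) -> Verdict:
    """Cheap code-based checks first; the LLM judge only sees what they can't settle."""
    for name, check in checks.items():
        if not check(trace):
            return Verdict(False, f"deterministic:{name}")
    if not is_ambiguous(trace):
        return Verdict(True, "deterministic")
    return Verdict(llm_judge(trace), "llm")


# Toy usage with a fake judge that records which traces it was charged for.
judged_ids = []


def fake_llm_judge(trace: Trace) -> bool:
    judged_ids.append(trace["id"])
    return True


checks = {"sql_executed": lambda t: t["sql_ok"]}
ambiguous = lambda t: t.get("ambiguous", False)

v_fail = grade({"id": 1, "sql_ok": False}, checks, ambiguous, fake_llm_judge)
v_clear = grade({"id": 2, "sql_ok": True}, checks, ambiguous, fake_llm_judge)
v_llm = grade({"id": 3, "sql_ok": True, "ambiguous": True}, checks, ambiguous, fake_llm_judge)
```

Of the three toy traces, only the third reaches the LLM judge; the other two are settled by code that costs nothing per trace.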

Across all three levers the goal is the same: spend judge budget where a verdict can change a decision, and nowhere else. A judge you can't afford to run is a judge you don't have.

If you remember nothing else

  • 01 Every eval is an instance of Class Eval: judge + dataset + agent harness in, one standard report out. Get the interface right and the next eval gets cheaper.
  • 02 Golden data sets the bar; synthetic data scales it. Spot-check every synthetic batch or it quietly drifts off-distribution.
  • 03 Keep the eval harness separate from the agent harness, and make it call the real one. Drift between them means measuring an agent you do not ship.
  • 04 Agents mutate state, so every episode gets a fresh, hermetic world. Leftover state turns real signal into phantom bugs.
  • 05 In production, behavior beats sentiment: regenerate clicks and abandoned sessions are eval signal, not just product metrics.
  • 06 Judge cost times traffic is real money. Gate with deterministic checks, judge all failures, sample successes.