Section 12 · How do we use evals to improve our product?

Iterations

Evals aren't a report card. They're the steering wheel.

The steering wheel, not the report card

It's easy to build an eval suite and then use it like a report card: run it weekly, screenshot the dashboard, feel good (or bad), change nothing. That's the most expensive way to own evals: you pay the full cost of building and calibrating judges and collect almost none of the value.

The suite's real job is to answer one question, over and over: of these two versions of the agent, which one should exist? An absolute score is nearly meaningless on its own. "87% pass rate" tells you nothing about whether your new prompt is an improvement; "87% versus the baseline's 82%, on the same cases, outside the error bars" tells you exactly what to do next.

This reframing changes how you treat every number. A score isn't a grade to be proud of; it's one side of a comparison waiting for its other half. If a judge's output never changes a decision about what ships, the judge is decoration. And it changes what "done" means: an eval suite isn't finished when the dashboard is green; it's finished when it's the default way your team settles arguments about what to change.

Every spoke before this one was about making the measurements trustworthy. This one is about the payoff: pointing those measurements at candidate changes and letting them drive.

The offline experimentation loop

The core loop is dull on purpose:

  1. Propose a change. A reworded system prompt, a new tool definition, a different model, tighter retrieval chunking, anything you suspect might help.
  2. Run the suite on the candidate and the baseline, same cases, same judges.
  3. Compare with error bars. A 2-point bump on 50 cases is a single flipped case, which is noise. Both runs share the same cases, so compare them paired: put the interval on the per-case difference, not on each score separately. If zero sits inside that interval, you don't have a result yet.
  4. Ship or discard. Either way, you learned something. Log it.
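
Step 3 is the one that usually gets skipped, so here is a minimal sketch of it. It assumes each run is a list of pass/fail results (1 or 0) in the same case order, and it uses a plain percentile bootstrap on the per-case differences. The function names, the 95% level, and the resample count are illustrative choices, not something the suite prescribes.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval on the mean per-case difference.

    baseline[i] and candidate[i] must be results for the same case i
    (1 = pass, 0 = fail). Returns (mean_difference, lower, upper).
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same cases")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)  # seeded so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


def verdict(lower, upper):
    """Turn the interval into the decision from step 3."""
    if lower > 0:
        return "candidate better"
    if upper < 0:
        return "candidate worse"
    return "no result yet"  # zero sits inside the interval
```

Fed the 50-case example (41 passes versus 42, one flipped case), the interval includes zero and the verdict is "no result yet". A jump from 71 to 90 passes on 100 shared cases clears it comfortably.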

What this buys you is the upgrade from opinion to evidence. "I think the new prompt handles refunds better" starts a debate; "the refund-policy judge went from 71% to 90%, well clear of the noise" ends one. Judges are how prompt engineering stops being vibes.

Two habits keep the loop honest. First, change one thing at a time when you can. If you swap the model and rewrite the prompt together, a regression has two suspects. Second, watch how often you peek. Every time you check the suite, tweak, and check again, you leak a little information about the test set into your changes. Do it fifty times and you've quietly overfit to your eval cases: the suite says you improved, production says otherwise. Keep a held-out slice that you run only as a final check before shipping. The foot-guns spoke covers how badly this goes when ignored.
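
One way to keep the held-out slice stable is to assign cases by hashing their IDs, so a case never drifts between slices as the suite grows. This is a sketch under assumptions: cases have string IDs, and the 20% fraction and the salt value are arbitrary placeholders.

```python
import hashlib


def is_held_out(case_id: str, fraction: float = 0.2, salt: str = "holdout-v1") -> bool:
    """Deterministically assign a case to the held-out slice.

    The same case_id always lands in the same slice, regardless of
    suite size or ordering. Changing the salt reshuffles everything.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < fraction * 2**64


def split_cases(case_ids, fraction: float = 0.2):
    """Return (dev, held_out): iterate on dev, run held_out only before shipping."""
    dev = [c for c in case_ids if not is_held_out(c, fraction)]
    held_out = [c for c in case_ids if is_held_out(c, fraction)]
    return dev, held_out
```

The day-to-day loop only ever sees the first list. The second one gets run once, at the end, on the change you already intend to ship.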

Figure: The eval-driven improvement loop. Propose a change, run the suite against the baseline, compare with error bars, then ship or discard, and log the result either way.

Closing the loop: judges as optimization signal

Once judges are programmatic, a human doesn't have to be the one proposing changes. Judge scores are a signal any optimizer can climb.

The simplest version is automatic prompt optimization (APO): an outer loop generates prompt variants, your suite scores them, the best survive. GEPA-style evolutionary search takes this further (mutate prompts, evaluate, select, repeat) with your eval as the literal fitness function. These methods routinely find prompts no human would write, because they're searching a space humans get bored in after four attempts.
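
The mutate, evaluate, select, repeat cycle fits in a few lines. The sketch below is a generic evolutionary loop, not GEPA itself or any particular library's API. `mutate` and `score` are hypothetical callables you would supply: in practice `mutate` would probably call a model to rewrite the prompt, and `score` would run your suite on the non-held-out cases.

```python
import random
from typing import Callable, List, Optional, Tuple


def evolve_prompts(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],  # hypothetical: returns a prompt variant
    score: Callable[[str], float],                # hypothetical: runs the eval suite
    generations: int = 5,
    population: int = 8,
    survivors: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Tuple[float, str]]:
    """Return the surviving (score, prompt) pairs, best first."""
    rng = rng or random.Random(0)
    pool = [(score(seed_prompt), seed_prompt)]
    for _ in range(generations):
        parents = [prompt for _, prompt in pool]
        # Mutate: each child is a variant of a randomly chosen survivor.
        children = [mutate(rng.choice(parents), rng) for _ in range(population)]
        # Evaluate: the eval suite is the fitness function.
        scored = pool + [(score(child), child) for child in children]
        # Select: survivors carry over, so the best score never decreases.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        pool = scored[:survivors]
    return pool
```

Because survivors are carried into the next generation, the loop can only hold or raise the suite score. That is exactly why it needs a trustworthy suite underneath it.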

One level up are auto-research loops: an agent reads failing traces, hypothesizes a fix (a prompt edit, a tool description tweak, a new few-shot example), applies it, and lets the suite render a verdict. Run that overnight and you wake up to a ranked list of candidate improvements with scores attached. It's the experimentation loop from the previous section with the human moved from operator to reviewer.

The governing rule: the stronger your judges, the more autonomy you can hand the loop. With sloppy judges, automated optimization is a machine for manufacturing regressions that look like wins. And the pressure is asymmetric: a human iterating on prompts stumbles into a judge's blind spot occasionally; an optimizer running thousands of evaluations will find every blind spot and move in. This is Goodhart's law with a search algorithm behind it. Reward hacking isn't a hypothetical failure mode of frontier labs; it's what happens to your refund-policy judge on iteration 600.
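
One cheap guard, offered as a sketch rather than a complete defense: score each winning candidate a second time with a signal the optimizer never saw (a second judge, or a small human-labeled sample) and flag any win the audit doesn't echo. The function name and both thresholds are assumptions chosen for illustration.

```python
def looks_like_reward_hacking(
    optimized_gain: float,  # gain over baseline on the judge the optimizer climbed
    audit_gain: float,      # gain over baseline on an independent audit signal
    min_gain: float = 0.05,  # assumed: smaller gains aren't worth auditing
    max_gap: float = 0.10,   # assumed: tolerated disagreement between the two
) -> bool:
    """Flag candidates whose win shows up only on the optimized judge."""
    if optimized_gain < min_gain:
        return False  # no meaningful win is being claimed
    return (optimized_gain - audit_gain) > max_gap
```

A candidate that gains 19 points on the refund-policy judge and 1 point on the audit gets flagged. One that gains 19 and 15 does not.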

Evals are environments

Look at what a good agent eval already contains: a task definition, a sandboxed place for the agent to act (a mock booking API, a scratch database, a containerized filesystem), and a verifier that checks the outcome. Now look at what an RL environment needs: a task, a place to act, and a reward signal. Same parts. A well-built eval is most of an RL environment with the training loop left off.

Labs increasingly treat these as one artifact rather than two. Prime Intellect's verifiers framing makes it explicit: an environment is a task plus a verifiable reward, and the same object serves evaluation, RL training, and synthetic data generation. Write it once, use it for all three.
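
To make the shape concrete, here is a minimal sketch of an eval written as an environment. It is not the verifiers library's API: the `Environment` class, the `rollout` helper, and the mock booking task are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]


@dataclass(frozen=True)
class Environment:
    task: str                          # what the agent is asked to do
    make_sandbox: Callable[[], State]  # fresh, isolated state for every run
    verify: Callable[[State], float]   # checks the outcome, returns a reward in [0, 1]


def rollout(env: Environment, agent: Callable[[str, State], None]) -> float:
    """Run one episode: fresh sandbox, let the agent act, verify the result."""
    state = env.make_sandbox()
    agent(env.task, state)
    return env.verify(state)


# Toy instance: a mock booking API backed by a plain dict.
booking_env = Environment(
    task="Book a table for 2 at 19:00",
    make_sandbox=lambda: {"bookings": []},
    verify=lambda s: 1.0 if {"party": 2, "time": "19:00"} in s["bookings"] else 0.0,
)


def good_agent(task: str, state: State) -> None:
    state["bookings"].append({"party": 2, "time": "19:00"})


def idle_agent(task: str, state: State) -> None:
    pass  # does nothing, so the verifier should fail it
```

Here `rollout(booking_env, good_agent)` returns 1.0 and `rollout(booking_env, idle_agent)` returns 0.0. The same return value reads as a pass/fail in an eval report or as a reward in a training loop.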

You may never run RL, and this still matters, because building evals as environments forces three properties you want anyway:

  • Executable. The eval is code that runs end to end, not a spreadsheet of transcripts someone has to interpret.
  • Reproducible. Sandboxed state means the same case yields the same setup every run, so regressions are real, not flaky fixtures.
  • Reusable. A new model drops, you point the environment at it, and you have comparable numbers in an hour instead of a quarter.

And if the day comes when you do want to fine-tune or run RL against your product's tasks, your eval suite stops being a cost center and becomes the head start. The teams with strong verifiers are the teams that can train.

Grade the agent you have, not the one you wish you had

A subtle way improvement loops rot: grading curated transcripts instead of the agent's own behavior. You hand-write a tidy ten-turn conversation, drop the agent in at turn eight, and score its reply. Clean, reproducible, and off-policy. Your agent never sees turn eight of that conversation in production, because its own turns one through seven look nothing like your script. You're measuring performance on a distribution the agent doesn't live in.

Off-policy evals flatter the agent you wish you had. The hand-written history quietly avoids the mess your agent creates for itself: the ambiguous tool result it half-parsed at turn three, the wrong assumption it confidently carried forward. Production failures are usually compounding errors, and curated transcripts compound nothing.

The training world learned this the hard way. On-policy distillation (grading and correcting a student model on trajectories it sampled itself, rather than on the teacher's transcripts) tends to beat imitating curated data, because the student gets feedback exactly where its own behavior goes wrong. The eval lesson is the same: signal is most useful on the states your agent actually reaches.

Practically: seed eval cases from real production traces, not imagined ones. Let the agent generate its own full trajectories during evaluation instead of splicing it into golden histories. Refresh cases as the agent changes, since fixing one behavior shifts which states it reaches next. The agent in your eval should be recognizably the same animal as the one in production, including the limp.
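
In code, the difference is who writes the history. The sketch below seeds a case with only the opening user turn (ideally taken from a production trace) and lets the agent produce every one of its own turns. `agent` and `user_sim` are hypothetical callables: the first is your agent, and the second is whatever stands in for the user, such as a simulator or a scripted responder.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def on_policy_rollout(
    seed_user_turn: str,
    agent: Callable[[List[Turn]], str],               # hypothetical: your agent
    user_sim: Callable[[List[Turn]], Optional[str]],  # hypothetical: returns None to end
    max_turns: int = 10,
) -> List[Turn]:
    """Build a full trajectory in which every agent turn is the agent's own."""
    history: List[Turn] = [("user", seed_user_turn)]
    for _ in range(max_turns):
        # The agent sees only history it helped create, mistakes included.
        history.append(("agent", agent(history)))
        follow_up = user_sim(history)
        if follow_up is None:
            break
        history.append(("user", follow_up))
    return history
```

The judge then grades the returned trajectory as a whole. Nothing in it was hand-written except the first line, so any compounding errors in it are the agent's own.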

If you remember nothing else

  • 01 An eval suite exists to choose between candidate changes. A score on its own decides nothing; a comparison with error bars decides everything.
  • 02 Every change (prompt, tool, model, retrieval) goes through the same loop: run the suite against the baseline, compare, ship or discard.
  • 03 Iterating against the suite slowly overfits your agent to your eval cases. Keep a held-out slice that you run only as a final check before shipping.
  • 04 Programmatic judges can drive APO and auto-research loops, but optimizers route through your weakest judge. Autonomy is bounded by judge quality.
  • 05 A good eval with a sandbox and a verifier is most of an RL environment. Build it that way even if you never train.
  • 06 Grade on-policy. Curated transcripts flatter the agent you wish you had; production-shaped trajectories grade the one you actually ship.