Section 05 · What is the point of evals?

Eval Goals

Match your evals to your goals

Goals of Judges

A judge can pursue one of two main goals: raising the capability ceiling of your product or raising the floor of its quality. "Am I trying to build a new capability or prevent a problem from occurring?" should be one of the first questions you ask yourself when building out a judge.

Capabilities judges measure what your agent or product can do. They should start with low scores and climb as your agent improves. These judges are more challenging to build because they often require more complex graders and a high degree of information asymmetry to be truly effective.

Regression judges catch problems or errors in your agent's behavior. Their scores should stay near 100% after your initial fixes, and they should trigger an alert if they drop below a set threshold. Often considered simpler to identify, develop, and implement, regression judges will make up the backbone of your eval suite.

Capabilities judges climb; regression judges hold the floor
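The difference in how the two kinds of scores should be read can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Goal` and `JudgeResult` types, the judge names, the scores, and the 0.95 floor are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Goal(Enum):
    CAPABILITY = "capability"
    REGRESSION = "regression"


@dataclass(frozen=True)
class JudgeResult:
    name: str
    goal: Goal
    pass_rate: float  # fraction of passing cases, 0.0 to 1.0


def needs_alert(result: JudgeResult, regression_floor: float = 0.95) -> bool:
    """Alert only when a regression judge falls below its floor.

    Capabilities judges are expected to start low, so a low score there
    is not an alert. The 0.95 default is an illustrative assumption.
    """
    return result.goal is Goal.REGRESSION and result.pass_rate < regression_floor


# Hypothetical results from one eval run.
results = [
    JudgeResult("cites_correct_sources", Goal.REGRESSION, 0.91),
    JudgeResult("multi_step_planning", Goal.CAPABILITY, 0.45),
]

alerts = [r.name for r in results if needs_alert(r)]
print(alerts)  # ['cites_correct_sources']
```

The regression judge at 91% raises an alert while the capabilities judge at 45% does not, because only the regression judge is expected to hold a floor.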

Binary Verdicts & Abstention

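No matter what your goal is, your judge should emit one of three outcomes: Pass, Fail, or Abstain. The verdict itself stays binary; Abstain is reserved for cases the judge cannot decide either way.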

Just like humans, LLMs struggle with fuzzy decision boundaries. This makes Likert-style scoring more precise in theory but far less useful in practice. The ambiguity between a 3, a 4, and a 5 on a question like "Does this email contain the correct citations?" will have you chasing a problem that might not exist, or give you false confidence about a real issue with your agent. Binary outcomes make it easier to identify edge cases and errors in your judge's setup or rubric without wading through a pile of indeterminate middle cases. They also help you frame your questions in a testable way: if a valid reply to your question is "kinda", the question can probably be broken down into several more pointed questions.
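As a rough sketch of what a three-outcome verdict looks like in code, the Python below defines the outcomes and aggregates them. The `Verdict` enum and both helper functions are hypothetical names for this example. Excluding abstentions from the pass-rate denominator and reporting them separately is one possible convention, assumed here rather than taken from the text above.

```python
from __future__ import annotations

from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ABSTAIN = "abstain"


def pass_rate(verdicts: list[Verdict]) -> Optional[float]:
    """Pass rate over decided cases only (assumed convention).

    Returns None when every case was an abstention, so that
    "no signal" is not confused with a 0% score.
    """
    decided = [v for v in verdicts if v is not Verdict.ABSTAIN]
    if not decided:
        return None
    return sum(v is Verdict.PASS for v in decided) / len(decided)


def abstain_rate(verdicts: list[Verdict]) -> float:
    """Share of cases the judge declined to decide."""
    if not verdicts:
        return 0.0
    return sum(v is Verdict.ABSTAIN for v in verdicts) / len(verdicts)


# Made-up verdicts for five cases.
verdicts = [Verdict.PASS, Verdict.PASS, Verdict.PASS, Verdict.FAIL, Verdict.ABSTAIN]
print(pass_rate(verdicts))     # 0.75
print(abstain_rate(verdicts))  # 0.2
```

Tracking the abstain rate next to the pass rate keeps a judge that abstains on most cases from looking healthier than it is.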

Communication of Judge Goals

If your regression judge scores 45% after a round of offline evals, you have serious problems to fix, but a capabilities judge at 45% is completely fine. A challenge of working with two nearly identical species of judges is communicating the difference effectively. I cover this more in the Communicating Results section, but at a minimum you should tag and split the judges whenever possible.

Your audiences for the judges should differ and, in the same way that some metrics are meant to be internal rather than external, you should surface each judge only to the audience it is relevant to. Splitting dashboards by "capabilities vs. regression" is an effective way to do this without putting a lot of thought into each judge. A good question to ask yourself is "What would be the immediate reaction if this judge's score were to drop to 30%?" If the answer isn't emergency meetings and war rooms, then it is probably a judge worth keeping to yourself or a limited audience.
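Tagging and splitting can be as simple as grouping judges by a tag before they reach a dashboard. The sketch below assumes a hypothetical in-memory registry; the judge names, scores, and the tag-to-audience mapping are invented for illustration.

```python
from collections import defaultdict

# Hypothetical judge registry; names, tags, and scores are made up.
judges = [
    {"name": "cites_correct_sources", "tag": "regression", "score": 0.98},
    {"name": "no_broken_links", "tag": "regression", "score": 1.00},
    {"name": "multi_step_planning", "tag": "capabilities", "score": 0.45},
]

# Assumed mapping from tag to the audience that should see it.
AUDIENCE = {"regression": "wider stakeholders", "capabilities": "eval team only"}


def split_dashboards(judges):
    """Group judge names by tag so each dashboard shows one kind of judge."""
    dashboards = defaultdict(list)
    for judge in judges:
        dashboards[judge["tag"]].append(judge["name"])
    return dict(dashboards)


for tag, names in split_dashboards(judges).items():
    print(f"{tag} dashboard ({AUDIENCE[tag]}): {names}")
```

Because the split depends only on the tag, adding a new judge does not require a new decision about who should see it.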

Graduating Judges

If all goes well, your capabilities judges will eventually saturate and hold steady at 100%. When this happens, you no longer care as much about raising the capability ceiling of your agent and should begin to care more about maintaining your newfound abilities. At this point, these judges typically transition from capabilities judges to regression judges.
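One way to make "saturated" concrete is to require that the most recent runs all meet a threshold. The following is a sketch under assumptions: the function name, the window of five runs, and the score history are hypothetical, and the right window for a real suite depends on how noisy the judge is.

```python
def is_saturated(pass_rates, window=5, threshold=1.0):
    """True when the most recent `window` runs all meet the threshold.

    Both defaults are illustrative assumptions, not recommended values.
    """
    if len(pass_rates) < window:
        return False  # not enough history to call the score stable
    return all(rate >= threshold for rate in pass_rates[-window:])


# Made-up score history for one capabilities judge, oldest run first.
history = [0.40, 0.65, 0.90, 1.0, 1.0, 1.0, 1.0, 1.0]

tag = "regression" if is_saturated(history) else "capabilities"
print(tag)  # regression
```

Requiring a full window of stable runs avoids graduating a judge on a single lucky run.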

While the transition from offline capabilities judge to offline regression judge might be as simple as changing a tag, converting your offline judge to its online counterpart will require rethinking how you judge pass and fail cases. Depending on your initial judge setup, you will essentially have to create a reference-free alternative, since you can't curate the correct answers for all possible scenarios.
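To illustrate the shift, the sketch below contrasts a reference-based check with a reference-free one, reusing the citation question from earlier. Everything here is an assumption for the example: the function names, the `[source_id]` citation format, the exact-match comparison, and the choice to abstain when an email contains no citations at all.

```python
from __future__ import annotations

import re
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ABSTAIN = "abstain"


def offline_judge(output: str, expected: str) -> Verdict:
    """Reference-based: compare the output to a curated expected answer."""
    return Verdict.PASS if output.strip() == expected.strip() else Verdict.FAIL


def online_judge(email: str, retrieved_ids: set[str]) -> Verdict:
    """Reference-free: does every citation point to a source the agent retrieved?

    No curated answer is needed; the check uses only the output and the
    context available at run time. Citations are assumed to look like [doc1].
    """
    cited = set(re.findall(r"\[(\w+)\]", email))
    if not cited:
        return Verdict.ABSTAIN  # nothing to check (assumed policy)
    return Verdict.PASS if cited <= retrieved_ids else Verdict.FAIL


print(online_judge("See [doc1] and [doc2].", {"doc1", "doc2", "doc3"}))  # Verdict.PASS
print(online_judge("See [doc9].", {"doc1"}))                             # Verdict.FAIL
```

The online version answers a narrower question than the offline one, which is typical of the trade made when reference answers are unavailable.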

In many cases, new failure modes will appear within your agent's capabilities and require additional regression judges. When that happens, you may see overlap with your capabilities judge, at which point retirement or modification could be useful.