Section 02 · What do good evals look like?

FAST Evals

Not all evals are created equal

Filtering out bad judges helps maintain trust in your evals and builds a strong foundation for your eval suite. A common failure mode when building out eval suites is that the validity of the evals themselves comes into question: the conversation shifts from "How do we make the agent better?" to "Why are our evals broken?" At that point, every low judge score leads to an interrogation of the eval suite instead of an improvement in the product, and it becomes very hard to regain confidence afterwards.

Because it is so important to build good judges the first time around, I have made this the lead section of the entire guide. You will start in a great spot if you think through this poka-yoke whenever you find yourself building judges.

Feasible

Before you commit any amount of time to judge creation, you should think through the feasibility of deploying this judge. Is the problem it targets severe enough to warrant monitoring? Is it frequent enough that you might actually catch failure cases when sampling? Is your grading method cost-effective enough to not bankrupt you?

You will save time, money, and headaches by thinking through the feasibility of your judge before you build it.
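The frequency question can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes failures occur independently at a fixed rate, which production traffic rarely satisfies exactly, and the function names are hypothetical.

```python
import math


def chance_of_catching(failure_rate: float, sample_size: int) -> float:
    """Probability that a random sample contains at least one failure.

    Assumes each sampled trace fails independently with probability failure_rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return 1.0 - (1.0 - failure_rate) ** sample_size


def samples_needed(failure_rate: float, target: float = 0.95) -> int:
    """Smallest sample size whose chance of catching a failure meets the target."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("failure_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - failure_rate))
```

Under these assumptions, a problem that shows up in 1% of traces needs 299 sampled traces for a 95% chance of seeing it even once. Multiplying that sample size by your cost per graded trace gives a rough first answer to the cost question.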

While costliness is an easy binary decision, where costly judges are automatically rejected, severity and frequency have to be weighed against each other. I imagine this as the interaction between a severity term (the importance of the capability or problem) and a frequency term (how often you will encounter it in production). A low value in one term is not always sufficient for rejecting a judge, especially if the other term is high. Of course, this requires some estimation, so it should be considered more art than science.

	Low Frequency	High Frequency
Low Severity	Kill	Implement (use cheapest grader possible)
High Severity	Implement (be careful not to over-index on the results from small sample sizes)	Implement
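The matrix above can be written down as a small lookup. This is only a sketch: the function name and the boolean inputs are hypothetical, and reducing severity and frequency to high/low is exactly the estimation step described earlier.

```python
def triage_judge(high_severity: bool, high_frequency: bool, too_costly: bool = False) -> str:
    """Return the matrix decision for a proposed judge.

    The inputs are rough estimates, not measurements.
    """
    if too_costly:
        # Costly judges are rejected regardless of severity or frequency.
        return "kill"
    if not high_severity and not high_frequency:
        return "kill"
    if not high_severity and high_frequency:
        return "implement (use cheapest grader possible)"
    if high_severity and not high_frequency:
        return "implement (do not over-index on small samples)"
    return "implement"
```

Writing the decision down this way is mostly useful as a checklist: it forces you to state your severity and frequency estimates before any judge code exists.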

Actionable

Each judge should be closely tied to a specific problem you have encountered or a capability you aim to enhance. Many eval suites fail because they attempt to measure everything in just a few judges.

For example:

Actionable: Hotel Reservation Judge. "When reservations were requested, did the agent actually reserve the room?" Limited to just one outcome or capability that you care about, this should give you insight into failing instances.
Non-Actionable: Quality Judge. "Is the information the agent outputs high quality?" Your alarm bells should be ringing if you hear someone recommend one judge for this. Quality judges are typically substitutes for doing the thinking necessary to understand what levers you can pull on your product.
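As a rough illustration of how narrow an actionable judge can be, here is a sketch of the hotel reservation judge. The trace layout, the `reservation_requested` flag, and the `reserve_room` tool name are all assumptions made for this example, not part of any real framework.

```python
from typing import Optional


def hotel_reservation_judge(trace: dict) -> Optional[bool]:
    """Did the agent actually reserve the room when a reservation was requested?

    Returns None when no reservation was requested, so the trace is
    excluded from this judge instead of counting as a pass or a failure.
    """
    if not trace.get("reservation_requested", False):
        return None
    # Pass only if a reservation tool call (hypothetical name) succeeded.
    return any(
        call.get("name") == "reserve_room" and call.get("status") == "success"
        for call in trace.get("tool_calls", [])
    )
```

Every failure this judge reports points at one thing to investigate: a requested reservation that never went through.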

Specific

Is the failure/success case limited to one binary property? It is important to understand both what you are trying to measure and the level at which you are measuring it.

For example:

Specific: SQL Execution Judge. "Does the SQL passed by the agent execute?" Failures from a judge like this are limited to just one subcomponent of your product. It focuses on just one potential failure mode (even though that failure could have multiple causes).
Non-Specific: Data Accuracy Judge. "Is the data returned by the agent accurate?" Unless you only provide data from one channel, this kind of judge is almost always dozens of judges stacked in a trenchcoat. Broad judges should always be decomposed into more specific judges focused on single subcomponents.
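A SQL execution judge can be a few lines of deterministic code. The sketch below uses Python's built-in `sqlite3` module against a throwaway in-memory database; it assumes the agent emits a single SQLite-compatible statement and that you can supply the schema as a script. A real deployment would target your own database engine.

```python
import sqlite3


def sql_executes(sql: str, schema: str) -> bool:
    """Return True if the agent's SQL runs without error against the schema."""
    conn = sqlite3.connect(":memory:")  # fresh database, so nothing real is touched
    try:
        conn.executescript(schema)  # schema errors propagate: they are setup bugs, not agent failures
        try:
            conn.execute(sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            # Syntax errors, unknown tables or columns, multiple statements, and so on.
            return False
        return True
    finally:
        conn.close()
```

Note that this only answers "does it execute?" and says nothing about whether the result is correct, which is the point: correctness belongs to a separate judge.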

Testable

Evals work best when you are very clear about both success and failure modes. In most cases, the best way to create a judge is to start with a clearly defined, refutable question.

For example:

Testable: Search Tool Invocation Judge. "Does the agent call our search tool before making a factual claim?" This question will have a clear yes or no answer for each instance in your dataset.
Not Testable: Tone Judge. "Is the tone friendly?" This is a subjective question that raters may disagree on. The fix here is to redefine it in terms of clear properties like "Does the agent use profanity?" or "Does the agent acknowledge user frustration?"
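A testable question translates almost directly into code. The sketch below checks the search tool question over an ordered list of events. The event shape and the `search` tool name are hypothetical, and it assumes factual claims have already been labeled upstream, which is its own (harder) problem.

```python
def search_before_claim(events: list[dict]) -> bool:
    """Return True if every factual claim comes after at least one search call.

    `events` is the agent's trace in chronological order.
    """
    searched = False
    for event in events:
        if event.get("type") == "tool_call" and event.get("name") == "search":
            searched = True
        elif event.get("type") == "factual_claim" and not searched:
            # A claim was made before any search: clear, refutable failure.
            return False
    return True
```

Two people reading the same trace should agree on this verdict every time, which is what separates it from the tone question.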

Figure: The FAST judge-building flowchart. Run each judge through the FAST filter before you ship.

When metrics fail the gates

Not every judge will pass every gate, and that's OK. Some judges, especially capabilities judges as we will see later on, are extremely challenging to make cost-effective at first. Others may not be directly actionable if they are too coarse-grained. The point of the FAST gates is to help you understand what you are giving up when you make these decisions.

Eval Suites are the combination of many judges

You should feel comfortable compiling multiple judges together to build out a single eval suite. There is no benefit to attempting to build one comprehensive judge that fully encompasses the problem you have identified. With multiple judges, you can better map the surface area of your problem and catch all the opportunities as they arise.
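One possible way to compile judges into a suite is sketched below. The `Judge` signature (a function from a trace to True, False, or None for "not applicable") and the report layout are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional

# A judge maps one trace to pass (True), fail (False), or not applicable (None).
Judge = Callable[[dict], Optional[bool]]


def run_suite(judges: Dict[str, Judge], traces: List[dict]) -> Dict[str, dict]:
    """Run every judge over every trace and report results per judge."""
    report = {}
    for name, judge in judges.items():
        verdicts = [judge(trace) for trace in traces]
        applicable = [v for v in verdicts if v is not None]
        passed = sum(1 for v in applicable if v)
        report[name] = {
            "applicable": len(applicable),
            "passed": passed,
            # No pass rate when the judge never applied, rather than a misleading 0 or 1.
            "pass_rate": passed / len(applicable) if applicable else None,
        }
    return report
```

Keeping results separate per judge, instead of averaging them into one score, preserves the link between each number and the specific problem it was built to track.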