Eval Design · FAST Evals

Information asymmetry and the ground truth

The eval built in the last section will have nothing to judge unless you think carefully about input design: how you will create the dataset for your evals and how you will pass all the necessary information to the judge.

Information asymmetry is the idea that your judge has privileged access to information that your agent did not have access to at the time it completed its session. As you design the inputs for your judge you must keep this in mind, or else you will fall victim to simply trading one AI's judgement for another's. This oft-overlooked step will have a tremendous impact on the quality of your judges.

In practice, this looks like building ground truth datasets or passing additional information to your judges (e.g., the correct materials needed to get the right answer to a research question) that your agent would have to find on its own. In cases where you are using a less-than-frontier model, you can occasionally circumvent building ground truth data by employing the capability asymmetry (in reality this is just built-in information asymmetry) of a stronger frontier model.

Some common forms of information asymmetry:

Golden Answers: These are the most expensive to create since they require human intervention to build, but they are often the most reliable in production. The judge simply has to compare the correct answer against the agent's answer to make a decision.
Rubrics: Written criteria for the judge to use when grading the outputs. This gives the judge an advantage over simply asking "is this good?" but will often require some level of calibration to make sure it is accurately measuring the property.
Stronger models as judges: Stronger models often can operate as judges for smaller/weaker models since they will possess information that the weaker ones do not by default. This is the cheapest to create but the riskiest to productionalize.
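
A minimal sketch of the golden-answer form, assuming a hypothetical `JudgeInput` record and a plain normalized string comparison (real judges often need fuzzier matching or an LLM call). What it illustrates is that `golden_answer` is privileged context: it reaches the judge but is never shown to the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeInput:
    """Everything the judge sees for one sample (hypothetical structure)."""
    question: str        # also shown to the agent
    agent_answer: str    # produced by the agent
    golden_answer: str   # privileged: human-written, never shown to the agent


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not fail a sample.
    return " ".join(text.lower().split())


def golden_answer_judge(sample: JudgeInput) -> bool:
    """Pass when the agent's answer matches the golden answer after normalization."""
    return normalize(sample.agent_answer) == normalize(sample.golden_answer)


sample = JudgeInput(
    question="What year was the transistor invented?",
    agent_answer="  1947 ",
    golden_answer="1947",
)
print(golden_answer_judge(sample))
```

Rubrics and stronger-model judges fit the same shape: the privileged field would hold the written criteria or be replaced by the stronger model's own knowledge.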

Outcome vs. process

The next part to consider is whether you are measuring an outcome or a process. Reframed as a question: are you assessing the final result of your agent's behavior or the route it took to get there?

Process: Much like the actionability gradient from using finer measurement grains, judging the process allows us to quickly ascertain whether a specific action is occurring within the agent's trace. This comes at the cost of fragility. If you or your dev team decide to materially update the agent's tooling or harness, process judges will often break and need to be recalibrated.
Outcome: On the other hand, outcome judges are extraordinarily robust to changes in harnesses, environments, or tooling. As long as the outcome you are measuring does not change, this judge will always work. These come at the cost of actionability and specificity. It is often hard to build a judge that measures outcomes without it becoming too vague to be useful. To take a more traditional product example, drops in user sentiment indicate that people are not liking your product as much, but without clear causal chains, it is extremely difficult to pinpoint why that would be the case. Outcome judges work best when they are narrowed to specific portions of your agent.

Defaulting to outcome grading when the outcome is checkable is a good rule of thumb. Knowing when to add process checks is more of an art than a science. Only you and your team will know how stable different parts of your harness will be in the future.
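
The trade-off can be sketched with two toy judges over the same trace. The `Step` shape, the tool names, and the exact-match comparison are all assumptions made for illustration, not a prescribed trace format.

```python
from typing import TypedDict


class Step(TypedDict):
    kind: str     # "tool_call" or "final_answer"
    name: str     # tool name; empty for the final answer
    content: str


def outcome_check(trace: list[Step], expected: str) -> bool:
    """Outcome judge: looks only at the final answer, not at how it was reached."""
    finals = [s for s in trace if s["kind"] == "final_answer"]
    return bool(finals) and finals[-1]["content"].strip() == expected


def process_check(trace: list[Step], required_tool: str) -> bool:
    """Process judge: passes only if a specific tool was called somewhere in the trace."""
    return any(s["kind"] == "tool_call" and s["name"] == required_tool for s in trace)


trace: list[Step] = [
    {"kind": "tool_call", "name": "web_search", "content": "capital of Australia"},
    {"kind": "final_answer", "name": "", "content": "Canberra"},
]
# Same behavior after a harness change that renames the tool (hypothetical name).
renamed_trace: list[Step] = [
    {"kind": "tool_call", "name": "search_v2", "content": "capital of Australia"},
    {"kind": "final_answer", "name": "", "content": "Canberra"},
]

print(outcome_check(trace, "Canberra"), process_check(trace, "web_search"))
print(outcome_check(renamed_trace, "Canberra"), process_check(renamed_trace, "web_search"))
```

After the rename, the outcome judge still passes while the process judge fails even though the agent behaved the same way, which is the fragility described above.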

Environment vs. harness

Eval harnesses and environments are challenging to disambiguate. Confusing the two can cause eval teams a great deal of stress, particularly when scores are dropping. Understanding each, and the interplay between the two, is essential to building a useful eval suite.

The eval environment is everything that the agent can act on. It should be considered a constructed, reproducible state. This includes the tools the agent can call, the data it can act on, and the sandbox it is working in.
The eval harness is everything around the environment and the agent. This is where the code that starts the eval lives. Here you can pass in the task, collect the trace, and pass off to the grader. This is not the same as your agent harness, and should be thought of as agent-agnostic. You should be able to pass any agent or model into your eval harness without touching the eval harness code.
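
One way to picture the split is the sketch below, where the environment is a small constructed state and the harness is a function that accepts any agent. The `Environment` shape, the `run_eval` signature, and the toy agent and grader are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Environment:
    """Constructed, reproducible state the agent can act on (hypothetical shape)."""
    tools: dict[str, Callable[[str], str]]
    data: dict[str, str] = field(default_factory=dict)


# Any agent is just a callable: (task, environment) -> trace of text steps.
Agent = Callable[[str, Environment], list[str]]
Grader = Callable[[list[str]], bool]


def run_eval(agent: Agent, make_env: Callable[[], Environment], task: str, grader: Grader) -> bool:
    """Eval harness: build a fresh environment, pass in the task, collect the trace, grade it."""
    env = make_env()            # fresh state per run keeps results reproducible
    trace = agent(task, env)    # the harness knows nothing about the agent's internals
    return grader(trace)


def make_env() -> Environment:
    docs = {"refund_policy": "Refunds are accepted within 30 days."}
    return Environment(tools={"lookup": lambda key: docs.get(key, "")}, data=docs)


def toy_agent(task: str, env: Environment) -> list[str]:
    # Stand-in agent: calls one tool and returns its output as the final step.
    return [f"task: {task}", env.tools["lookup"]("refund_policy")]


def grader(trace: list[str]) -> bool:
    return "30 days" in trace[-1]


print(run_eval(toy_agent, make_env, "What is the refund window?", grader))
```

Swapping in a different agent or model means passing a different callable to `run_eval`; the harness code itself stays untouched.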

Pressure testing

You should pressure test your judge against a few dozen known-good and known-bad traces. This is a simple but effective way (it amounts to looking at your data) to identify many of your judge's edge cases long before you would notice them in production. Two questions to ask yourself are:

Is there a way to make this judge pass without exhibiting the desired behavior?
Can this judge fail while the underlying agent still exhibits desired behavior?

While this can be a time-consuming step, you will be surprised by how many edge cases in your rubrics and assertions you would miss without it. In practice, this looks like dry-running your judge against a dataset of labeled traces, then manually counting the ones where it fails to assign the correct label. If it fails often, iterate on the rubric and try again.
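
A dry run of this kind can be sketched as follows. The `LabeledTrace` tuple, the trace texts, and the deliberately weak `naive_citation_judge` are hypothetical; the two output lists correspond to the two questions above.

```python
from typing import Callable

# Each labeled trace: (trace_id, trace_text, human_label) where True means "known-good".
LabeledTrace = tuple[str, str, bool]


def pressure_test(judge: Callable[[str], bool], traces: list[LabeledTrace]) -> dict[str, list[str]]:
    """Dry-run the judge and collect the trace ids where it disagrees with the human label."""
    false_passes: list[str] = []   # judge passed a known-bad trace
    false_fails: list[str] = []    # judge failed a known-good trace
    for trace_id, text, is_good in traces:
        verdict = judge(text)
        if verdict and not is_good:
            false_passes.append(trace_id)
        elif not verdict and is_good:
            false_fails.append(trace_id)
    return {"false_passes": false_passes, "false_fails": false_fails}


def naive_citation_judge(text: str) -> bool:
    # Deliberately weak rule: pass whenever the response contains any URL.
    return "http" in text


traces: list[LabeledTrace] = [
    ("t1", "See https://example.com/paper for details.", True),
    ("t2", "See https://example.com/made-up for details.", False),  # fabricated link
    ("t3", "No external sources were needed for this answer.", True),
]

print(pressure_test(naive_citation_judge, traces))
```

Here the weak judge passes the fabricated link in `t2` and fails the legitimate, citation-free answer in `t3`, so both questions surface a problem from only three traces.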

Input / Dataset Design

Testing your judge with the correct data will save you tons of time later by helping you drastically lower your false positives and false negatives before your judge sees a production dataset. These pressure testing datasets are typically some mix of the following:

Sampled production traces: Known-good and known-bad traces taken from real data. If you subscribe to Husain's camp of Error Analysis, this part is built in for you: just look at your previously labeled data.
Hand-drawn traces: These can help create cases that you expect to see in the future that may not be visible right now.

Make sure that your dataset covers happy paths, known edge cases, adversarial inputs, and negative cases. It also helps to include samples across the spectrum of difficulty, although this is very time-consuming to do.
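
A quick coverage check can catch a missing category before the judge is run. The sketch below assumes each sample carries a hypothetical `category` tag using the four labels named above; the tag names and sample records are illustrative.

```python
from collections import Counter

# Assumed category tags, mirroring the four kinds of cases listed above.
REQUIRED_CATEGORIES = {"happy_path", "edge_case", "adversarial", "negative"}


def coverage_gaps(samples: list[dict[str, str]]) -> set[str]:
    """Return the required categories that have no samples in the dataset."""
    counts = Counter(s["category"] for s in samples)
    return {c for c in REQUIRED_CATEGORIES if counts[c] == 0}


samples = [
    {"id": "s1", "category": "happy_path", "source": "production"},
    {"id": "s2", "category": "edge_case", "source": "hand_drawn"},
    {"id": "s3", "category": "negative", "source": "production"},
]

print(coverage_gaps(samples))
```

In this toy dataset the check reports that no adversarial samples exist yet.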

Eval cards

Versioning is just as important for your judge as it is for your product. Essentially, you want the version of you (or your agent) that exists six weeks from now to be able to quickly understand exactly what this judge does, why it was made, how to use it, and what to do when it fails. In practice this looks like an eval card with fields such as:

Failure Mode
Motivating Traces
Owner
Last Calibration Date
Agreement Rate
Privileged Context
Dataset
Measurement Grain
Unit under test
And others

Without this, each time you re-engage with a judge, you have to teach yourself everything about it as if it were the first time you had encountered the problem.
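
If the card lives next to the judge's code, it can be a small versioned record. The sketch below is one possible shape, not a standard: the field names are illustrative, the example values are taken from the fabricated-citation card in this section, and the 90-day staleness threshold is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalCard:
    """Versioned metadata that travels with a judge (field names are illustrative)."""
    name: str
    version: str
    owner: str
    failure_mode: str
    motivating_traces: tuple[str, ...]
    unit_under_test: str
    privileged_context: str
    dataset: str
    last_calibration: date
    agreement_rate: float

    def is_stale(self, today: date, max_age_days: int = 90) -> bool:
        # Assumed policy: flag the judge for recalibration after max_age_days.
        return (today - self.last_calibration).days > max_age_days


card = EvalCard(
    name="judge.cite_fabrication",
    version="v1.3",
    owner="evals@acme",
    failure_mode="Response cites a URL or paper that does not resolve or support the claim.",
    motivating_traces=("tr_88412", "tr_88679", "tr_91037"),
    unit_under_test="End-to-end response, post-retrieval",
    privileged_context="Retrieved documents + live URL resolution results",
    dataset="142 real traces + 60 perturbed variants",
    last_calibration=date(2026, 5, 28),
    agreement_rate=0.94,
)

print(card.is_stale(today=date(2026, 9, 15)))
```

Keeping the card in version control alongside the judge prompt means a prompt edit and its recalibration date show up in the same diff.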

A filled-in eval card: the one-page design spec for a fabricated-citation judge, showing its position, dataset, privileged context, FAST checklist, calibration stats, and changelog.

Fabricated citation check

judge.cite_fabrication · owner: evals@acme · v1.3 calibrated
grain: trace regression · scorer: URL check + LLM judge

Failure mode: Response cites a URL or paper that does not resolve, or that does not support the claim it's attached to.
Motivating traces: tr_88412, tr_88679, tr_91037
Unit under test: End-to-end response, post-retrieval (full RAG pipeline)
Privileged context: Retrieved documents + live URL resolution results
Dataset: 142 real traces + 60 perturbed variants, incl. 38 negatives (correct citations that look unusual)
Pass criteria: Fail on ≥1 fabricated citation; CI gate requires 0 fails on golden set, evidence span quoted per verdict
Expected base rate: ~3% of production traces fail (last measured May 2026)
Recalibration triggers: Model swap · judge prompt edit · base-rate drift >2x · quarterly review

FAST checklist

Feasible: ≈$0.004 per trace
Actionable: quotes offending span
Specific: one property only
Testable: binary verdict per trace

Calibration · n=120 labels · last run 2026-05-28

Agreement: 94%
Cohen's κ: 0.81
TPR: 0.92
FPR: 0.04

Changelog: v1.3 tightened evidence quoting after FPR spike · v1.2 added 38 negatives · v1.0 initial
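
For readers who want to reproduce the calibration fields on a card like this, the sketch below computes agreement, Cohen's κ, TPR, and FPR from a confusion matrix of judge verdicts against human labels. The counts are hypothetical and are not the data behind the card above; the function assumes both classes are present so that no denominator is zero.

```python
def calibration_stats(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Agreement, Cohen's kappa, TPR, and FPR for a binary judge vs. human labels.

    "Positive" here means the trace was flagged as exhibiting the failure mode.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: probability both say "positive" plus probability both say "negative".
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "agreement": observed,
        "kappa": (observed - expected) / (1 - expected),
        "tpr": tp / (tp + fn),   # share of human-labeled failures the judge caught
        "fpr": fp / (fp + tn),   # share of human-labeled passes the judge flagged
    }


# Hypothetical counts for illustration only.
stats = calibration_stats(tp=46, fp=3, fn=4, tn=67)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Reporting κ alongside raw agreement matters when failures are rare, since a judge that passes everything can score high agreement while catching nothing.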