Section 06 · How do I grade the outputs?

Scoring

Returns on your eval investments

There are hundreds of ways to go about grading your agent’s outputs, and thousands of ways to mess it up. This section walks through the three common types of graders, how each is set up, their pros and cons, and some key ideas to keep in mind when choosing a grader for your judge.

Deterministic graders

These are the cheapest to build, the easiest to interpret, and by far the fastest to run. A deterministic grader will always give you the same output if you provide it the same input. These are code-based and are typically run at the event or span levels. Their main drawback is that they are notoriously brittle: every change to your agent, your harness, or your environment runs the risk of breaking these types of graders.

Here is a breakdown of code-based graders as Anthropic sees it:

Methods:
• String match checks (exact, regex, fuzzy, etc.)
• Binary tests (fail-to-pass, pass-to-pass)
• Static analysis (lint, type, security)
• Outcome verification
• Tool calls verification (tools used, parameters)
• Transcript analysis (turns taken, token usage)

Strengths:
• Fast
• Cheap
• Objective
• Reproducible
• Easy to debug
• Verify specific conditions

Weaknesses:
• Brittle to valid variations that don’t match expected patterns exactly
• Lacking in nuance
• Limited for evaluating some more subjective tasks
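As a rough illustration of two of the methods listed above (a string match check and tool call verification), here is a minimal sketch. The `ToolCall` record, the function names, and the reservation-code pattern are all hypothetical; your traces will have their own shape.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation pulled from a trace.
    name: str
    params: dict = field(default_factory=dict)


def grade_regex(output: str, pattern: str) -> bool:
    """String match check: pass if the pattern appears anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_tool_calls(calls: list[ToolCall], required: dict[str, dict]) -> bool:
    """Tool call verification: every required tool was called at least once
    with (at least) the expected parameter values."""
    for name, expected in required.items():
        matched = any(
            call.name == name
            and all(call.params.get(key) == value for key, value in expected.items())
            for call in calls
        )
        if not matched:
            return False
    return True


# Assumed example: a reservation code is three capital letters then three digits.
CODE = r"\b[A-Z]{3}\d{3}\b"
print(grade_regex("Your reservation ABC123 is confirmed.", CODE))  # True
print(grade_regex("Your reservation abc123 is confirmed.", CODE))  # False
```

The second call shows the brittleness in miniature: a lowercase reservation code may be a perfectly valid answer, but the pattern rejects it.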

Probabilistic graders

The next level up, probabilistic graders use some non-deterministic model to estimate the label for a given output. While useful at the event or span level, these types of graders shine at the trace and session level. Probabilistic graders are cheap enough to be used at scale and robust enough to withstand moderate changes to the agent. You may already know these graders as LLM-as-a-judge, though that is just the most popular kind of probabilistic grader. Other kinds include model-based binary classifiers of all types, from logistic regressions to deep-learning-based estimators. The main drawback of probabilistic graders is that they must be calibrated against some form of ground truth before they can be trusted.

When most people think about Evals and Judges, LLM-as-a-judge is usually what they are talking about.

Here is a breakdown of model-based graders as Anthropic sees it:

Methods:
• Rubric-based scoring
• Natural language assertions
• Pairwise comparison
• Reference-based evaluation
• Multi-judge consensus

Strengths:
• Flexible
• Scalable
• Captures nuance
• Handles open-ended tasks
• Handles freeform output

Weaknesses:
• Non-deterministic
• More expensive than code
• Requires calibration with human graders for accuracy
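One way to make the calibration requirement concrete is to compare a judge's pass/fail labels against human labels on the same traces. The sketch below assumes both label sets are already collected as booleans in matching order; the function name and the choice of metrics (raw agreement, true positive rate, true negative rate) are illustrative assumptions, not a prescribed procedure.

```python
def calibration_report(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare judge labels to human labels, treating the human labels as ground truth."""
    if len(judge) != len(human) or not human:
        raise ValueError("judge and human must be non-empty and the same length")

    pairs = list(zip(judge, human))
    tp = sum(1 for j, h in pairs if j and h)          # judge pass, human pass
    tn = sum(1 for j, h in pairs if not j and not h)  # judge fail, human fail
    fp = sum(1 for j, h in pairs if j and not h)      # judge pass, human fail
    fn = sum(1 for j, h in pairs if not j and h)      # judge fail, human pass

    nan = float("nan")
    return {
        "agreement": (tp + tn) / len(pairs),
        # Of the outputs humans passed, how many did the judge also pass?
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else nan,
        # Of the outputs humans failed, how many did the judge also fail?
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else nan,
    }


human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]
print(calibration_report(judge_labels, human_labels))
```

Looking at the two rates separately matters because a judge that passes nearly everything can still show high raw agreement when most outputs are good.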

Human graders

By far the most expensive of the three types of graders, human graders are considered the most accurate and require the least technical understanding. You simply pass the traces and a rubric to human experts, who then give you a pass/fail/abstain. These kinds of grading mechanisms are inherently unscalable and occasionally require calibration as well. In my experience, using humans should only be done at the very beginning of your eval build-out (as a form of error analysis) or when you are calibrating your model-based judges.

There are 3 primary types of human graders:

Expert Reviewers: These are subject matter experts who use their proximity to and innate knowledge of the product to review its outputs. While considered the best source of review data, expert reviewers rarely scale past 3-5 people.
Trained Reviewers: These are human reviewers who have general knowledge of the product or review process but would require some level of re-education from the expert reviewers to truly understand the product. They are more scalable than expert reviewers, but are still costly to deploy.
Scaled Human Review: Hiring additional support staff or contractors purely to review data can sometimes be an option of last resort when scaling your review processes. Likely the cheapest option, these reviewers carry the greatest risk of not understanding the product or what you are actually trying to measure.

In all cases, it is better for you to look at the data first.

Here is a breakdown of human-based graders as seen by Anthropic:

Methods:
• SME review
• Crowdsourced judgment
• Spot-check sampling
• A/B testing
• Inter-annotator agreement

Strengths:
• Gold standard quality
• Matches expert user judgment
• Used to calibrate model-based graders

Weaknesses:
• Expensive
• Slow
• Often requires access to human experts at scale
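Inter-annotator agreement, listed above, is one way to check whether human reviewers themselves need calibration. A common statistic for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal version under the assumption that both reviewers labeled the same traces, in the same order, with pass/fail/abstain; the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label if each
    # labeled at random according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


reviewer_1 = ["pass", "pass", "fail", "fail", "pass", "abstain"]
reviewer_2 = ["pass", "fail", "fail", "fail", "pass", "abstain"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # 0.739
```

A low kappa between two reviewers is usually a sign that the rubric is ambiguous, which is worth fixing before using their labels to calibrate a model-based judge.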

Capability asymmetry and information asymmetry

Asymmetry is one of the most important ideas in this entire guide. Deeply familiarize yourself with the idea. Your judge has to materially differ from the agent in order to effectively assess the outcomes of your agent. The two primary ways that the judge can differ are capability asymmetry and information asymmetry:

Information asymmetry means that your judge knows something that your agent does not. This may take the form of pre-sourced golden set context, a rubric, or, most likely, some combination of both.
Capability asymmetry means that your judge has some capability that your agent does not. Some examples are code-based judges (your agent is probabilistic), subject matter experts, or better models (e.g., using GPT-5 to grade GPT-4). In many cases, this difference also presents itself as information asymmetry: your subject matter experts typically have more contextual knowledge than the agent, code-based graders always know when a criterion isn’t being met, and better models are typically better because they know something weaker models don’t.

Without some amount of judge-agent asymmetry, you end up with the blind leading the blind. Reframed as a pair of questions: “What does the judge have access to that the agent doesn’t? Is it something we could just pass to the agent?” Your asymmetry typically hinges on the second question resolving to “no”. If it resolves to “yes”, then congratulations! You just made your agent better without having to build the judge.

Astronaut meme: 'It was always just looking at the data?' / 'Always has been.'

The map is not the territory

A key principle to keep in mind when reviewing your graders: Your graders do not represent the entire surface area of your product. Too often people will see a fully green eval suite and assume that the agent is perfect when this is far from the truth.

Pass@K versus Pass^K

Given the probabilistic nature of your agent, you will likely want to run multiple trials for each of your judges. This will help you ascertain how likely your agent is to succeed at a given task and how consistently it does so. In other words, just because your agent passed a judge one time does not mean it will pass it the next time. To make this a little more tractable, think of an agent attempting to retrieve a customer’s hotel reservation information. Running this judge with one trial (K=1) will let you know how it did on that attempt, but it doesn’t give you any information on how likely that pass was.

Pass@K gives you the likelihood that your agent will pass a given judge at least once if given K attempts. Most of the time when you see benchmarks in the wild, they are some form of pass@K. In my opinion, pass@K obfuscates the true utility of a given agent since this isn’t how users interact with the product. When was the last time you asked an agent the same question 5 times and then knew which answer was correct? At that point why even ask the agent?

In reality, customers will ask the agent once. If it gets it wrong, they will get frustrated and bounce from the tool or move on to something else. This is where Pass^K comes into play: Pass^K measures the probability that your agent gets all K attempts right for a given K. High pass^K typically means a highly reliable agent whereas low pass^K means your agent gets the right answer only occasionally.

It is worth tracking both pass^K and pass@K over at least a few trials when running your offline evals to get a better sense of the true behavioral profile of your agent.
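To see how far the two metrics can diverge, suppose each attempt passes independently with probability p. Then pass@K = 1 - (1 - p)^K and pass^K = p^K. With p = 0.7 and K = 3, pass@3 is 0.973 while pass^3 is only 0.343. The sketch below estimates both from recorded trials instead of a known p: given n trials of which c passed, it computes the fraction of size-K subsets of those trials that contain at least one pass (pass@K) or only passes (pass^K). The function names are hypothetical, and the independence of trials is an assumption.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from c passes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass) from c passes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of passes.
    return comb(c, k) / comb(n, k)


# Assumed example: 10 trials of the reservation lookup, 7 of which passed.
for k in (1, 3, 5):
    print(k, round(pass_at_k(10, 7, k), 3), round(pass_hat_k(10, 7, k), 3))
```

At K=1 the two estimates coincide with the plain pass rate; as K grows, pass@K climbs toward 1 while pass^K falls, which is why reporting only one of them can paint a misleading picture.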

Pin everything

A difference in scores only means something if the thing you intended to change is the only thing that changed. Otherwise you will find yourself measuring harness changes conflated with model changes and agent changes, with no way of knowing how much of the difference in scores came from what you care about. Saving this information directly to your eval card (Eval Design) should be considered standard practice.

It is typically best practice to pin anything that could affect your judges’ scores such as:

Models and Model Versions
Environments and Environment Versions
Prompts
Datasets
Sampling Parameters
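One way to make the pins above concrete is to record them in a single frozen structure and store a fingerprint of it alongside every score. This is only a sketch: the `PinnedRun` fields, the example values, and the helper names are hypothetical, and your eval card may track more or different items.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PinnedRun:
    # Hypothetical set of pinned values for one eval run.
    model: str
    model_version: str
    environment_version: str
    prompt_sha256: str   # hash of the exact prompt text used
    dataset_sha256: str  # hash of the exact dataset file used
    temperature: float
    top_p: float
    seed: int


def fingerprint(run: PinnedRun) -> str:
    """Stable hash of every pinned value; store it next to each score."""
    payload = json.dumps(asdict(run), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_fields(a: PinnedRun, b: PinnedRun) -> list[str]:
    """Names of the pinned values that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


baseline = PinnedRun(
    model="example-model",
    model_version="2025-01-01",
    environment_version="env-1.4.0",
    prompt_sha256=hashlib.sha256(b"You are a hotel assistant.").hexdigest(),
    dataset_sha256=hashlib.sha256(b"golden-set-v3").hexdigest(),
    temperature=0.0,
    top_p=1.0,
    seed=7,
)
candidate = replace(baseline, temperature=0.7)

print(fingerprint(baseline) == fingerprint(candidate))  # False
print(changed_fields(baseline, candidate))              # ['temperature']
```

Before comparing two scores, checking that `changed_fields` returns only the item you meant to vary is a cheap guard against conflating harness, model, and agent changes.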