Section 13 · What are the common fail-cases?

Foot-guns

The road to eval hell is paved with good intentions

Silently failing eval suites

There are dangers lurking even when your evals are all showing green. Below is a non-exhaustive list of common threats to your eval suite, bucketed by the type of impact they have on your evals. Any one of these threats can wreak havoc if it goes undetected for too long.

Instead of explaining each of these in depth, I recommend copying and pasting this section directly into your agent and pointing it at your judge and eval set as you build them. When prompted, most models can quickly identify which of these threats are at play.

UNO 'Draw 25' meme: the choice is 'look at the data' or draw 25.

Validity Threats

Reward-hacking ≈ Goodhart’s Law ≈ Judge Hacking ≈ Objective Mismatch
Eval⇔Agent Harness Drift
Overfitting
Distributional Shifts / Data Drift / Covariate Shift
Sampling Bias → Train-Serving Skew / Train-Test Mismatches
Model Drift
Selection Bias → Collider Bias → Berkson’s Paradox
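
One rough way to check for distributional shift between an eval set and live traffic is to compare how often each category of input appears in both. The sketch below assumes every example carries a coarse category label. The labels, the sample data, and the `total_variation_distance` helper are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def category_shares(labels):
    """Return the fraction of examples that fall into each category."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(eval_labels, prod_labels):
    """0.0 means identical category mixes, 1.0 means no overlap at all."""
    p = category_shares(eval_labels)
    q = category_shares(prod_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical data: the eval set has never seen an "outage" request,
# but half of production traffic is about outages.
eval_set = ["billing"] * 50 + ["refund"] * 50
production = ["billing"] * 20 + ["refund"] * 30 + ["outage"] * 50

drift = total_variation_distance(eval_set, production)
print(f"category drift: {drift:.2f}")
```

A value near zero suggests the eval set still resembles production on this one axis. It says nothing about shifts inside a category, so treat it as a smoke alarm rather than a guarantee.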

Eval Quality Threats

Hallucinations (Judge)
Treating everything as a regression judge
Edge-Case Myopia, Sensitivity, and False Positives
Position bias
Verbosity bias
Self-preference bias
Context Poisoning / Instruction Adherence
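
Position bias in a pairwise judge can be probed by asking the same question twice with the candidates swapped. A minimal sketch follows, assuming the judge is a callable that takes two answers and returns "A" or "B". That interface and both toy judges are hypothetical stand-ins for a real LLM judge call.

```python
from typing import Callable, Sequence, Tuple

# Assumed interface: judge(first_answer, second_answer) -> "A" or "B".
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs where the verdict does not follow the answer when the order is swapped."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    inconsistent = 0
    for first, second in pairs:
        original = judge(first, second)
        swapped = judge(second, first)
        # A position-independent judge picks the same underlying answer both
        # times, so the returned letter must change after the swap.
        if {original, swapped} != {"A", "B"}:
            inconsistent += 1
    return inconsistent / len(pairs)


def always_first(a: str, b: str) -> str:
    """Toy judge with maximal position bias."""
    return "A"


def longer_wins(a: str, b: str) -> str:
    """Toy judge with no position bias, but pure verbosity bias."""
    return "A" if len(a) >= len(b) else "B"


pairs = [("short", "a much longer answer"), ("detailed reply here", "ok")]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(longer_wins, pairs))
```

The second toy judge passes this check while still being a poor judge, which is why each bias in the list needs its own probe.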

Operational Threats

Parkinson’s Law & Token budgets
Synthetic Datasets (Over-reliance on AI)
Latency
Scaling Bottlenecks
Cache contamination
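
One reading of cache contamination is that cached results keep being served after the model, judge prompt, or sampling settings have changed. Under that assumption, a cheap guard is to derive the cache key from every input that can affect the result. The field names below (`model_id`, `judge_prompt_version`, `sampling_params`) are hypothetical and should be adapted to your own harness.

```python
import hashlib
import json


def eval_cache_key(model_id, prompt, judge_prompt_version, sampling_params):
    """Build a cache key that changes whenever any result-affecting input changes."""
    payload = json.dumps(
        {
            "model_id": model_id,
            "prompt": prompt,
            "judge_prompt_version": judge_prompt_version,
            "sampling_params": sampling_params,
        },
        sort_keys=True,      # key order in dicts must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical identifiers, used only to show how the key behaves.
old_key = eval_cache_key("model-v1", "Summarize the ticket.", "judge-2", {"temperature": 0.0})
new_key = eval_cache_key("model-v2", "Summarize the ticket.", "judge-2", {"temperature": 0.0})
print(old_key != new_key)
```

With a key like this, upgrading the model or editing the judge prompt produces cache misses instead of silently reusing old scores.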

A common pattern

The surface-level failures vary widely from area to area, but there is a common throughline among all of these foot-guns: each failure is a case of our proxy not lining up with reality. When reviewing your evals, a great question to ask yourself is “Are we as close as we can get to measuring what is really happening?” You will be surprised at how often the answer is “No.”
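
One way to put a number on how well the proxy lines up with reality is to compare judge verdicts against a small set of human labels on the same examples. The sketch below computes Cohen's kappa, which discounts the agreement two raters would reach by chance. The labels are made-up sample data, and using human review as the reference point is an assumption about your setup.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: the judge passes two examples that humans failed.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "pass", "fail", "pass", "pass", "pass", "fail"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

In this sample the judge agrees with humans on 75% of examples, yet kappa is only 0.5, because a judge that leans toward "pass" would agree fairly often by chance alone.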