Section 11 · How do we communicate what we have learned?

Communicating Results

A number nobody reads is a number nobody acts on.

The deliverable is a decision

You did everything right. Judges built, calibrated against human labels, running on production traces, pass rates landing in a database. And then the numbers sit in a dashboard with four monthly visitors, and the product stays exactly as broken as it was. An eval suite nobody reads is expensive decoration.

The fix is a reframe: the deliverable is not a number, it is a decision someone can make from the number. Every metric you publish should come with an implied reader and an implied reaction. A regression judge dips, and the on-call engineer investigates. A capability judge saturates, and the team graduates it and picks the next hill. A new failure mode shows up in the error analysis, and someone writes a judge for it. If you can't name who reacts to a metric and what they do when it moves, you haven't built communication. You've built telemetry, and telemetry without a reader is a write-only database.

This is the test to run on every readout in this spoke. The question is not "is this number accurate?"; that was the job of calibration. The question here is "did this number cause anyone to do anything?" The rest of this page covers the four channels where that handoff happens: dashboards, CI gates, monitoring of the evals themselves, and the humans on the other end.

Two dashboards, never one

The goals spoke split every judge into capability or regression. Your dashboards have to honor that split, because the two kinds of numbers demand opposite readings.

A regression dashboard should be boring. Every panel near 100%, all green, and wired to alert on dips. Nobody should be studying it; the dashboard studies itself and pages someone when a line moves. If your regression dashboard is interesting, you have an incident.

A capability dashboard is the opposite: trend lines climbing toward goals, meant to be read weekly, by humans, looking for slope. Flat lines are the news here: a capability judge that hasn't moved in six weeks means your changes aren't touching that capability.

And never blend the two into one composite health score. A regression judge dropping from 99 to 94 (an incident) and a capability judge climbing from 64 to 69 (a good week) average out to a flat line that says nothing happened.
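The arithmetic behind that flat line is easy to check. A minimal sketch, assuming the "composite health score" is an unweighted mean of the two judges; the judge names are placeholders and the numbers are the ones from the example above:

```python
from statistics import mean

# Pass rates (percent) for the two judges in the example above.
last_week = {"regression": 99.0, "capability": 64.0}
this_week = {"regression": 94.0, "capability": 69.0}


def composite(scores: dict[str, float]) -> float:
    """A naive 'health score': the unweighted mean of every judge."""
    return mean(scores.values())


def per_judge_delta(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Week-over-week change for each judge, kept separate."""
    return {name: after[name] - before[name] for name in before}


print(composite(last_week), composite(this_week))  # 81.5 81.5
print(per_judge_delta(last_week, this_week))       # {'regression': -5.0, 'capability': 5.0}
```

The composite reports no change, while the per-judge view shows an incident and a good week side by side.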

Two more requirements for either dashboard:

  • Slice it. A single top-line pass rate hides everything useful. Slice by judge (which property moved), by grain (did sessions degrade while spans held), and by segment (your support bot can be flat overall while the enterprise segment quietly tanks and free-tier improves). The top-line number answers "are we okay?"; the slices answer "where do I look?"
  • Show error bars. With a few hundred cases behind a judge, a 1-point move is usually noise. Plot the confidence interval, not just the point estimate, so a wiggle inside the band doesn't get read as signal. A team that celebrates +1 one week and panics at -1 the next is reading the same noise twice and reacting both times.
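One way to put error bars on a sliced pass rate is a Wilson score interval per slice. The sketch below is an illustration, not a prescribed method: the segment names and counts are made up, and treating "intervals do not overlap" as the bar for signal is a deliberately crude, conservative rule.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("no cases judged")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def is_signal(prev: tuple[int, int], curr: tuple[int, int]) -> bool:
    """Crude check: call a move signal only if the two intervals do not overlap."""
    lo_a, hi_a = wilson_interval(*prev)
    lo_b, hi_b = wilson_interval(*curr)
    return hi_a < lo_b or hi_b < lo_a


# Hypothetical slices: segment -> (passes, cases judged).
slices = {"enterprise": (168, 200), "free_tier": (285, 300)}
for name, (passes, total) in slices.items():
    lo, hi = wilson_interval(passes, total)
    print(f"{name}: {passes / total:.1%} (95% CI {lo:.1%} to {hi:.1%})")

# A 1-point move on 300 cases: 90% -> 91%.
print(is_signal((270, 300), (273, 300)))  # False
```

At 300 cases and a 90% pass rate the interval is roughly 3 points wide on each side, so a 1-point move sits well inside the band.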

Evals in CI: gates, not suggestions

Your regression suite is the unit-test analogue for model behavior, and it should run like one: on every change that can alter what the product does. That list is longer than people expect: prompt edits, tool definition changes, model version upgrades, retrieval tweaks, temperature changes. A prompt edit feels like a copy change. It is a code change to the most load-bearing code you have, and it merges ungated at your peril.

Running every judge on every PR is usually too slow and too expensive, so tier it:

  • Smoke suite, per PR: your highest-severity regression judges on a small case set. Cheap graders where possible, minutes not hours. This is the merge gate.
  • Full suite, nightly: everything, all judges, full datasets, the expensive LLM-judged cases. Failures here open tickets and land on the regression dashboard by morning.

The non-negotiable part: the per-PR gate blocks. A warning gate trains people to ignore it: the first yellow banner gets investigated, the tenth gets scrolled past, and by the twentieth your suite is background noise that costs money. If a regression judge fails, the merge stops, exactly like a failing unit test.
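A blocking gate can be as small as a script whose exit code the CI system treats as pass or fail. The sketch below assumes your smoke suite already produces a pass rate per judge; the `JudgeResult` type, judge names, and thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    judge: str
    kind: str         # "regression" or "capability"
    pass_rate: float  # 0.0 to 1.0
    threshold: float  # minimum acceptable pass rate


def blocking_failures(results: list[JudgeResult]) -> list[JudgeResult]:
    """Regression judges below threshold. Capability judges never block."""
    return [r for r in results if r.kind == "regression" and r.pass_rate < r.threshold]


def gate_exit_code(results: list[JudgeResult]) -> int:
    """Return 1 (block the merge) if any regression judge failed, else 0."""
    failures = blocking_failures(results)
    for r in failures:
        print(f"BLOCKED: {r.judge} at {r.pass_rate:.1%} (threshold {r.threshold:.1%})")
    return 1 if failures else 0


# Hypothetical smoke-suite output for one PR.
smoke_results = [
    JudgeResult("no_pii_leak", "regression", 1.00, 0.99),
    JudgeResult("refund_policy_accuracy", "regression", 0.94, 0.98),
    JudgeResult("multi_step_planning", "capability", 0.64, 0.0),
]
exit_code = gate_exit_code(smoke_results)
print(exit_code)  # 1
# In CI, end the script with: raise SystemExit(exit_code)
```

There is no warning path in this sketch on purpose: the script either returns 0 or stops the merge.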

When the gate fires, there are only two honest outcomes. Either it caught a real regression, and you fix the change. The gate just paid for itself. Or the judge's verdict is wrong, and you fix the judge: update the rubric or case, recalibrate, and note why. What you never do is toggle the gate off to ship. Every bypass teaches the team the gate is optional, and an optional gate is a warning with extra steps.

Figure (flowchart not yet available): When the merge gate fires. The two honest responses to a failing eval gate (fix the change or fix the judge) and the one response that rots the suite.

Eval observability: who watches the judges?

Your evals are production software, and they fail like production software. A suite with no monitoring of its own degrades silently: the dashboard keeps rendering, the numbers keep updating, and what they describe drifts further from reality. Three meters to keep on the suite itself:

Latency. Slow judges back up the pipeline. If judging can't keep pace with trace volume, traces queue and then get dropped, often silently. The longest, gnarliest traces tend to go first, and those are exactly the ones most likely to contain failures.

Error rates. Judges crash. APIs time out. The judge model returns something your parser can't read. Every errored trace silently shrinks your sample, and the dashboard doesn't show the shrinkage. It shows a pass rate over whatever survived. A judge that errors on 30% of traces and passes 95% of the rest is not reporting 95%. It's reporting 95% of a biased remainder, because the traces that crash a judge (long ones, malformed tool outputs, weird encodings) correlate heavily with the traces that fail it.
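To see how much the errored traces could be hiding, report the pass rate alongside its worst-case and best-case bounds. A minimal sketch using the 30% / 95% figures above on a hypothetical batch of 1,000 traces; the function name is illustrative:

```python
def pass_rate_bounds(passed: int, failed: int, errored: int) -> tuple[float, float, float]:
    """Return (reported, worst_case, best_case) pass rates.

    reported   ignores errored traces, which is what the dashboard shows
    worst_case assumes every errored trace would have failed
    best_case  assumes every errored trace would have passed
    """
    judged = passed + failed
    total = judged + errored
    if judged == 0:
        raise ValueError("no traces were successfully judged")
    return passed / judged, passed / total, (passed + errored) / total


# 1,000 traces: 300 errored, and 95% of the remaining 700 passed.
reported, worst, best = pass_rate_bounds(passed=665, failed=35, errored=300)
print(f"reported {reported:.1%}, true rate between {worst:.1%} and {best:.1%}")
# reported 95.0%, true rate between 66.5% and 96.5%
```

Because the traces that crash a judge tend to be the ones that would fail it, the true rate likely sits toward the lower end of that range.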

Sampling rates. Almost nobody judges 100% of production traffic; cost forces a sample. Fine, but know the fraction, and know whether it's biased. If you judge 5% of traces and the sampler skips long sessions to control spend, your numbers describe a product your heaviest users never see.
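If you know how production traffic splits across strata and the sampler under-represents one of them, you can reweight the sampled pass rates back to the real mix. This is a sketch of simple post-stratification under assumed numbers; the "short" and "long" strata, their traffic shares, and the counts are hypothetical. It also refuses to produce a number when a stratum was never sampled, since no reweighting can recover that.

```python
def reweighted_pass_rate(
    traffic_share: dict[str, float],
    sampled: dict[str, tuple[int, int]],
) -> float:
    """Weight each stratum's sampled pass rate by its share of production traffic.

    traffic_share: stratum -> fraction of production traffic (sums to 1.0)
    sampled:       stratum -> (passes, traces judged)
    """
    uncovered = [
        s for s, share in traffic_share.items()
        if share > 0 and sampled.get(s, (0, 0))[1] == 0
    ]
    if uncovered:
        raise ValueError(f"no judged traces for strata: {uncovered}")
    return sum(
        share * sampled[s][0] / sampled[s][1]
        for s, share in traffic_share.items()
        if share > 0
    )


traffic_share = {"short": 0.8, "long": 0.2}
sampled = {"short": (475, 500), "long": (14, 20)}  # sampler under-samples long sessions

naive = sum(p for p, _ in sampled.values()) / sum(n for _, n in sampled.values())
print(f"naive {naive:.1%}, reweighted {reweighted_pass_rate(traffic_share, sampled):.1%}")
# naive 94.0%, reweighted 90.0%
```

The reweighted number rests on only 20 long-session traces, so it is noisy; the fix for a biased sampler is still to sample the long sessions.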

The common thread: every one of these failures makes the dashboard look healthier, not sicker. Dropped traces, errored judges, and skewed samples all remove hard cases from the denominator. Green is not the same as good.

The human layer

The last hop is the one where most suites die: the handoff from dashboard to human. Three habits keep it intact.

One definition per metric, shared. If the PM, the engineer, and the on-call each carry a private definition of "resolution rate" (solved in one session? user didn't return for a week? judge said resolved?), they will argue fluently about a number that means three different things. Every metric name should link to exactly one written definition: the rubric, the grain, the dataset. When someone asks "what does this measure?", the answer is a link, not a meeting.

Label capability vs regression everywhere the number travels. Not just on the dashboard: in the Slack alert, the weekly email, the launch review slide. The same 96% is an alarm on a regression panel and a strong quarter on a capability panel, and a reader who can't see the label will pick the flattering interpretation.

Ship the eval card with the number. An eval card is a short standing document per judge: what it measures, at what grain, the dataset behind it (size, source, last refresh), calibration stats against human labels, known blind spots, and an owner. The habit that matters: whenever a number leaves the dashboard (exec email, launch doc, board slide), the card goes with it. A number traveling without its card invites the reader to imagine what it means, and they will imagine wrong, usually in whichever direction the meeting needed.
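One way to make the card hard to leave behind is to keep it as structured data and generate every outbound readout from it. The sketch below mirrors the fields listed above; the class name, the `readout` helper, the calibration key, and every example value are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalCard:
    judge: str
    kind: str                      # "regression" or "capability"
    measures: str                  # what the judge measures
    grain: str                     # e.g. "span" or "session"
    dataset_size: int
    dataset_source: str
    last_refresh: date
    calibration: dict[str, float]  # stats against human labels
    blind_spots: list[str]
    owner: str


def readout(card: EvalCard, pass_rate: float) -> str:
    """Format a number so its label and card travel with it."""
    calib = ", ".join(f"{k}={v:.2f}" for k, v in card.calibration.items())
    return "\n".join([
        f"{card.judge} [{card.kind}]: {pass_rate:.1%}",
        f"  measures: {card.measures} (grain: {card.grain})",
        f"  dataset: {card.dataset_size} cases from {card.dataset_source}, "
        f"refreshed {card.last_refresh.isoformat()}",
        f"  calibration vs human labels: {calib}",
        f"  blind spots: {'; '.join(card.blind_spots) or 'none recorded'}",
        f"  owner: {card.owner}",
    ])


card = EvalCard(
    judge="resolution_rate",
    kind="regression",
    measures="judge marked the user's issue resolved within one session",
    grain="session",
    dataset_size=400,
    dataset_source="sampled production traces",
    last_refresh=date(2025, 1, 15),
    calibration={"agreement": 0.91},
    blind_spots=["non-English sessions"],
    owner="support-evals team",
)
print(readout(card, 0.96))
```

Because the capability-vs-regression label is part of the first line, a reader of the Slack alert or the slide sees it without opening the dashboard.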

This is the report stage of the loop for a reason. The readout is where measurement either becomes work on the product or becomes a PDF.

If you remember nothing else

  • 01 The deliverable is a decision, not a number: every metric needs a named reader and an implied reaction.
  • 02 Regression dashboards should be boring and alarmed; capability dashboards should trend. Never average them together.
  • 03 No error bars, no interpretation: a 1-point move on a few hundred cases is usually noise.
  • 04 CI gates block, never warn. A warning gate trains everyone to scroll past it; a bypassed gate is a warning with extra steps.
  • 05 Monitor the evals themselves: judge latency, error rates, and sampling bias all make dashboards greener while making them mean less.
  • 06 One shared definition per metric, capability-vs-regression labels everywhere, and the eval card travels with the number.