An adversarial eval is a gate, not a metric

I had been running model comparisons for two days — current production model against a newer one with reasoning enabled, on the same prompt, same dataset, same judge. The numbers were clean: 76.4 % accuracy on the old model, 91.5 % on the new one. Three quarters cheaper to run. Same provider, same SDK, one env var to swap. I drafted the Slack message in my head before the eval even finished.

Then somebody on the team asked: "isn't that too good to be true?"

It wasn't, in the literal sense. The prices were public, the runs were in LangSmith, the delta was real. But the +15 pp figure would have been copied into a doc, repeated in a meeting, and used to justify a migration. The moment somebody ran the new model on real traffic and saw +3 pp instead of +15, the credibility of the whole experiment would dissolve.

An adversarial dataset is not a representative one

The dataset I was measuring against was 54 examples curated over a year, one Slack thread at a time. Every row started life as a bug report. "The agent hallucinated a partnership." "It picked Yes when the data clearly said No." "We demoed this and it got the founding year wrong." Somebody added it to the dataset, wrote a one-line expected behaviour, and moved on. Over time the set became a tight, painful collection of every way the agent had embarrassed us.

That's what an adversarial eval dataset is supposed to be: a regression test suite for the model's known failure modes. It exists so that when you change a prompt, a tool, or a model, the failures you fixed last month don't silently come back. LangSmith's eval docs are organised around this pattern: the dataset is a contract, not a sample.

A contract dataset and a representative dataset are different objects with different math. A 76 % score on an adversarial set does not mean the model gets 76 % of real questions right; it means it gets 76 % of the known-hard questions right. The two numbers can diverge by 20 pp in either direction. The delta between two models is distorted too: the new model wins big precisely on the cases that were curated because the old one failed on them. The research literature on what models can and can't know, and on eval-set design, argues for combining curated stress tests with representative sampling, but it's easy to nod at that in a paper and forget it while staring at a green +15 pp delta in the UI.
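To make the arithmetic concrete, here is a small sketch. The two adversarial scores are the measured ones from the runs; the 20 % share of hard cases in production and the 97 % accuracy on everything else are invented for illustration, not measurements.

```python
def blended_accuracy(hard_share: float, acc_hard: float, acc_easy: float) -> float:
    """Accuracy on traffic where `hard_share` of requests resemble the adversarial set."""
    if not 0.0 <= hard_share <= 1.0:
        raise ValueError("hard_share must be between 0 and 1")
    return hard_share * acc_hard + (1.0 - hard_share) * acc_easy


# Measured on the adversarial set.
OLD_HARD, NEW_HARD = 0.764, 0.915

# ASSUMPTIONS, not measurements: how much of production looks like the
# adversarial set, and how both models do on the remaining traffic.
HARD_SHARE = 0.20
EASY_ACC = 0.97

old_overall = blended_accuracy(HARD_SHARE, OLD_HARD, EASY_ACC)
new_overall = blended_accuracy(HARD_SHARE, NEW_HARD, EASY_ACC)

adversarial_delta_pp = (NEW_HARD - OLD_HARD) * 100
production_delta_pp = (new_overall - old_overall) * 100

print(f"adversarial delta: {adversarial_delta_pp:+.1f} pp")  # +15.1 pp
print(f"production delta:  {production_delta_pp:+.1f} pp")   # +3.0 pp
```

Under those assumed numbers, a +15.1 pp gap on the adversarial set shrinks to about +3 pp on traffic. Change the assumed share and the figure moves with it, which is the point: the adversarial delta alone does not pin down the production one.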

The message I almost sent

The draft in my head was three bullet points: cost down 75 %, accuracy up 15 pp, latency irrelevant for batch. Two sentences of caveats, ship it.

The "too good to be true" question made me reconstruct what I was claiming. The cost number was a measurement: the new model is published at a quarter of the old model's price per token, and the billing logs confirmed it. The +15 pp was also a measurement, but against a dataset whose every row was added because the agent had failed on it at some point. The new model winning by 15 pp on that set is the minimum bar to clear, not the headline.

I rewrote the message. Same numbers, different framing:

Cost: -75 % on identical runs, measured from billing logs. A measurement, not an estimate.
Accuracy: +15 pp on a 54-example adversarial dataset. Real-world delta will be smaller, probably +5–10 pp, and we won't know without sampling production traffic.
Caveat: the dataset is curated from known failures, so it overstates the gap. The eval confirms no regression on the cases that hurt us, not average production behaviour.

That version travels safely. The first one would have set up an expectation the new model couldn't meet in week two of deployment.

Two datasets, two jobs

The fix isn't to apologise for adversarial datasets. It's to stop asking them to do a second job they weren't designed for. Our eval suite needs two datasets, side by side:

The adversarial set we already have. 54-ish examples, curated from real failures, growing one row per incident. Used for regression detection: if a change drops the score here, it's breaking something that hurt us before. The number on this set is a gate — binary, fast, run on every change.
A representative set we don't have yet. 200–500 examples sampled randomly from the last 30 days of production traffic, ground truth annotated by hand. This is the number you quote in Slack and in the migration doc. Expensive to build, slow to refresh, and the only honest source for a headline accuracy.
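One way the adversarial set could act as a gate is sketched below. It assumes each run yields a pass/fail verdict per example; the example ids and outcomes are hypothetical, and `regression_gate` is an illustrative helper, not part of any eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    baseline_score: float
    candidate_score: float
    regressed: list[str]  # examples the baseline passed and the candidate fails


def regression_gate(baseline: dict[str, bool], candidate: dict[str, bool]) -> GateResult:
    """Compare per-example pass/fail results on the adversarial set.

    The gate passes only if the candidate's score is not below the baseline's.
    An example missing from `candidate` counts as a failure.
    """
    if not baseline:
        raise ValueError("baseline results are empty")
    n = len(baseline)
    baseline_score = sum(baseline.values()) / n
    candidate_score = sum(candidate.get(ex, False) for ex in baseline) / n
    regressed = sorted(
        ex for ex, ok in baseline.items() if ok and not candidate.get(ex, False)
    )
    return GateResult(candidate_score >= baseline_score, baseline_score, candidate_score, regressed)


# Hypothetical example ids and outcomes, for illustration only.
old = {"partnership-hallucination": False, "yes-no-flip": True, "founding-year": True}
new = {"partnership-hallucination": True, "yes-no-flip": True, "founding-year": False}

result = regression_gate(old, new)
print(result.passed, result.regressed)  # True ['founding-year']
```

In this toy run the score is unchanged, so a score-only gate passes even though one previously fixed failure has come back. Reporting the regressed ids next to the boolean keeps that swap visible; whether any regressed id should fail the gate outright is a policy choice the sketch leaves open.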

Confusing them is the cheap mistake, because the adversarial set is already in source control while the representative set is always a quarter away — it requires somebody to sit down and annotate 300 rows. So when the migration question lands on your desk, you reach for the dataset in front of you and report its number.

What I do now

Tag the dataset by intent in its name, with a suffix: -eval-adversarial or -eval-traffic. Not in a README, in the name. The name shows up next to the score everywhere it's reported.
Quote the dataset shape in the same sentence as the number. "+15 pp on the 54-example adversarial set" reads differently from "+15 pp accuracy". The first invites the right question; the second invites trust the data hasn't earned.
Treat the representative dataset as a project, not an intention. It will not appear on its own. Block a day, sample the traffic, get a teammate to annotate alongside you.
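The first two habits can be enforced in code rather than by memory. The sketch below is one possible shape: `EvalDataset` and `describe_delta` are hypothetical helpers, and the `myagent` prefix is a made-up placeholder for whatever precedes the suffix in a real dataset name.

```python
from dataclasses import dataclass

# Name suffix -> wording used when the score is quoted.
INTENTS = {"adversarial": "adversarial", "traffic": "representative"}


@dataclass(frozen=True)
class EvalDataset:
    name: str  # must end in -eval-adversarial or -eval-traffic
    size: int  # number of examples

    @property
    def intent(self) -> str:
        for suffix, label in INTENTS.items():
            if self.name.endswith(f"-eval-{suffix}"):
                return label
        raise ValueError(f"{self.name!r} does not declare its intent in its name")


def describe_delta(delta_pp: float, dataset: EvalDataset) -> str:
    """Render a delta so the dataset shape travels in the same sentence."""
    return (
        f"{delta_pp:+.1f} pp on the {dataset.size}-example "
        f"{dataset.intent} set ({dataset.name})"
    )


print(describe_delta(15.1, EvalDataset("myagent-eval-adversarial", 54)))
# +15.1 pp on the 54-example adversarial set (myagent-eval-adversarial)
```

A dataset whose name carries no intent suffix raises instead of producing a bare number, so the unqualified "+15 pp accuracy" cannot be generated by accident.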

What I'm still not sure about

I don't have a clean answer for how to grow the representative set without it becoming adversarial by stealth. If I sample 300 production examples and annotate them, the cases the annotator writes long notes on are exactly the ambiguous, near-miss ones — the same selection bias that built the adversarial set in the first place. The likely answer: re-sample periodically, have the annotation done by somebody who didn't pick the examples, and review the score alongside its sample size and date range. None of that is hard. It's process, and process is what gets dropped when there's a green number on the screen and a Slack draft waiting.
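The re-sampling part of that process is the easiest to pin down in code. The sketch below assumes production records are dicts with an `id` and a `day` field, which is an invented shape, and it draws a seeded uniform sample so the person annotating never chooses the rows. The records in the usage example are synthetic.

```python
import random
from datetime import date, timedelta


def sample_traffic(records: list[dict], end: date, days: int = 30,
                   k: int = 300, seed: int = 0) -> dict:
    """Draw a uniform random sample from the last `days` days, ending on `end`.

    Returns the sample together with the metadata that should be quoted
    next to any score computed on it: date range, seed, and pool size.
    """
    start = end - timedelta(days=days)
    window = [r for r in records if start < r["day"] <= end]
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    examples = rng.sample(window, min(k, len(window)))
    return {
        "start": start + timedelta(days=1),  # first day included in the window
        "end": end,
        "seed": seed,
        "pool_size": len(window),
        "examples": examples,
    }


# Synthetic traffic: 6000 records spread evenly over the last 60 days.
end = date(2024, 6, 30)
records = [{"id": i, "day": end - timedelta(days=i % 60)} for i in range(6000)]

batch = sample_traffic(records, end=end)
print(len(batch["examples"]), batch["start"], batch["end"], batch["pool_size"])
```

Selection happens before anyone reads a row, which removes one source of bias. It does nothing about which rows later attract long notes and follow-up attention; that part stays a process question.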

An adversarial eval is a regression test for the failures you already know about. Before quoting a delta on it, ask whether you're reporting "did this regress?" or "how good is this on average?" — they need different datasets, and only one of them is sitting in your repo today.