A week after we shipped the LLM-as-judge into our prompt iteration loop, I ran the first full experiment with it. Forty-nine examples, two repetitions each, judge enabled. The run finished. I opened the LangSmith UI expecting to read forty-nine judgements. Instead, more than half the rows showed grounding_judge: —. No score. No comment. A small note in the feedback: "skipped (no notes/expected_behavior/evaluation_criteria and no failure_mode)."
Thirty-one of forty-nine. The judge had nothing to judge against.
The reflex is to blame the judge code — surely the criteria-extraction is too strict. It wasn't. The judge was reading example.metadata.expected_behavior and example.metadata.failure_mode, both empty on most rows. But the examples were not without information. Almost every one of those skipped rows had a field called notes in its outputs, with sentences like "They do have manufacturing locations in Canada", "The CEO is Markus Halvorsen", "Make sure it's using data about the right company". Human-written, useful, in the wrong place for the machine.
This is a small operational story but it's the one the eval-tutorials skip, and it's the one I want to write down. An evaluation dataset has two readers — the human teammate browsing it and the LLM judge consuming it — and they need different things from the same row. If you only design for one, the other goes silent.
The two readers of an eval dataset
Most eval frameworks I've read about treat the dataset as a programmatic input. Inputs, outputs, metadata, evaluators. Clean. And in the demos that's all there is — someone seeds twenty examples, the judge reads them, the scores come out.
In practice, the dataset is also a shared engineering artefact. The product manager pastes a failing case into it. The on-call engineer adds the run that broke yesterday. The intern auditing the agent leaves a note that says "this should be a yes, the company is clearly headquartered in northern Italy." The dataset is, before anything else, a Slack channel for the team's collective memory of how the agent fails — written in prose, indexed by example.
That prose lives where it's easiest to write it: a freeform notes string in the example's outputs. It's the right place for a human. You open the example, you see your context, you remember what this case was about. For a year that worked fine — we were the only readers.
The moment the LLM-as-judge arrives, the dataset has a second reader who can't see prose unless you tell it exactly where to look. And no judge implementation I've seen, including ours, defaults to reading freeform notes. The judges expect structured criteria: expected_behavior, failure_mode, evaluation_criteria. Strings the prompt template can interpolate without ambiguity.
So you end up with this asymmetry. The team writes truth in prose, in the field that's natural to write prose in. The judge reads truth from structured metadata, in fields that don't exist yet. The two never meet. And because the judge silently skips rows it can't evaluate, you don't notice for weeks — your dashboard shows a clean score on the rows it could grade, and you assume the rest passed.
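To make the failure mode concrete, here is roughly what that criteria-extraction looks like, sketched as a LangSmith-style custom evaluator. The names are illustrative rather than our exact implementation; the part that matters is the skip path, which returns no score at all and lets the experiment carry on as if nothing happened.

```python
from langsmith.schemas import Example, Run

def grounding_judge(run: Run, example: Example) -> dict:
    """LLM-as-judge evaluator (sketch): only structured metadata counts as criteria."""
    meta = example.metadata or {}
    has_criteria = any(
        meta.get(k) for k in ("expected_behavior", "evaluation_criteria", "failure_mode")
    )
    if not has_criteria:
        # The silent failure: no score, just a comment buried in the feedback.
        return {
            "key": "grounding_judge",
            "comment": "skipped (no notes/expected_behavior/evaluation_criteria and no failure_mode)",
        }
    # From here: interpolate the criteria and the run's answer into the judge prompt,
    # call the grading model, and return {"key": "grounding_judge", "score": ..., "comment": ...}.
    raise NotImplementedError("judge prompt and LLM call elided in this sketch")
```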
The instinct that makes things worse
When I first saw the skipped rows, my reflex was the obvious one: migrate the notes into metadata. Move the prose from outputs.notes into metadata.expected_behavior, drop the old field, done. One source of truth.
That instinct was wrong, and I want to be specific about why, because the wrong move here is the one a tidy engineer reaches for first.
The notes are not redundant context the team would happily lose. They are how the team reads the dataset. When a teammate opens an example in the LangSmith UI, they look at the Example tab first. The metadata tab is two clicks away and rendered as a key/value table that's optimised for short strings, not paragraphs. If I move the notes there, three things happen at once:
- The reviewer's workflow breaks — the prose they were scanning has vanished from the place they were scanning.
- The metadata view, which is currently a clean four-field strip, fills up with prose and loses its structural function.
- Worse: the moment a teammate next adds a case, they go back to writing in notes, because that's where they always wrote. Now the dataset has prose in both places, half-overlapping, drifting.
The reason this is worth a paragraph is that "single source of truth" is the right rule for production data, and the wrong rule for a shared engineering artefact with two readers that have different access patterns. Pretending the two readers want the same shape forces one of them to lose. Production code can't have two truths because they will diverge under load. A small eval dataset, edited by hand by a small team, can — if the duplication is deliberate, asymmetric, and one direction.
The rule we settled on
What I did instead was duplicate, deliberately, in one direction:
- The outputs.notes field stays exactly where it is. Untouched. The team keeps writing there, the LangSmith UI keeps rendering it, the existing workflow is preserved.
- For every example that has a non-empty notes, the same string is copied into metadata.expected_behavior. The judge reads from there.
- For every example that has neither, metadata.failure_mode gets a default value ("grounding", in our case — the broadest of our two failure types). This gives the judge enough criteria to evaluate the row against the system's default grounding rules, even when nobody has hand-written a specific expectation.
- The judge code only reads metadata. Period. It never inspects outputs.notes. The duplication is one-way: human writes prose, the prose gets reflected into metadata for the judge.
That's it. Forty-nine rows, three minutes of running an update script, and the next experiment had a score on every row.
The reason this works is that the duplication is directional and bounded. The team has one place to write (notes) and the judge has one place to read (metadata). The sync runs forward only — there's no metadata edit that needs to propagate back to notes, because nothing in the loop edits metadata by hand. Eventually that sync becomes a hook on dataset write, or a CI check on the dataset diff. For now it's a script, and that's fine.
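For the record, the script is not much more than the sketch below. It uses the standard LangSmith client calls; the dataset name is a stand-in and the default failure mode is ours.

```python
from langsmith import Client

client = Client()
DATASET = "agent-grounding-evals"  # stand-in name

for ex in client.list_examples(dataset_name=DATASET):
    meta = dict(ex.metadata or {})
    notes = ((ex.outputs or {}).get("notes") or "").strip()

    if notes and not meta.get("expected_behavior"):
        # Forward sync: mirror the human prose into the field the judge reads.
        meta["expected_behavior"] = notes
    if not meta.get("expected_behavior") and not meta.get("failure_mode"):
        # No criteria at all: fall back to the broadest failure mode so the
        # judge grades against the generic grounding rules instead of skipping.
        meta["failure_mode"] = "grounding"

    if meta != (ex.metadata or {}):
        client.update_example(example_id=ex.id, metadata=meta)
```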
The metadata schema we ended up standardizing has five fields, two of them load-bearing for the judge:
```jsonc
{
  "failure_mode": "grounding",                           // judge reads
  "expected_behavior": "Agent should fetch X first...",  // judge reads
  "original_run": "sc-adc05166",                         // human trace-back
  "bug_reference": "https://.../slack-thread",           // human trace-back
  "dataset_split": "base"                                // team filtering
}
```
The interesting choice is the failure-mode default. Falling back to "grounding" on rows with no human criteria isn't a perfect signal — it tells the judge to apply the generic anti-hallucination criteria, not a case-specific expectation. That's strictly weaker than a well-written expected_behavior. But it's strictly better than —. A row that gets graded against weak criteria still surfaces regressions when the agent starts hallucinating; a row that's silently skipped doesn't surface anything.
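Mechanically, the fallback is a lookup from failure mode to a generic rubric, applied only when no hand-written expectation exists. The rubric wording below is illustrative, not our actual prompt text; the point is that a default exists at all.

```python
# Illustrative rubrics, not our actual prompt text.
DEFAULT_CRITERIA = {
    "grounding": (
        "Every factual claim in the answer must be supported by retrieved data "
        "or tool output; unsupported claims count as a failure."
    ),
    "tool-selection": (
        "The agent must call the appropriate tool before answering and must not "
        "answer from memory when a tool could have verified the fact."
    ),
}

def criteria_for(meta: dict) -> str | None:
    """Pick the judging criteria for one example's metadata (sketch)."""
    if meta.get("expected_behavior"):
        return meta["expected_behavior"]                   # case-specific, hand-written
    if meta.get("failure_mode"):
        return DEFAULT_CRITERIA.get(meta["failure_mode"])  # generic fallback
    return None                                            # nothing to judge against
```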
Each row is a regression test
The two fields I haven't justified yet — original_run and bug_reference — are the ones that matter most six months from now. They are the seam where this dataset stops being a static list of examples and starts behaving like a regression test suite.
The pattern is the same one any engineering team uses for production bugs, just translated into LLM-land.
When a defect lands in code, the mature loop is well-known: reproduce the failure, write a test that fails for the same reason, commit the fix, watch the test go green. The test stays in the suite forever, and its value isn't that it asserts the current behaviour — it's that it asserts this specific bug doesn't come back. A year later, when someone refactors that module and the test goes red, the failure message is a time machine. The PR that introduced the test, the commit that linked to the Jira ticket, the original bug reporter's description — all of that travels with it.
The LLM equivalent works almost exactly the same way. A customer reports a bad answer in Slack. Someone reproduces it. The reproduction goes into the eval dataset as a new example. The fix is a prompt change, a tool change, or a retrieval change. The next experiment runs and the example scores 1 instead of 0. From that point on, the example is permanent. Every future prompt change has to keep that score at 1 to ship.
What makes the analogy hold — and what most eval dataset designs leave out — is the trace back to why. In code, you have git blame plus the PR description plus the linked issue. In an eval dataset, by default, you have a row of inputs and outputs and no idea how it got there. Six months in, that's the same problem as inheriting a 4000-line test file with no comments. You can't tell which tests are load-bearing, which are obsolete, and which exist because of a specific customer escalation.
The two metadata fields are the answer:
- bug_reference is a permalink to the Slack thread, the issue tracker ticket, or the customer email that started the case. Opening the row tells you immediately whose problem this is, who reported it, what the severity was, and what context lives outside the dataset.
- original_run is the LangSmith trace ID of the actual run that misbehaved. Not a fresh reproduction — the original failing trace, with the original tool calls, the original messages, the original failure mode preserved. When a future regression looks similar but not identical, having the canonical broken trace next to the canonical fixed trace is what makes the diff legible.
Together they turn the dataset into a navigable changelog of how the agent has failed in production. A new engineer joining the team can open any row, click the Slack link, read the original frustration, look at the original trace, and understand why this case is in the suite. That's the same thing a comment on a regression test gives you in code — except the dataset format gives it for free, structurally, on every row, without anyone having to remember to write a comment.
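In practice, writing the row is one SDK call made while the incident is still open. A sketch, with the question and all identifiers as stand-ins (the notes sentence is one of ours):

```python
from langsmith import Client

client = Client()

# Sketch: turning yesterday's production failure into a permanent regression row.
# Dataset name, question, trace ID, and URL are stand-ins.
client.create_example(
    dataset_name="agent-grounding-evals",
    inputs={"question": "Does the company manufacture in Canada?"},
    outputs={"notes": "They do have manufacturing locations in Canada"},  # for the human reader
    metadata={
        "failure_mode": "grounding",
        "expected_behavior": "They do have manufacturing locations in Canada",  # for the judge
        "original_run": "sc-adc05166",                # trace ID of the run that misbehaved
        "bug_reference": "https://.../slack-thread",  # permalink to the report
        "dataset_split": "base",
    },
)
```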
There's one important asymmetry with code tests, and it's worth naming because it changes how you should think about coverage. A code test asserts a deterministic property: this function returns this value. An eval row asserts a probabilistic one: this agent, on this input, produces an answer that satisfies these criteria most of the time. Which is why the repetitions parameter matters — a single run on a flaky example is a single sample of a distribution, and treating it as a pass/fail signal is how teams convince themselves a prompt change worked when it didn't. Two repetitions is the floor. Five is more honest. The cost is linear and the alternative is debugging an imaginary regression for half a day.
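The knob itself is one argument on the evaluate call. A sketch against the LangSmith SDK (in older versions the import lives under langsmith.evaluation); the target and the names are stand-ins, and the judge is the one sketched earlier:

```python
from langsmith import evaluate

def my_agent(inputs: dict) -> dict:
    """Stand-in for the real agent; evaluate() calls this on each example's inputs."""
    return {"answer": "..."}

results = evaluate(
    my_agent,
    data="agent-grounding-evals",    # dataset name (stand-in)
    evaluators=[grounding_judge],    # the judge sketched earlier
    num_repetitions=2,               # two samples per example is the floor; five is more honest
    experiment_prefix="prompt-v12",  # stand-in label
)
```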
The reason I'm spending words on the bridge between Slack defects and eval rows is that this is the part the literature genuinely doesn't cover. Most eval framework docs assume the dataset is given. In practice the dataset is a living artefact, grown one production incident at a time, and the discipline of writing the row when the bug lands — not three weeks later when you have time, not as part of a quarterly cleanup, but right then, with the Slack link still warm — is the thing that decides whether you have an eval suite or a museum of past intentions.
What this changes in the team workflow
The visible payoff is the obvious one: the judge now grades every row. Eighteen graded rows became forty-nine. Prompt comparisons are real comparisons.
The less visible payoff is a small change in how the team writes new examples. The instruction we landed on, posted in the engineering channel, is three sentences:

> When you add a new example, write the human note in outputs.notes exactly like you've been doing. Then copy the same string into metadata.expected_behavior. If you only have time to do one thing, fill in metadata.failure_mode with "grounding" or "tool-selection" — that's enough to keep the row scored.
This is deliberately not a rule that says "always write structured criteria", because that rule is what other teams I've talked to imposed and then quietly stopped enforcing six weeks later. Nobody writes structured criteria when they're triaging a failing case at 6pm on a Friday — they write a sentence in the place the box accepts a sentence. The pragmatic version asks for the sentence first, the metadata second, and the failure-mode tag last. The first two are duplicates of each other, the third is a cheap escape hatch.
It's also a rule a junior engineer can follow without a training session. The expensive parts of eval discipline — choosing the right failure mode, writing tight evaluation criteria, debating whether the judge over- or under-weights a category — can come later, on the rows that get re-examined when a regression hits. The basic rows, the long tail, the ones that get written once and never looked at again, just need to be visible to the judge.
What I still don't have a clean answer for
Two things.
The first is the sync direction. Right now the script is one-shot — runs once, mirrors current notes into metadata, done. If a teammate edits notes next week, the metadata silently goes stale. The right answer is probably a pre-commit hook on the dataset, or a periodic CI job that fails the PR if a row's notes and metadata.expected_behavior drift. I haven't built it because I want to see whether drift actually happens at our scale (50 rows, four contributors) before adding the machinery. My guess is it won't, and the script-on-demand model is fine for the next quarter. My guess could be wrong.
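If drift does show up, the check is small enough that there is little excuse to skip it. A sketch of the CI version, using the same client calls as the sync script:

```python
from langsmith import Client

def find_drifted_rows(dataset_name: str) -> list[str]:
    """Return IDs of examples whose notes and expected_behavior no longer match (sketch)."""
    client = Client()
    drifted = []
    for ex in client.list_examples(dataset_name=dataset_name):
        notes = ((ex.outputs or {}).get("notes") or "").strip()
        expected = ((ex.metadata or {}).get("expected_behavior") or "").strip()
        if notes and notes != expected:
            drifted.append(str(ex.id))
    return drifted

if __name__ == "__main__":
    stale = find_drifted_rows("agent-grounding-evals")
    if stale:
        raise SystemExit(f"{len(stale)} rows drifted between notes and expected_behavior: {stale}")
```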
The second is what to do when the human note is actively misleading for the judge. A teammate writes "this should be a yes" as a note, meaning I, the human auditor, expect the agent to answer yes. Copied into expected_behavior, the judge reads it as the agent must answer "yes" and scores any nuanced answer as a failure. The format the human writes in is not the format the judge wants to read. So far this is rare enough — maybe two of the seventeen rows — that I'm hand-editing them. But it's the seam where the "duplicate, don't migrate" rule starts to leak, and where the team will eventually need a small lint that flags ambiguous notes for rewrite.
If you take one thing
Your eval dataset has two readers — your team and your judge — and they read in different formats. The instinct to collapse them into one canonical schema is the right instinct for production data and the wrong one here. Let humans write where humans write. Let the judge read where the judge reads. Sync forward, in one direction, on a script that runs cheap and often. The metadata is for the machine; the notes are for the next teammate who opens the row at 6pm on a Friday.
And treat each row as a regression test — with a permalink back to the Slack thread that started it. The dataset isn't a static benchmark; it's a living suite that grows one production incident at a time. The discipline that decides whether you have an eval suite or a museum of past intentions is writing the row when the bug lands, while the Slack link is still warm.
Before you trust an eval dashboard, check the skip rate. A judge that silently skips 63% of the rows is not a judge — it's a hallway mirror, showing you only the rows you already knew how to grade.