A week after the LLM-as-judge went into our prompt iteration loop, I ran the first full experiment with it. Forty-nine examples, two repetitions each, judge enabled. I opened the LangSmith UI expecting forty-nine judgements and got grounding_judge: — on more than half the rows, with a note in the feedback: "skipped (no notes/expected_behavior/evaluation_criteria and no failure_mode)."
Thirty-one of forty-nine. The judge had nothing to judge against.
It wasn't the judge code. The judge reads example.metadata.expected_behavior and example.metadata.failure_mode, both empty on most rows. But the examples weren't empty. Almost every skipped row had a notes field in its outputs with sentences like "They do have manufacturing locations in Canada" or "Make sure it's using data about the right company". Human-written, useful, and in the wrong place for the machine.
The two readers of an eval dataset
Eval frameworks treat the dataset as programmatic input: inputs, outputs, metadata, evaluators. In practice it's also a shared engineering artefact. The product manager pastes a failing case into it. The on-call engineer adds the run that broke yesterday. Someone auditing the agent leaves "this should be a yes, the company is clearly headquartered in northern Italy." That prose lives where it's easiest to write: a freeform notes string in the example's outputs. For a year that worked fine, because we were the only readers.
The judge is a second reader, and it can't see prose unless you tell it exactly where to look. Judges expect structured criteria — expected_behavior, failure_mode, evaluation_criteria — strings a prompt template can interpolate. So the team writes truth in prose while the judge reads truth from metadata fields that don't exist yet. And because the judge silently skips rows it can't evaluate, you don't notice for weeks: the dashboard shows a clean score on the rows it could grade, and you assume the rest passed.
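A minimal sketch of the check I should have run before trusting the dashboard. The skip condition is inferred from the skip message quoted earlier, not copied from the judge code, and the names would_skip and skip_rate are hypothetical helpers, not part of any library.

```python
from typing import Any, Mapping, Sequence

# Fields the judge can grade against, inferred from the skip message
# "no notes/expected_behavior/evaluation_criteria and no failure_mode".
CRITERIA_FIELDS = ("notes", "expected_behavior", "evaluation_criteria")


def _filled(value: Any) -> bool:
    """True when a metadata value holds a non-blank string."""
    return bool(str(value or "").strip())


def would_skip(metadata: Mapping[str, Any]) -> bool:
    """True when the metadata gives the judge nothing to grade against."""
    has_criteria = any(_filled(metadata.get(f)) for f in CRITERIA_FIELDS)
    return not (has_criteria or _filled(metadata.get("failure_mode")))


def skip_rate(all_metadata: Sequence[Mapping[str, Any]]) -> float:
    """Fraction of rows the judge would silently skip."""
    if not all_metadata:
        return 0.0
    return sum(would_skip(m) for m in all_metadata) / len(all_metadata)
```

Fed the metadata of all forty-nine examples, a function like this would have reported the thirty-one empty rows as a skip rate of about 0.63 before the experiment ran.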
Why I didn't migrate the notes
My first reflex was the tidy one: move the prose from outputs.notes into metadata.expected_behavior, drop the old field, one source of truth.
That would have broken the human reader. When a teammate opens an example in LangSmith they look at the Example tab. Metadata is two clicks away, rendered as a key/value table built for short strings, not paragraphs. Move the notes there and the prose disappears from the place people scan. And the next time someone adds a case they'll write in notes anyway, because that's where they've always written — then there's prose in both places, half-overlapping, drifting.
"Single source of truth" is the right rule for production data, which diverges under load. A fifty-row dataset edited by hand by four people can carry deliberate duplication, as long as it only flows in one direction.
Duplicate forward
- outputs.notes stays untouched. The team keeps writing there, the UI keeps rendering it.
- Every example with a non-empty notes gets the same string copied into metadata.expected_behavior. The judge reads from there.
- Examples with neither get metadata.failure_mode: "grounding" as a default, enough for the judge to apply the system's generic grounding rules.
- The judge only reads metadata. It never inspects outputs.notes. Humans write prose; a script reflects it forward.
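The rules above can be sketched as a pure function that computes the metadata patch for one example. This is a sketch under assumptions, not the script I ran: forward_patch is a hypothetical name, the choice to leave an existing expected_behavior or failure_mode alone is my assumption, and writing the patch back to the dataset (through the LangSmith client) is left out.

```python
from typing import Any, Dict, Mapping


def forward_patch(outputs: Mapping[str, Any], metadata: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the metadata keys to add for one example; outputs is never modified."""
    patch: Dict[str, Any] = {}
    notes = str(outputs.get("notes") or "").strip()
    has_expected = bool(str(metadata.get("expected_behavior") or "").strip())
    has_mode = bool(str(metadata.get("failure_mode") or "").strip())

    if notes and not has_expected:
        # Copy the human prose forward so the judge can read it.
        patch["expected_behavior"] = notes
    elif not notes and not has_expected and not has_mode:
        # Nothing written anywhere: fall back to the generic grounding default.
        patch["failure_mode"] = "grounding"
    # Assumption: existing expected_behavior or failure_mode values are kept as-is.
    return patch
```

Returning a patch rather than mutating the example keeps the one-way direction explicit: the function can only add to metadata, and it has no way to write into outputs.notes.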
Three minutes of running an update script, and the next experiment had a score on all forty-nine rows.
The metadata schema settled on five fields, two of them load-bearing for the judge:
{
  "failure_mode": "grounding",  // judge reads
  "expected_behavior": "Agent should fetch X first...",  // judge reads
  "original_run": "sc-adc05166",  // human trace-back
  "bug_reference": "https://.../slack-thread",  // human trace-back
  "dataset_split": "base"  // team filtering
}
The failure-mode default is the interesting choice. "grounding" on a row with no human criteria tells the judge to apply generic anti-hallucination checks rather than a case-specific expectation. That's weaker than a written expected_behavior, but a row graded against weak criteria still surfaces regressions when the agent starts hallucinating. A skipped row surfaces nothing.
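One way to picture that fallback, as a sketch rather than the actual judge code: a written expected_behavior wins, a failure_mode alone selects a generic rule, and only a row with neither is skipped. The rule wording and the name criteria_for are hypothetical; the real generic rules live in the judge prompt.

```python
from typing import Any, Mapping, Optional

# Hypothetical wording, for illustration only.
GENERIC_RULES = {
    "grounding": "Every factual claim must be supported by retrieved data.",
    "tool-selection": "The agent must call a suitable tool before answering.",
}


def criteria_for(metadata: Mapping[str, Any]) -> Optional[str]:
    """Pick the text the judge grades against, or None if the row must be skipped."""
    expected = str(metadata.get("expected_behavior") or "").strip()
    if expected:
        return expected  # case-specific expectation wins
    mode = str(metadata.get("failure_mode") or "").strip()
    return GENERIC_RULES.get(mode)  # weaker criteria, but still a score
```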
What this changes in the team workflow
The instruction I posted in the engineering channel is two sentences:
When you add a new example, write the human note in outputs.notes exactly like you've been doing. Then copy the same string into metadata.expected_behavior. If you only have time to do one thing, fill in metadata.failure_mode with "grounding" or "tool-selection"; that's enough to keep the row scored.
It's deliberately not "always write structured criteria". Nobody writes structured criteria while triaging a failing case at the end of the day; they write a sentence in the box that accepts a sentence. So the rule asks for the sentence first, the copy second, and the failure-mode tag as the minimum that keeps a row scored. A junior engineer can follow it without a training session, and the expensive parts of eval discipline can wait for the rows that get re-examined when a regression hits.
What I still don't have a clean answer for
Sync direction. The script is one-shot. If a teammate edits notes next week, the metadata silently goes stale. A pre-commit hook or a CI check on drift is the obvious fix; at fifty rows and four contributors I'd rather see drift actually happen before building the machinery.
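If drift does show up, the check itself is small. A sketch, assuming each example is available as a plain dict with id, outputs and metadata keys (the shape is my assumption, and find_drift is a hypothetical name):

```python
from typing import Any, Iterable, List, Mapping


def find_drift(examples: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return ids of examples whose notes no longer match expected_behavior."""
    drifted: List[str] = []
    for ex in examples:
        notes = str((ex.get("outputs") or {}).get("notes") or "").strip()
        expected = str((ex.get("metadata") or {}).get("expected_behavior") or "").strip()
        # Only rows with notes can drift; rows without notes rely on failure_mode.
        if notes and notes != expected:
            drifted.append(str(ex.get("id")))
    return drifted
```

A CI job could fail when the returned list is non-empty. Rows whose expected_behavior was deliberately hand-edited would also be reported, so a real check would need an allowlist for them.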
Notes that mislead the judge. A teammate writes "this should be a yes", meaning that they, as the auditor, expect a yes. Copied into expected_behavior, the same sentence reads to the judge as an instruction that the agent must answer "yes", so it fails any nuanced answer. Two of seventeen rows so far; I'm hand-editing them. This is the seam where duplicate-don't-migrate leaks, and where a small lint flagging ambiguous notes will eventually be needed.
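A first version of that lint could be a single regular expression. This is a sketch with a hypothetical pattern that only catches the verdict-style phrasing quoted above; it will miss other ambiguous notes and is not something I have run against the dataset.

```python
import re

# Hypothetical pattern: notes that state a verdict ("should be a yes/no")
# instead of describing what the agent should do.
VERDICT_PATTERN = re.compile(r"\bshould\s+be\s+an?\s+(?:yes|no)\b", re.IGNORECASE)


def is_ambiguous(note: str) -> bool:
    """True when a note reads like an auditor's verdict rather than a criterion."""
    return bool(VERDICT_PATTERN.search(note))
```

Flagged notes would still need a human rewrite; the lint only points at the rows worth reading before they are copied into expected_behavior.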
If you take one thing
Let humans write where humans write, let the judge read where the judge reads, and sync forward with a cheap script. And before you trust an eval dashboard, check the skip rate — ours was 63% and the dashboard looked fine.