A week after we shipped the LLM-as-judge into our prompt iteration loop, I ran the first full experiment with it. Forty-nine examples, two repetitions each, judge enabled. The run finished. I opened the LangSmith UI expecting to read forty-nine judgements. Instead, more than half the rows showed grounding_judge: —. No score. No comment. A small note in the feedback: "skipped (no notes/expected_behavior/evaluation_criteria and no failure_mode)."
Thirty-one of forty-nine. The judge had nothing to judge against.
The reflex is to blame the judge code — surely the criteria-extraction is too strict. It wasn't. The judge was reading example.metadata.expected_behavior and example.metadata.failure_mode, both empty on most rows. But the examples were not without information. Almost every one of those skipped rows had a field called notes in its outputs, with sentences like "They do have manufacturing locations in Canada", "The CEO is Markus Halvorsen", "Make sure it's using data about the right company". Human-written, useful, in the wrong place for the machine.
This is a small operational story but it's the one the eval-tutorials skip, and it's the one I want to write down. An evaluation dataset has two readers — the human teammate browsing it and the LLM judge consuming it — and they need different things from the same row. If you only design for one, the other goes silent.
The two readers of an eval dataset
Most eval frameworks I've read about treat the dataset as a programmatic input. Inputs, outputs, metadata, evaluators. Clean. And in the demos that's all there is — someone seeds twenty examples, the judge reads them, the scores come out.
In practice, the dataset is also a shared engineering artefact. The product manager pastes a failing case into it. The on-call engineer adds the run that broke yesterday. The intern auditing the agent leaves a note that says "this should be a yes, the company is clearly headquartered in northern Italy." The dataset is, before anything else, a Slack channel for the team's collective memory of how the agent fails — written in prose, indexed by example.
That prose lives where it's easiest to write it: a freeform notes string in the example's outputs. It's the right place for a human. You open the example, you see your context, you remember what this case was about. For a year that worked fine — we were the only readers.
The moment the LLM-as-judge arrives, the dataset has a second reader who can't see prose unless you tell it exactly where to look. And no judge implementation I've seen, including ours, defaults to reading freeform notes. The judges expect structured criteria: expected_behavior, failure_mode, evaluation_criteria. Strings the prompt template can interpolate without ambiguity.
So you end up with this asymmetry. The team writes truth in prose, in the field that's natural to write prose in. The judge reads truth from structured metadata, in fields that don't exist yet. The two never meet. And because the judge silently skips rows it can't evaluate, you don't notice for weeks — your dashboard shows a clean score on the rows it could grade, and you assume the rest passed.
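To make the failure mode concrete, here is roughly what that criteria-extraction looks like, sketched as a LangSmith-style custom evaluator. The names are illustrative rather than our exact implementation; the part that matters is the skip path, which returns no score at all and lets the experiment carry on as if nothing happened.

```python
from langsmith.schemas import Example, Run

def grounding_judge(run: Run, example: Example) -> dict:
    """LLM-as-judge evaluator (sketch): only structured metadata counts as criteria."""
    meta = example.metadata or {}
    has_criteria = any(
        meta.get(k) for k in ("expected_behavior", "evaluation_criteria", "failure_mode")
    )
    if not has_criteria:
        # The silent failure: no score, just a comment buried in the feedback.
        return {
            "key": "grounding_judge",
            "comment": "skipped (no notes/expected_behavior/evaluation_criteria and no failure_mode)",
        }
    # From here: interpolate the criteria and the run's answer into the judge prompt,
    # call the grading model, and return {"key": "grounding_judge", "score": ..., "comment": ...}.
    raise NotImplementedError("judge prompt and LLM call elided in this sketch")
```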
The instinct that makes things worse
When I first saw the skipped rows, my reflex was the obvious one: migrate the notes into metadata. Move the prose from outputs.notes into metadata.expected_behavior, drop the old field, done. One source of truth.
That instinct was wrong, and I want to be specific about why, because the wrong move here is the one a tidy engineer reaches for first.
The notes are not redundant context the team would happily lose. They are how the team reads the dataset. When a teammate opens an example in the LangSmith UI, they look at the Example tab first. The metadata tab is two clicks away and rendered as a key/value table that's optimised for short strings, not paragraphs. If I move the notes there, three things happen at once:
- The reviewer's workflow breaks — the prose they were scanning has vanished from the place they were scanning.
- The metadata view, which is currently a clean four-field strip, fills up with prose and loses its structural function.
- Worse: the moment a teammate next adds a case, they go back to writing in notes, because that's where they always wrote. Now the dataset has prose in both places, half-overlapping, drifting.
The reason this is worth a paragraph is that "single source of truth" is the right rule for production data, and the wrong rule for a shared engineering artefact with two readers that have different access patterns. Pretending the two readers want the same shape forces one of them to lose. Production code can't have two truths because they will diverge under load. A small eval dataset, edited by hand by a small team, can — if the duplication is deliberate, asymmetric, and one direction.
The rule we settled on
What I did instead was duplicate, deliberately, in one direction:
- The outputs.notes field stays exactly where it is. Untouched. The team keeps writing there, the LangSmith UI keeps rendering it, the existing workflow is preserved.
- For every example that has a non-empty notes, the same string is copied into metadata.expected_behavior. The judge reads from there.
- For every example that has neither, metadata.failure_mode gets a default value ("grounding", in our case — the broadest of our two failure types). This gives the judge enough criteria to evaluate the row against the system's default grounding rules, even when nobody has hand-written a specific expectation.
- The judge code only reads metadata. Period. It never inspects outputs.notes. The duplication is one-way: human writes prose, the prose gets reflected into metadata for the judge.
That's it. Forty-nine rows, three minutes of running an update script, and the next experiment had a score on every row.
The reason this works is that the duplication is directional and bounded. The team has one place to write (notes) and the judge has one place to read (metadata). The sync runs forward only — there's no metadata edit that needs to propagate back to notes, because nothing in the loop edits metadata by hand. Eventually that sync becomes a hook on dataset write, or a CI check on the dataset diff. For now it's a script, and that's fine.
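For the record, the script is not much more than the sketch below. It uses the standard LangSmith client calls; the dataset name is a stand-in and the default failure mode is ours.

```python
from langsmith import Client

client = Client()
DATASET = "agent-grounding-evals"  # stand-in name

for ex in client.list_examples(dataset_name=DATASET):
    meta = dict(ex.metadata or {})
    notes = ((ex.outputs or {}).get("notes") or "").strip()

    if notes and not meta.get("expected_behavior"):
        # Forward sync: mirror the human prose into the field the judge reads.
        meta["expected_behavior"] = notes
    if not meta.get("expected_behavior") and not meta.get("failure_mode"):
        # No criteria at all: fall back to the broadest failure mode so the
        # judge grades against the generic grounding rules instead of skipping.
        meta["failure_mode"] = "grounding"

    if meta != (ex.metadata or {}):
        client.update_example(example_id=ex.id, metadata=meta)
```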
The metadata schema we ended up standardizing has five fields, two of them load-bearing for the judge:
```jsonc
{
  "failure_mode": "grounding",                           // judge reads
  "expected_behavior": "Agent should fetch X first...",  // judge reads
  "original_run": "sc-adc05166",                         // human trace-back
  "bug_reference": "https://.../slack-thread",           // human trace-back
  "dataset_split": "base"                                // team filtering
}
```
The interesting choice is the failure-mode default. Falling back to "grounding" on rows with no human criteria isn't a perfect signal — it tells the judge to apply the generic anti-hallucination criteria, not a case-specific expectation. That's strictly weaker than a well-written expected_behavior. But it's strictly better than —. A row that gets graded against weak criteria still surfaces regressions when the agent starts hallucinating; a row that's silently skipped doesn't surface anything.
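Mechanically, the fallback is a lookup from failure mode to a generic rubric, applied only when no hand-written expectation exists. The rubric wording below is illustrative, not our actual prompt text; the point is that a default exists at all.

```python
# Illustrative rubrics, not our actual prompt text.
DEFAULT_CRITERIA = {
    "grounding": (
        "Every factual claim in the answer must be supported by retrieved data "
        "or tool output; unsupported claims count as a failure."
    ),
    "tool-selection": (
        "The agent must call the appropriate tool before answering and must not "
        "answer from memory when a tool could have verified the fact."
    ),
}

def criteria_for(meta: dict) -> str | None:
    """Pick the judging criteria for one example's metadata (sketch)."""
    if meta.get("expected_behavior"):
        return meta["expected_behavior"]                   # case-specific, hand-written
    if meta.get("failure_mode"):
        return DEFAULT_CRITERIA.get(meta["failure_mode"])  # generic fallback
    return None                                            # nothing to judge against
```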
Each row is a regression test
The two fields I haven't justified yet — original_run and bug_reference — are the ones that matter most six months from now. They are the seam where this dataset stops being a static list of examples and starts behaving like a regression test suite.
The pattern is the same one any engineering team uses for production bugs, just translated into LLM-land.
When a defect lands in code, the mature loop is well-known: reproduce the failure, write a test that fails for the same reason, commit the fix, watch the test go green. The test stays in the suite forever, and its value isn't that it asserts the current behaviour — it's that it asserts this specific bug doesn't come back. A year later, when someone refactors that module and the test goes red, the failure message is a time machine. The PR that introduced the test, the commit that linked to the Jira ticket, the original bug reporter's description — all of that travels with it.
The LLM equivalent works almost exactly the same way. A customer reports a bad answer in Slack. Someone reproduces it. The reproduction goes into the eval dataset as a new example. The fix is a prompt change, a tool change, or a retrieval change. The next experiment runs and the example scores 1 instead of 0. From that point on, the example is permanent. Every future prompt change has to keep that score at 1 to ship.
What makes the analogy hold — and what most eval dataset designs leave out — is the trace back to why. In code, you have git blame plus the PR description plus the linked issue. In an eval dataset, by default, you have a row of inputs and outputs and no idea how it got there. Six months in, that's the same problem as inheriting a 4000-line test file with no comments. You can't tell which tests are load-bearing, which are obsolete, and which exist because of a specific customer escalation.
The two metadata fields are the answer:
- bug_reference is a permalink to the Slack thread, the issue tracker ticket, or the customer email that started the case. Opening the row tells you immediately whose problem this is, who reported it, what the severity was, and what context lives outside the dataset.
- original_run is the LangSmith trace ID of the actual run that misbehaved. Not a fresh reproduction — the original failing trace, with the original tool calls, the original messages, the original failure mode preserved. When a future regression looks similar but not identical, having the canonical broken trace next to the canonical fixed trace is what makes the diff legible.
Together they turn the dataset into a navigable changelog of how the agent has failed in production. A new engineer joining the team can open any row, click the Slack link, read the original frustration, look at the original trace, and understand why this case is in the suite. That's the same thing a comment on a regression test gives you in code — except the dataset format gives it for free, structurally, on every row, without anyone having to remember to write a comment.
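In practice, writing the row is one SDK call made while the incident is still open. A sketch, with the question and all identifiers as stand-ins (the notes sentence is one of ours):

```python
from langsmith import Client

client = Client()

# Sketch: turning yesterday's production failure into a permanent regression row.
# Dataset name, question, trace ID, and URL are stand-ins.
client.create_example(
    dataset_name="agent-grounding-evals",
    inputs={"question": "Does the company manufacture in Canada?"},
    outputs={"notes": "They do have manufacturing locations in Canada"},  # for the human reader
    metadata={
        "failure_mode": "grounding",
        "expected_behavior": "They do have manufacturing locations in Canada",  # for the judge
        "original_run": "sc-adc05166",                # trace ID of the run that misbehaved
        "bug_reference": "https://.../slack-thread",  # permalink to the report
        "dataset_split": "base",
    },
)
```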
There's one important asymmetry with code tests, and it's worth naming because it changes how you should think about coverage. A code test asserts a deterministic property: this function returns this value. An eval row asserts a probabilistic one: this agent, on this input, produces an answer that satisfies these criteria most of the time. Which is why the repetitions parameter matters — a single run on a flaky example is a single sample of a distribution, and treating it as a pass/fail signal is how teams convince themselves a prompt change worked when it didn't. Two repetitions is the floor. Five is more honest. The cost is linear and the alternative is debugging an imaginary regression for half a day.
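The knob itself is one argument on the evaluate call. A sketch against the LangSmith SDK (in older versions the import lives under langsmith.evaluation); the target and the names are stand-ins, and the judge is the one sketched earlier:

```python
from langsmith import evaluate

def my_agent(inputs: dict) -> dict:
    """Stand-in for the real agent; evaluate() calls this on each example's inputs."""
    return {"answer": "..."}

results = evaluate(
    my_agent,
    data="agent-grounding-evals",    # dataset name (stand-in)
    evaluators=[grounding_judge],    # the judge sketched earlier
    num_repetitions=2,               # two samples per example is the floor; five is more honest
    experiment_prefix="prompt-v12",  # stand-in label
)
```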
The reason I'm spending words on the bridge between Slack defects and eval rows is that this is the part the literature genuinely doesn't cover. Most eval framework docs assume the dataset is given. In practice the dataset is a living artefact, grown one production incident at a time, and the discipline of writing the row when the bug lands — not three weeks later when you have time, not as part of a quarterly cleanup, but right then, with the Slack link still warm — is the thing that decides whether you have an eval suite or a museum of past intentions.
What this changes in the team workflow
The visible payoff is the obvious one: the judge now grades every row. Eighteen graded rows became forty-nine. Prompt comparisons are real comparisons.
The less visible payoff is a small change in how the team writes new examples. The instruction we landed on, posted in the engineering channel, is three sentences:

> When you add a new example, write the human note in outputs.notes exactly like you've been doing. Then copy the same string into metadata.expected_behavior. If you only have time to do one thing, fill in metadata.failure_mode with "grounding" or "tool-selection" — that's enough to keep the row scored.
This is deliberately not a rule that says "always write structured criteria", because that rule is what other teams I've talked to imposed and then quietly stopped enforcing six weeks later. Nobody writes structured criteria when they're triaging a failing case at 6pm on a Friday — they write a sentence in the place the box accepts a sentence. The pragmatic version asks for the sentence first, the metadata second, and the failure-mode tag last. The first two are duplicates of each other, the third is a cheap escape hatch.
It's also a rule a junior engineer can follow without a training session. The expensive parts of eval discipline — choosing the right failure mode, writing tight evaluation criteria, debating whether the judge over- or under-weights a category — can come later, on the rows that get re-examined when a regression hits. The basic rows, the long tail, the ones that get written once and never looked at again, just need to be visible to the judge.
What I still don't have a clean answer for
Two things.
The first is the sync direction. Right now the script is one-shot — runs once, mirrors current notes into metadata, done. If a teammate edits notes next week, the metadata silently goes stale. The right answer is probably a pre-commit hook on the dataset, or a periodic CI job that fails the PR if a row's notes and metadata.expected_behavior drift. I haven't built it because I want to see whether drift actually happens at our scale (50 rows, four contributors) before adding the machinery. My guess is it won't, and the script-on-demand model is fine for the next quarter. My guess could be wrong.
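If drift does show up, the check is small enough that there is little excuse to skip it. A sketch of the CI version, using the same client calls as the sync script:

```python
from langsmith import Client

def find_drifted_rows(dataset_name: str) -> list[str]:
    """Return IDs of examples whose notes and expected_behavior no longer match (sketch)."""
    client = Client()
    drifted = []
    for ex in client.list_examples(dataset_name=dataset_name):
        notes = ((ex.outputs or {}).get("notes") or "").strip()
        expected = ((ex.metadata or {}).get("expected_behavior") or "").strip()
        if notes and notes != expected:
            drifted.append(str(ex.id))
    return drifted

if __name__ == "__main__":
    stale = find_drifted_rows("agent-grounding-evals")
    if stale:
        raise SystemExit(f"{len(stale)} rows drifted between notes and expected_behavior: {stale}")
```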
The second is what to do when the human note is actively misleading for the judge. A teammate writes "this should be a yes" as a note, meaning I, the human auditor, expect the agent to answer yes. Copied into expected_behavior, the judge reads it as the agent must answer "yes" and scores any nuanced answer as a failure. The format the human writes in is not the format the judge wants to read. So far this is rare enough — maybe two of the seventeen rows — that I'm hand-editing them. But it's the seam where the "duplicate, don't migrate" rule starts to leak, and where the team will eventually need a small lint that flags ambiguous notes for rewrite.
If you take one thing
Your eval dataset has two readers — your team and your judge — and they read in different formats. The instinct to collapse them into one canonical schema is the right instinct for production data and the wrong one here. Let humans write where humans write. Let the judge read where the judge reads. Sync forward, in one direction, on a script that runs cheap and often. The metadata is for the machine; the notes are for the next teammate who opens the row at 6pm on a Friday.
And treat each row as a regression test — with a permalink back to the Slack thread that started it. The dataset isn't a static benchmark; it's a living suite that grows one production incident at a time. The discipline that decides whether you have an eval suite or a museum of past intentions is writing the row when the bug lands, while the Slack link is still warm.
Before you trust an eval dashboard, check the skip rate. A judge that silently skips 63% of the rows is not a judge — it's a hallway mirror, showing you only the rows you already knew how to grade.