A few weeks ago our prompt iteration loop had a problem: every change to the agent's system prompt was a coin flip. I'd open an example in LangGraph Studio, see one set of tool calls, run the same example again, see a different set. Same input, same commit, different behaviour. GPT-4.1, our agent's model, is just non-deterministic enough on tool selection that a single run tells you nothing.
The team was doing what every team does at that stage. Edit a prompt in LangSmith. Run a representative question in Studio. Look at the trace. Run it again. Notice the trace is different. Argue in Slack about whether the change helped. Ship anyway because the deadline is real.
I built an eval setup to replace that workflow. Most of it is standard plumbing. The choice that turned out to matter is that the judge grading the agent comes from a different model family than the agent itself.
Two failure modes that need different evaluators
Building the dataset forced me to notice that our failures came in two shapes. Two real cases, with fictional companies:
Case A. Domain acmebrew.com, question: "Is this company owned by a group or corporation?" The agent has three tools: an internal knowledge source, web search, and a research tool. The right answer requires web search. Half the runs, the agent called the internal source and confidently answered "no parent company found." Wrong tool. Routing failure.
Case B. Domain nuova-mobility.io, question: "Does this company have a partnership with one of MegaCeram's competitors?" The agent picked the right tool, asked it a generic query, got one article back about a mobility partnership, and declared yes, labelling a payments company a competitor of a ceramics manufacturer because both, it reasoned, were "technology companies". Right tool, wrong reasoning. Grounding failure.
Type A failures are deterministic to check: the example carries an expected_tool in its metadata and the evaluator compares it against the run's tool_calls. No LLM needed. Type B failures can't be checked without reading the agent's reasoning, and that's where you need an LLM judge.
I tagged each example failure_mode: "tool-selection" or "grounding" and gave each its own evaluator. A prompt change can now improve routing without touching grounding, or the reverse, and the trade-off shows up as two columns instead of one aggregate number.
The judge can't share a family with the agent
My first judge was GPT-4.1-mini, because the key was already configured. Same provider as the agent, same training family, same instruction-following style. It worked, it produced scores, and I almost stopped there.
The problem is that models from the same family share blind spots. If the agent hallucinates that a payments company competes with a ceramics manufacturer because both are "technology companies", a mini model from the same family finds that reasoning plausible: same web text, same RLHF flavour. The judge approves what it would have produced itself. The self-preference effect is measured (judges score outputs from their own family higher than human raters do), and work on preference leakage covers the same-family case specifically; the broader survey is a good map of the bias landscape.
For tool selection none of this matters, since the comparison is deterministic. For grounding, the shared hallucinations are exactly the errors you bought a judge to catch.
So the agent runs on OpenAI and the grounding judge runs on something else. Two routes worked for me.
Anthropic API direct
The clean version: pip install anthropic, set ANTHROPIC_API_KEY, point the judge at Haiku. Different family, cents per experiment. Cost: one new dependency and one new key.
GitHub Models
If you'd rather skip the extra dependency, GitHub Models gives you an OpenAI-SDK-compatible endpoint at https://models.github.ai/inference with a PAT (models:read scope) and free-tier rate limits. One footgun: Claude appears in the GitHub Copilot settings UI, but Copilot is a separate product and Claude is not in the GitHub Models catalog. https://models.github.ai/catalog/models is the source of truth for what's callable. meta/llama-3.3-70b-instruct has been my default: different family from GPT-4.1, JSON mode, free tier I haven't exhausted in normal iteration.
Whichever route: put the provider behind an env var, log which judge graded each run, and don't mix providers within an experiment.
Three weeks later
Back to the coin flip. The old loop was: edit prompt, run an example in Studio, screenshot the trace, post it in Slack, someone else gets a different trace, argue for forty minutes, ship anyway. The new loop is one command:
make evaluate-compare ARGS="--a=:prev_commit --b=:new_commit --example-filter=ownership --repetitions=5 --judge"
Five runs per commit per example, both evaluators, results aggregated, LangSmith compare URL at the end. Five minutes of wall time and a few cents. The conversation stops being "I think it works" and becomes "the grounding judge went from 2/5 to 5/5 on the nuova-mobility case and tool selection didn't regress", which a reviewer can challenge on its own terms: wrong dataset, too few repetitions, biased judge prompt.
The setup also surfaced cases where what looked like a prompt bug was model flakiness. Same commit, five runs, three pass, two fail. The prompt change I was about to ship to "fix" one of those would have been chasing noise. You don't find that out running an example once.
Things I still don't have good answers for
- The judge prompt drifts too. I've tweaked it twice as new failure modes appeared. At some point comparing two agent commits will require pinning the judge commit, and I haven't seen that meta-evaluation problem written about well.
- Human calibration. You're supposed to spot-check a slice of judge verdicts manually. I don't, yet. The risk is the judge develops a blind spot I don't notice because I stopped looking.
- Cost. Forty-seven examples × five repetitions × two commits with the judge on is a five-dollar experiment today. At three hundred examples, someone will want it in CI, and the math changes.
You don't need a thousand examples or a framework. A dataset that grows by one example every time something breaks, a deterministic evaluator for what can be checked deterministically, a judge from a different family for what can't, and a one-line command to compare two commits.