A few weeks ago our prompt iteration loop had a small but maddening problem: every change to the agent's system prompt was a coin flip. I'd open an example in LangGraph Studio, see one set of tool calls, run the same example again, see a different set. Same input, same commit, different behaviour. GPT-4.1 — our agent's model — is just non-deterministic enough on tool selection that a single run tells you nothing.

The team was doing what every team does at that stage. Edit a prompt in LangSmith. Open Studio. Pick a representative question. Run it. Look at the trace. Run it again. Notice the trace is different. Argue in Slack about whether the change helped. Ship anyway because the deadline is real.

That works at the start. It stops working very quickly.

This post is about the eval setup I built to replace that workflow, and the one design choice that decides whether your numbers mean anything: the judge that grades the agent must come from a different model family than the agent itself. The rest is plumbing.

Two failure modes that need different evaluators

The first thing the setup forced me to articulate is that not every agent failure looks the same, and single-metric evaluation hides that.

Two examples, both from real cases, simplified with fictional companies:

Case A. Domain acmebrew.com, question: "Is this company owned by a group or corporation?" The agent has three tools available: an internal knowledge source, a web-search tool, and a research tool. The right answer requires the web-search tool. Half the runs the agent called the internal source and confidently answered "no parent company found." Wrong tool. Routing failure.

Case B. Domain nuova-mobility.io, question: "Does this company have a partnership with one of MegaCeram's competitors?" The agent picked the right tool. It then asked the web tool a generic query, got a single article back about a mobility partnership, and declared yes — claiming a competitor relationship that didn't exist (a payments company labelled a competitor of a ceramics manufacturer because both, the agent reasoned, were "technology companies"). Right tool. Wrong reasoning. Grounding failure.

These need different evaluators, and treating them the same is the mistake most "LLM eval" tutorials gloss over.

  • Type A failures are deterministic to check. The example carries an expected_tool in metadata. The evaluator compares it to the actual tool_calls from the run. Done. No LLM needed for the judge, no ambiguity in the score.
  • Type B failures can't be checked without reading the agent's reasoning. Infinite valid answers, infinite invalid ones, and the difference lives in how the agent got there. This is where you need an LLM judge — and where the judge's choice starts mattering.

Once I split the dataset along this axis (each example tagged failure_mode: "tool-selection" or "grounding"), the two evaluators became orthogonal. A prompt change can improve routing without affecting grounding, or vice versa, and the trade-off is visible in two separate columns instead of being buried in a single aggregate number.

The judge can't share a family with the agent

The default temptation, when you want a quick judge, is to reach for whatever model you already have set up. In my case that meant evaluating GPT-4.1's output with GPT-4.1-mini. Same provider, same training family, same tokenizer, same instruction-following style. It worked. It scored things. I almost stopped there.

I didn't, and the reason is worth spelling out because the bias literature on this is solid (and recent work on "preference leakage" shows it extends specifically to the same-family case I almost shipped) but rarely operationalized in real eval pipelines: models from the same family share blind spots. If the agent hallucinates that a payments company is a competitor of a ceramics manufacturer because they're both "technology", the mini model from the same family finds that reasoning plausible. It has been trained on the same web text and reinforced with the same RLHF tuning. The errors look like its own errors. The judge approves what it would have produced.

The documented self-preference effect — judges scoring outputs from their own family measurably higher than human raters do — is the well-known piece. The less-discussed piece, and the one that bites you on grounding evaluations specifically, is that shared hallucinations don't get flagged. A judge from the same family doesn't just over-score; it fails to surface the exact category of error you most need a judge for. For tool-selection this is a non-issue, since the comparison is deterministic. For grounding it's the whole game.

So the rule, for me, became simple: the agent runs on OpenAI, the grounding judge runs on something else. Two routes worked.

Route one: Anthropic API direct

The clean version. pip install anthropic, set ANTHROPIC_API_KEY, point the judge at Haiku. Different provider, different family, robust, cents per experiment. Only downside: new dependency, new key.

Route two: GitHub Models

If you'd rather skip the Anthropic dependency, GitHub Models is a viable free-tier alternative — OpenAI-SDK-compatible endpoint at https://models.github.ai/inference, PAT with models:read scope, daily rate limits on the free tier. One footgun: Claude shows up in the GitHub Copilot settings UI, but Copilot chat is a separate product and Claude is not in the GitHub Models public catalog. https://models.github.ai/catalog/models is the source of truth for what's actually callable. Pick the strongest non-OpenAI option in there — meta/llama-3.3-70b-instruct has been my default. Different family from GPT-4.1, supports JSON mode, free tier I haven't hit in normal iteration.

Whichever route, route the provider behind an env var, log which judge graded each run, and don't silently mix providers in the same experiment.

What changed on the next Friday afternoon

Back to the coin-flip problem, three weeks later. The conversation has shifted.

Before, the loop was: edit prompt → run example in Studio → screenshot the trace → post it in Slack → someone else runs the same example → different trace → argue for forty minutes about whether it was the prompt change or model noise → ship anyway.

After, the loop is a single command:

make evaluate-compare ARGS="--a=:prev_commit --b=:new_commit --example-filter=ownership --repetitions=5 --judge"

Five runs per commit per example, both evaluators active, results aggregated, LangSmith compare URL printed at the end. Five minutes of wall time, a few cents in API calls, a number both engineers can point at and a side-by-side trace view to drill into the disagreements.

The shift is small but real. The conversation stops being "I think it works" and starts being "the grounding judge went from 2/5 to 5/5 on the nuova-mobility case and tool selection didn't regress." That's a sentence you can put in a PR description. The reviewer can challenge it on its terms — wrong dataset, too few reps, judge prompt has a bias — instead of arguing about screenshots.

Worth saying: the eval setup also surfaced cases where what looked like a prompt bug was actually model flakiness. Same commit, five runs, three pass and two fail. The prompt change I was about to ship to "fix" it would have been chasing noise. It's a humbling thing to discover, and it's the kind of thing you don't notice when you only run an example once.

Things I still don't have good answers for

This setup is two months old. Some things I'm watching:

  • The judge prompt drifts too. I've already tweaked it twice as new failure modes appeared. Versioning judge prompts the same way I version agent prompts is the obvious move, but it adds a layer of A/B that nobody's currently doing. At some point comparing two agent commits will require pinning the judge commit, which is a meta-evaluation problem I haven't seen written about well.
  • Human calibration cadence. The literature says you should spot-check 5–10% of judge verdicts manually. I don't, yet. The risk is the judge develops a blind spot I don't notice because I stopped looking.
  • Cost grows non-linearly with dataset size. Forty-seven examples × five repetitions × two prompt commits with judge on is currently a five-dollar experiment. At three hundred examples it's a thirty-dollar experiment, and someone will eventually want to wire it into CI. That changes the math.

If you take one thing

You don't need a thousand examples or a sophisticated framework. You need four pieces:

  1. A dataset that grows by one example every time something breaks in production.
  2. A deterministic evaluator for the failures that can be checked deterministically.
  3. An LLM judge for the failures that can't — from a model family different from your agent's, with a rubric that lives on the example, and repeated runs to wash out noise.
  4. A one-line command to compare two commits.

Everything else is refinement. The team you're saving time for is your team a month from now, when someone proposes a prompt change at 5pm on a Friday and you have ninety seconds to decide whether it actually improves anything.