A few weeks ago our prompt iteration loop had a small but persistent problem: every change to the agent's system prompt was a coin flip. Two of us, looking at the same example in LangGraph Studio, would see different tool calls on different runs. Same input, same commit, different behaviour. The model — GPT-4.1 in our case — is just non-deterministic enough on tool selection that a single run tells you nothing.

We were doing what most teams do at that stage. Edit the prompt in LangSmith. Open Studio. Pick a representative question. Run it. Look at the trace. Run it again. Notice the trace is different. Argue about whether the change helped. Move on to the next question.

That works at the start. It stops working very quickly.

This post is about the eval setup we built to replace that workflow, and the one design decision that ended up mattering more than the rest: the judge that grades the agent must come from a different model family than the agent itself. The rest is plumbing, but that one piece changes whether you can trust the numbers.

The two failure modes that need different evaluators

The first thing the setup forced us to articulate was that not every agent failure looks the same, and that single-metric evaluation hides this fact.

Take two real-ish examples. Both came up in our work, simplified here with fictional companies:

Case A. Domain acmebrew.com, question: "Is this company owned by a group or corporation?" The agent has three tools available: an internal knowledge source, a web-search tool, and a research tool. The right answer requires the web-search tool. In half the runs, the agent called the internal source instead and confidently answered "no parent company found." That's a routing failure. The tool selected was wrong.

Case B. Domain nuova-mobility.io, question: "Does this company have a partnership with one of MegaCeram's competitors?" The agent picked the right tool. It then asked the web tool a generic query, got a single article back about a mobility partnership, and declared yes — claiming a competitor relationship that didn't exist (a payments company being labelled a competitor of a ceramics manufacturer because both, the agent reasoned, were "technology companies"). That's a grounding failure. The tool was right. The reasoning hallucinated.

These need different evaluators, and treating them the same is a mistake worth flagging because most "LLM eval" tutorials gloss over it.

  • Type A failures are deterministic to check. The example has an expected_tool in metadata. The evaluator compares it to the actual tool_calls from the run. Done. No LLM needed for the judge, no ambiguity in the score. Cheap and reliable. (A sketch follows this list.)
  • Type B failures cannot be checked without reading the agent's reasoning. There are infinite valid answers and infinite invalid ones, and the difference often lives in how the agent got there. This is where you need a judge — and where the judge's choice matters.
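For concreteness, here is roughly what the Type A check looks like as a LangSmith custom evaluator: a plain function that receives the run and the example and returns a score dict. The expected_tool metadata key is ours; the way tool calls are dug out of run.outputs assumes our LangGraph-style message list, so treat that traversal as an assumption and adapt it to your own output shape.

def tool_selection_evaluator(run, example):
    expected = (example.metadata or {}).get("expected_tool")
    if expected is None:
        # Not a tool-selection example; skip rather than score.
        return {"key": "tool_selection", "score": None, "comment": "no expected_tool tag"}
    # Collect every tool the agent actually called. This assumes a
    # LangGraph-style messages list in run.outputs; yours may differ.
    called = {
        tc["name"]
        for msg in (run.outputs or {}).get("messages", [])
        for tc in (getattr(msg, "tool_calls", None) or [])
    }
    return {
        "key": "tool_selection",
        "score": int(expected in called),
        "comment": f"expected={expected}, called={sorted(called)}",
    }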

Once we split the dataset along this axis (we tag examples with failure_mode: "tool-selection" or "grounding"), the two evaluators became orthogonal. A prompt change can improve routing without affecting grounding, or vice versa, and we can see the trade-off in two separate columns instead of arguing about a single aggregate number.

The judge can't be from the same family as the agent

The default temptation, when you want a quick judge, is to reach for whatever model you already have set up. In our case that meant evaluating GPT-4.1's output with GPT-4.1-mini. Same provider, same training family, same tokenizer, same instruction-following style. It worked. It scored things. We almost stopped there.

We didn't, and the reason is worth spelling out. The bias literature on this is solid (recent work on "preference leakage" shows the effect extends specifically to the same-family case this section is about), yet it is rarely operationalized in real eval pipelines: models from the same family share blind spots. If the agent hallucinates that a payments company is a competitor of a ceramics manufacturer because they're both "technology", the mini model from the same family finds that reasoning plausible. It has been trained on the same web text and reinforced with the same RLHF tuning. The errors look like its own errors. The judge approves what it would have produced.

The documented self-preference effect — judges scoring outputs from their own family measurably higher than human raters do — is the well-known piece of this. The less-discussed piece, and the one that bites you on grounding evaluations specifically, is that shared hallucinations don't get caught. A judge from the same family doesn't just over-score; it fails to flag the exact category of error you most need a judge for. For tool-selection that doesn't matter — the comparison is deterministic. For grounding it matters a lot.

So the rule, for us, became simple: the agent runs on OpenAI, the grounding judge runs on something else. We tried two routes.

Route one: Anthropic API direct

The clean version. pip install anthropic, set ANTHROPIC_API_KEY, point the judge at Haiku. Different provider, different family, robust, costs cents per experiment. The only downside is the new dependency and the new key.
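For reference, the whole judge call, minus the rubric assembly, is a handful of lines. The model alias is the one current as we write (check Anthropic's model list if it has moved), and judge_prompt is a stand-in for our assembled rubric plus the agent's answer.

import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-haiku-4-5",  # current Haiku alias at the time of writing
    max_tokens=512,
    messages=[{"role": "user", "content": judge_prompt}],  # hypothetical: rubric + answer
)
verdict = response.content[0].text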

Route two: GitHub Models, and a trap I want to warn you about

If you have a GitHub account you already have access to GitHub Models, which exposes an OpenAI-compatible inference endpoint at https://models.github.ai/inference. You authenticate with a PAT carrying the models:read scope, you point your existing OpenAI SDK at a different base URL, and you get a free tier (with daily rate limits). No new dependency. That's appealing.
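Wiring it up really is minimal: the PAT goes where the OpenAI key normally would, and everything else is your existing code.

import os
from openai import OpenAI

# Same SDK, different base URL; the PAT needs the models:read scope.
client = OpenAI(
    base_url="https://models.github.ai/inference",
    api_key=os.environ["GITHUB_TOKEN"],
)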

Here's the trap. Look at the GitHub Copilot models settings UI and you'll see Claude Haiku, Claude Sonnet, Claude Opus, all listed as available. So you write model="anthropic/claude-haiku-4-5" in your script, run it, and get back:

NotFoundError: Error code: 404 - {'error': {'code': 'unknown_model', 'message': 'Unknown model: anthropic/claude-haiku-4-5'}}

The Claude models in the Copilot settings UI are served by a different GitHub product — the Copilot chat endpoint — that is not a public API. They are not in the GitHub Models public catalog. If you query https://models.github.ai/catalog/models directly you'll see only OpenAI, AI21, Cohere, DeepSeek, Meta, Mistral, xAI, and Microsoft families. No Anthropic. No Google.
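You can check the catalog yourself in a couple of lines. We're assuming here that the response shape (a JSON array of models with publisher-prefixed ids) matches what the endpoint returned to us:

import os, requests

resp = requests.get(
    "https://models.github.ai/catalog/models",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
print(sorted({m["id"].split("/")[0] for m in resp.json()}))  # no 'anthropic' in the list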

This took us an embarrassing hour to figure out. I'm writing it down so it costs you ten minutes instead.

The pragmatic answer, if you want to stay on GitHub Models: pick the strongest non-OpenAI model that's actually in the catalog. We landed on meta/llama-3.3-70b-instruct. It's a different family from GPT-4.1, it supports response_format={"type": "json_object"} so the judge's structured output works without surgery, and it has a free tier we haven't hit the limits of for normal iteration use.
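With the client pointed at GitHub Models as above, the judge call is an ordinary chat completion. The system prompt here is a compressed stand-in for our actual rubric; the response_format line is the part that matters.

import json

completion = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    response_format={"type": "json_object"},  # forces a parseable verdict
    messages=[
        {"role": "system", "content": 'Grade the answer for grounding. Reply as JSON: {"score": 0 or 1, "reason": "..."}'},
        {"role": "user", "content": judge_prompt},  # hypothetical: rubric + agent answer
    ],
)
verdict = json.loads(completion.choices[0].message.content)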

The script picks the provider automatically: if GITHUB_TOKEN is set, it routes the judge to GitHub Models with Llama 3.3 70B. If the call fails — rate limit, auth, model unavailable — it falls back to OpenAI's gpt-4.1-mini for that single run, logs the fallback visibly, and records which provider judged each run in the feedback comment so a mixed-provider experiment doesn't quietly mislead you.
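A sketch of that routing, with run_judge standing in for the call shown above and both clients assumed to be constructed already; all three names are ours, not any SDK's.

import os

def judge_with_fallback(judge_prompt):
    # Prefer the cross-family judge whenever a token is available.
    if os.environ.get("GITHUB_TOKEN"):
        try:
            verdict = run_judge(github_client, "meta/llama-3.3-70b-instruct", judge_prompt)
            verdict["judge_provider"] = "github-models/llama-3.3-70b-instruct"
            return verdict
        except Exception as exc:  # rate limit, auth, model unavailable
            print(f"[judge] GitHub Models failed ({exc}); falling back to gpt-4.1-mini")
    # Same-family fallback: still usable, but recorded so a mixed-provider
    # experiment is visible in the feedback comments, not silently blended.
    verdict = run_judge(openai_client, "gpt-4.1-mini", judge_prompt)
    verdict["judge_provider"] = "openai/gpt-4.1-mini"
    return verdict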

What this actually saves the team

The motivation was never sophistication. It was time. Specifically, the time the team was spending arguing about prompt changes that no one could confidently evaluate.

Before:

  1. Engineer edits prompt in LangSmith, creates a new commit.
  2. Engineer opens LangGraph Studio, runs a representative example, looks at the trace.
  3. Posts a screenshot in Slack: "I think this fixes the acmebrew case."
  4. Someone else runs the same example, gets a different trace, posts another screenshot.
  5. They spend forty minutes deciding whether the difference is the prompt change or model noise.
  6. The prompt change ships anyway because the deadline is real.

After:

make evaluate-compare ARGS="--a=:prev_commit --b=:new_commit --example-filter=ownership --repetitions=5 --judge"

Five runs per commit per example, both evaluators active, results aggregated, LangSmith compare URL printed at the end. Five minutes of wall time, three or four cents in API calls, a number both engineers can point at and a side-by-side trace view to drill into the disagreements.
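Under the make target sits a thin wrapper over the LangSmith SDK. Stripped of argument parsing, each commit's run is roughly the following; the dataset name and target function are ours (and hypothetical here), while evaluate, list_examples, and num_repetitions are the SDK's own.

from langsmith import Client, evaluate

client = Client()
examples = client.list_examples(
    dataset_name="company-info-evals",  # hypothetical name
    splits=["ownership"],               # one way to scope the run, akin to --example-filter
)
results = evaluate(
    run_agent_at_commit,  # hypothetical: invokes the agent pinned to one prompt commit
    data=examples,
    evaluators=[tool_selection_evaluator, grounding_judge],  # judge included when --judge is set
    num_repetitions=5,    # maps to --repetitions
)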

The shift is small but real. The conversation stops being "I think it works" and starts being "the grounding judge went from 2/5 to 5/5 on the nuova-mobility case and tool selection didn't regress." That's a sentence you can put in a PR description. The reviewer can challenge it on its terms — wrong dataset, too few reps, judge prompt has a bias — instead of arguing about screenshots.

Worth saying: the eval setup also surfaced cases where what we thought was a prompt bug was actually model flakiness. Same commit, five runs, three pass and two fail. That tells you the prompt change you were about to ship to "fix" it would have been chasing noise. It's a humbling thing to discover, and it's the kind of thing you don't notice when you only run an example once.

Growing the coverage without growing the pain

Once the harness works, the dataset itself becomes the bottleneck. We started with twenty examples, mostly Type A, mostly hand-picked from existing bug reports. The question that matters then is how you grow coverage without the dataset becoming a maintenance burden.

A few choices that have aged well so far:

  • Every reported bug becomes an example. When a stakeholder reports a hallucination, the first action is adding the failing input to the eval dataset with a failure_mode tag and, for Type B cases, a one-sentence notes field that the judge will read as evaluation criteria. The bug fix gets validated against a real regression test that lives forever. (A sketch of this follows the list.)
  • Criteria live next to the example, not in code. The judge reads its rubric from the example's metadata — notes, expected behavior, evaluation criteria — concatenated together. Adding a new evaluation dimension means editing one example in the LangSmith UI, not writing Python. Non-engineering teammates (product, the actual subject matter experts on the company-info domain) can author these.
  • Splits replace flag soup. Tagging examples (base, tool-selection, grounding, websearch-hallucination, dev, test) lets you run a subset during iteration and the full set before promoting a prompt. The same dataset serves both purposes without forking.
  • Five repetitions is the floor for grounding evals. One run is noise. Three runs is suggestive. Five is the smallest number where a 2/5 to 5/5 delta starts to feel like signal. We pay the latency for it; it's still under a minute end-to-end for a single example.
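The bug-to-example loop from the first bullet is one SDK call. The dataset name is hypothetical, and the split argument is how recent langsmith versions expose the tagging described above; double-check against the version you're on.

from langsmith import Client

client = Client()
client.create_example(
    dataset_name="company-info-evals",  # hypothetical name
    inputs={
        "domain": "nuova-mobility.io",
        "question": "Does this company have a partnership with one of MegaCeram's competitors?",
    },
    metadata={
        "failure_mode": "grounding",
        "notes": "Only count a partner that actually competes with MegaCeram in ceramics.",
    },
    split=["grounding", "websearch-hallucination"],
)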

Things I still don't have good answers for

This setup is two months old. Some things I'm watching:

  • The judge prompt drifts too. We've already tweaked it twice as new failure modes appeared. Versioning judge prompts the same way we version agent prompts is the obvious move, but it adds a layer of A/B that nobody's currently doing. At some point comparing two agent commits will require pinning the judge commit, which is a meta-evaluation problem I haven't seen written about well.
  • Human calibration cadence. The literature says you should spot-check 5-10% of judge verdicts manually. We don't, yet. The risk is the judge develops a blind spot we don't notice because we stopped looking.
  • Cost grows linearly with dataset size, and linear adds up. Forty-seven examples × five repetitions × two prompt commits with the judge on is currently a five-dollar experiment. At three hundred examples it's a thirty-dollar experiment, and someone will eventually want to wire it into CI. That changes the math.

None of these are blockers. They're the kind of problem you only get once the first version works. Which, for a setup that started as a way to stop arguing about screenshots, feels like progress.

If you take one thing

The shape of the work is more important than the tooling. The shift from "I'll just try it in the playground" to "I'll run a small experiment and we'll compare" is a discipline that pays back the day after you adopt it. You don't need a thousand examples or a sophisticated evaluation framework. You need:

  1. A dataset that grows by one example every time something breaks in production.
  2. A deterministic evaluator for the failures that can be checked deterministically.
  3. An LLM judge for the failures that can't — from a model family different from your agent's, with a rubric that lives on the example, with repeated runs to wash out noise.
  4. A one-line command to compare two commits.

That's the whole thing. Everything else is a refinement of those four pieces. The team you're saving time for is your team a month from now, when someone proposes a prompt change at 5pm on a Friday and you have ninety seconds to decide whether it actually improves anything.