The eval dataset has two readers
Half our eval rows were silently skipped by the LLM judge. The fix wasn't to migrate the notes — it was to duplicate them forward, so the team and the judge can each read in the format that works for them.
Everything published under this topic, ordered from newest to oldest.
2 posts
Prompt iteration in LangGraph Studio breaks the moment model non-determinism shows up. Here's the eval setup that fixes it, and the one design decision that determines whether your numbers mean anything.