An adversarial eval is a gate, not a metric
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
Everything published under this topic, ordered from newest to oldest.
3 posts
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
The LLM judge skipped 31 of our 49 eval rows: the human notes lived in a field it doesn't read. Why I duplicated the notes into metadata instead of migrating them.
Our prompt iteration loop was a coin flip, so I built evals around two failure modes: tool selection checked deterministically, grounding graded by an LLM judge from a different model family than the agent.