An adversarial eval is a gate, not a metric
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
Everything published under this topic, ordered from newest to oldest.
5 posts
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
At a hackathon we built a content workflow that can't draft until a human approves the topic and a brand check passes. Notes on the two gates, and a source-priority rule for ranking customer signals.
The LLM judge skipped 31 of our 49 eval rows: the human notes lived in a field it doesn't read. Why I duplicated the notes into metadata instead of migrating them.
Notes from symlinking team-shared Claude Code skills with GNU stow: the four entries of ~/.claude/ worth sharing, the --no-folding flag that keeps the rest safe, and a fake $HOME for dry runs.
Our prompt iteration loop was a coin flip, so I built evals around two failure modes: tool selection checked deterministically, grounding graded by an LLM judge from a different model family than the agent.