An adversarial eval is a gate, not a metric
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
Everything published under this topic, ordered from newest to oldest.
9 posts
At a hackathon we built a content workflow that can't draft until a human approves the topic and a brand check passes. Notes on the two gates, and a source-priority rule for ranking customer signals.
The LLM judge skipped 31 of our 49 eval rows: the human notes lived in a field it doesn't read. Why I duplicated the notes into metadata instead of migrating them.
Notes from symlinking team-shared Claude Code skills with GNU stow: the four entries of ~/.claude/ worth sharing, the --no-folding flag that keeps the rest safe, and a fake $HOME for dry runs.
Our prompt iteration loop was a coin flip, so I built evals around two failure modes: tool selection checked deterministically, grounding graded by an LLM judge from a different model family than the agent.
Daniel H. Pink’s Drive takes a fascinating look at what truly motivates us, flipping conventional wisdom on its head. Pink explores the evolution of human motivation and challen...
TL;DR Systems thinking is a powerful approach to solving complex problems by recognizing patterns, thinking holistically, and balancing technical and social factors. Key traits ...
Hi! Just as with learning an instrument, where we can't expect to play a Jimmy Page riff after a couple of hours of practice, we need to internalize the ideas and firs...
TL;DR A company's goal is not to make people happy, but to provide safety when they're doing something they like or break and feel improvement in the process. I wrote this several months ago...