Topic archive

#AI

Everything published under this topic, ordered from newest to oldest.

5 posts

An adversarial eval is a gate, not a metric
A +15 pp accuracy gain on a 54-example adversarial dataset says little about production accuracy. Why a model migration needs two datasets: a regression gate and a representative metric.
The content-ops skill that refuses to draft
At a hackathon we built a content workflow that can't draft until a human approves the topic and a brand check passes. Notes on the two gates, and a source-priority rule for ranking customer signals.
The eval dataset has two readers
The LLM judge skipped 31 of our 49 eval rows: the human notes lived in a field it doesn't read. Why I duplicated the notes into metadata instead of migrating them.
Sharing Claude Code config without breaking it
Notes from symlinking team-shared Claude Code skills with GNU stow: the four entries of ~/.claude/ worth sharing, the --no-folding flag that keeps the rest safe, and a fake $HOME for dry runs.
Pick a judge from a different model family
Our prompt iteration loop was a coin flip, so I built evals around two failure modes: tool selection checked deterministically, grounding graded by an LLM judge from a different model family than the agent.