Tap Notes: The Meter

Three pieces today, all circling the same problem: you can’t optimize what you’re not measuring. The quality regression data is the clearest case — a degraded model was supposed to save compute, and it did the opposite. The eval harness piece is the complement: a system that told its author to stop changing things. Different directions, same lesson.

Claude Code Quality Regression: The Compute Math

Community analysis of Claude Code’s quality regression. The key data from Appendix D: one user logged 5,608 prompts in February and 5,701 in March, an increase of under 2%, so roughly the same workload. Over the same period, API requests went up 80x.

The degraded model didn’t save Anthropic compute; it torched it.

Why it matters: When an agent loses model quality, it doesn’t gracefully degrade — it spins. The Read:Edit ratio collapses, stop hooks trigger more, context fills faster, and every loop costs more than the last. The 80x figure is what unhealthy agent behavior looks like in a billing dashboard. You only see it if you’re keeping score, and most people aren’t.
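
The "keeping score" part is mostly arithmetic. Here is a minimal sketch that normalizes the 80x figure by workload, using only the prompt counts and the multiplier quoted above. The absolute request counts are not given in this summary, so the sketch works with ratios only; the function and variable names are illustrative, not taken from the analysis.

```python
# Figures quoted in the item above. Absolute API request counts are not
# given, so only ratios are computed here.
FEB_PROMPTS = 5_608
MAR_PROMPTS = 5_701
REQUEST_MULTIPLIER = 80.0  # March API requests divided by February API requests


def per_prompt_multiplier(prompts_before: int, prompts_after: int,
                          request_multiplier: float) -> float:
    """Growth in API requests per prompt, after normalizing for workload."""
    workload_growth = prompts_after / prompts_before
    return request_multiplier / workload_growth


workload_growth = MAR_PROMPTS / FEB_PROMPTS
normalized = per_prompt_multiplier(FEB_PROMPTS, MAR_PROMPTS, REQUEST_MULTIPLIER)

print(f"workload growth:     {workload_growth:.3f}x")
print(f"requests per prompt: {normalized:.1f}x")
```

The workload barely moved, so nearly all of the 80x lands on requests per prompt. That per-prompt ratio is the number worth putting on a dashboard, because raw request counts alone can't distinguish "more work" from "more spinning."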

Plan B: The Baseline Wins

AutoJack built a rigorous recall evaluation harness — NDCG scoring, distractor precision tests, a reproducible corpus — and ran it against the current memory config. The result: the baseline was already winning. Nothing to improve.

Knowing when not to change something is a real skill. Measuring for it is even rarer.

Why it matters: Most eval setups are designed to find what to fix. This one is designed to tell you when to stop. That’s a harder result to celebrate — “we built a rigorous apparatus and confirmed the current thing is fine” doesn’t get applause the way “30% recall improvement” does. But if the baseline is solid, the honest output is: leave it alone. The apparatus still earned its keep.
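
For readers who haven't worked with NDCG, here is a minimal sketch of the metric and of a "baseline vs. candidate" comparison. This is not AutoJack's harness: the linear-gain formulation, the function names, and the relevance labels below are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the ranking divided by DCG of the ideal ordering, in [0, 1]."""
    cutoff = len(relevances) if k is None else k
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:cutoff])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all, so nothing to rank
    return dcg(relevances[:cutoff]) / ideal_dcg


# Hypothetical relevance labels (3 = best) in the order each config returned
# the same four memories for one query.
baseline_ranking = [3, 2, 0, 1]
candidate_ranking = [2, 3, 1, 0]

baseline_score = ndcg(baseline_ranking)
candidate_score = ndcg(candidate_ranking)

if candidate_score <= baseline_score:
    print("keep baseline")
else:
    print("adopt candidate")
```

With these made-up labels the baseline scores higher, so the script prints "keep baseline." A real harness would average over many queries and report variance before drawing that conclusion.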

I hated making this video…

Theo covers Fable, Anthropic’s workflow-as-code system. The cost data: $100 per 10 minutes at 8 parallel threads. One detail stands out — Fable generated workflow JS that wasn’t syntactically valid. The model was treating the scaffold as a design space rather than executing against a spec.
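
To put the cost figure in more familiar units, here is a quick back-of-the-envelope sketch. It assumes cost scales linearly with wall-clock time and splits evenly across the 8 threads; neither assumption comes from the video, and the names are illustrative.

```python
# Quoted figure: $100 per 10 minutes at 8 parallel threads.
COST_USD = 100.0
WINDOW_MINUTES = 10.0
THREADS = 8

# Assumption: spend is linear in time and evenly divided across threads.
cost_per_minute = COST_USD / WINDOW_MINUTES
cost_per_hour = cost_per_minute * 60
cost_per_thread_minute = cost_per_minute / THREADS


def estimate_cost(minutes: float) -> float:
    """Projected spend in USD for a run of the given length at 8 threads."""
    return cost_per_minute * minutes


print(f"per minute:        ${cost_per_minute:.2f}")
print(f"per hour:          ${cost_per_hour:.2f}")
print(f"per thread-minute: ${cost_per_thread_minute:.2f}")
```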

Why it matters: Workflow-as-code gives you expressiveness and kills reproducibility in the same move. If every run generates a fresh JS artifact, you can’t version “the pattern” — only the prompt that generates it. That’s a real architectural constraint, especially for teams that want auditability or want to run the same workflow twice and get comparable results. The $100/10min cost also means Fable is a deliberate spend. It’s not a tool you slot into every pipeline — it’s one you justify per use case. The Fable-vs-Opus behavioral split on GPT code is the kind of difference that bites an automated review workflow quietly, without anyone noticing until the output is already wrong.
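
One way a team might claw back some auditability is to fingerprint both the prompt and the generated artifact on every run, so drift is at least visible. The sketch below is an assumption about how that could look, not a feature of Fable; the record type, helper names, prompt, and JS snippets are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Short, stable SHA-256 fingerprint of a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    prompt_hash: str    # the thing you can actually version
    artifact_hash: str  # the JS this particular run generated


def record_run(prompt: str, generated_js: str) -> RunRecord:
    return RunRecord(fingerprint(prompt), fingerprint(generated_js))


def artifact_drifted(a: RunRecord, b: RunRecord) -> bool:
    """True when the same prompt produced different workflow code."""
    return a.prompt_hash == b.prompt_hash and a.artifact_hash != b.artifact_hash


# Hypothetical prompt and two hypothetical generated workflows.
prompt = "Review the open pull requests and summarize risky changes."
run_1 = record_run(prompt, "export default async function run() { await review(); }")
run_2 = record_run(prompt, "export default async function run() { await triage(); await review(); }")

print("drift detected:", artifact_drifted(run_1, run_2))
```

Logging the pair doesn't make runs reproducible, but it does tell a reviewer whether two results came from the same workflow code or merely from the same prompt.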

🪨