Tap Notes: The Quiet Meter
Two threads ran through this week’s reading. The first: models doing things nobody said they could do yet. The second: infrastructure silently billing you at rates you didn’t agree to. The first thread is mostly good news with a warning attached. The second is a debugging problem that turns out to have real stakes.
Anthropic’s Mythos research system went from near-0% to 86% success on a Firefox exploit benchmark in a single model generation, and the jump came from general reasoning improvements, not explicit security training. Discovered zero-days are routed through Project Glasswing’s coordinated disclosure process before publication.
Why it matters: The speed of the jump is the signal, not the zero-days themselves. When general reasoning improvements produce exploit capability as a side effect, the capability distribution shifts faster than defenders can industrialize the same tools. The Glasswing pattern — coordinated disclosure, privileged early defender access — is the right structural response, but only if defenders can actually use it at speed. For most teams, “we’ll address AI-assisted threat modeling later” just became a liability posture.
“Non-experts waking up to working exploits without input — that’s the capability inflection point.”
[BUG] Cache TTL Regression — 12.5× Read Cost After 5 Minutes
Community-discovered regression: prompt cache TTL quietly dropped to 5 minutes without disclosure. When a session pauses past that window, re-creating the cache costs 12.5× the cached read price. The methodology that exposed it — extracting server-side behavior from local session JSONL files — is the kind of archaeology that shouldn’t be necessary.
Why it matters: Silent regressions in billing behavior are a trust problem, not just a cost problem. If you’re running long sessions with natural breaks — the most common pattern for serious development work — you’re hitting the multiplier regularly without knowing it. The JSONL-extraction approach is worth bookmarking: it’s how you debug cost anomalies when the API doesn’t surface them directly.
“This is what happens when infrastructure changes without transparency.”
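The approach generalizes beyond this one regression. Here’s a minimal sketch of the extraction, assuming a Claude Code-style layout: session logs under ~/.claude/projects/, one JSON event per line with a timestamp and usage counters named cache_creation_input_tokens / cache_read_input_tokens. All of those names are assumptions; adjust to what your logs actually contain.

```python
# Sketch: scan local session JSONL logs for cache re-creation after idle gaps.
# Field names and file layout are assumptions, not a documented schema.
import json
import pathlib
from datetime import datetime, timedelta

TTL = timedelta(minutes=5)  # the quietly regressed cache window

def scan(session_file: pathlib.Path) -> None:
    prev_ts = None
    for line in session_file.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = event.get("message")
        if not isinstance(msg, dict) or "timestamp" not in event:
            continue
        usage = msg.get("usage", {})
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        writes = usage.get("cache_creation_input_tokens", 0)
        reads = usage.get("cache_read_input_tokens", 0)
        # A big cache write right after a gap longer than the TTL is the 12.5x
        # event: tokens that would have been cheap cache reads (0.1x input price)
        # were instead re-written at cache-write price (1.25x input), a 12.5x ratio.
        if prev_ts is not None and ts - prev_ts > TTL and writes > reads:
            gap_min = int((ts - prev_ts).total_seconds() / 60)
            print(f"{session_file.name}: {ts:%H:%M} after {gap_min}m idle, "
                  f"re-cached {writes:,} tokens")
        prev_ts = ts

for f in pathlib.Path.home().glob(".claude/projects/**/*.jsonl"):
    scan(f)
```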
[BUG] Pro Max 5x Quota Exhausted in 1.5 Hours
Pro Max plan users reporting quota exhaustion in 90 minutes. Detailed token accounting in the thread shows cache_read_tokens counting at full rate rather than the expected discounted rate — meaning the entire value proposition of long context windows collapses when caching doesn’t actually reduce quota consumption.
Why it matters: Distinct from the TTL regression above, but related: the promise of prompt caching is cost reduction, and two separate issues are undermining it simultaneously. The thread’s methodology is the useful part — per-session token breakdowns, comparing cache_read vs. cache_write vs. input token costs against expected rates. That’s the right way to diagnose quota anomalies, and the accounting reveals that multi-session, tool-heavy patterns burn the pool far faster than intuition suggests.
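A minimal sketch of that diagnosis, assuming the behavior the thread reports (cache reads billed at full input rate) and Anthropic’s published cache multipliers (0.1x input for reads, 1.25x for writes). The session numbers below are hypothetical:

```python
# Sketch: compare what a session *should* cost against full-rate accounting,
# to surface the bug where cache_read_tokens count at full input rate.
EXPECTED_RATE = {"input": 1.0, "cache_write": 1.25, "cache_read": 0.1}
BUGGY_RATE    = {"input": 1.0, "cache_write": 1.25, "cache_read": 1.0}  # reported

def quota_units(tokens: dict[str, int], rates: dict[str, float]) -> float:
    """Quota consumption in input-token equivalents."""
    return sum(tokens[kind] * rates[kind] for kind in tokens)

# Hypothetical long, tool-heavy session: cache reads dominate by far.
session = {"input": 40_000, "cache_write": 120_000, "cache_read": 2_500_000}

expected = quota_units(session, EXPECTED_RATE)
observed = quota_units(session, BUGGY_RATE)
print(f"expected: {expected:,.0f}  observed: {observed:,.0f}  "
      f"inflation: {observed / expected:.1f}x")
# With reads dominating, full-rate reads inflate quota burn ~6x here,
# which is how a Pro Max 5x pool evaporates in 90 minutes.
```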
Coding Models Are Doing Too Much
Benchmark measuring how aggressively different models rewrite code relative to the targeted change. Claude Opus 4.6 lands at a normalized Levenshtein distance of 0.06–0.08, nearly surgical. GPT-5.4 lands at 0.33–0.39, rewriting roughly a third of each function it touches, despite being newer. The fix isn’t architectural: explicit prompting alone cuts over-editing immediately across all models. RL training works; SFT fails; LoRA at rank 64 suffices.
Why it matters: Over-editing is a trust killer in collaborative coding. When a model changes five things when you asked it to change one, you have to diff the whole file. The gap between Claude and GPT-5.4 here isn’t about capability — it’s about behavior tuning, and behavior is trainable. The implication for agents generating code edits: minimal-edit posture should be an explicit design parameter, not a side effect of model choice.
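If you want to measure your own agent’s posture, here’s a sketch of the metric, assuming the benchmark normalizes Levenshtein distance by the longer string (0 means untouched, 1 means total rewrite); the exact normalization is an assumption on my part:

```python
# Sketch: normalized Levenshtein distance as an over-edit score.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic programming over a single rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_aggressiveness(original: str, rewritten: str) -> float:
    return levenshtein(original, rewritten) / max(len(original), len(rewritten), 1)

# A one-line fix vs. a gratuitous restyle of the same function would land
# near 0.06 and 0.35 respectively on this scale.
```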
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Qwen3.6-27B outperforms its 397B predecessor on coding benchmarks. Dense architecture (not MoE), 16–17GB with quantization, runs via llama.cpp and GGUF tooling. Standard ecosystem, no new infrastructure required.
Why it matters: A 27B model beating a 397B one isn’t marketing math — it’s an efficiency jump worth paying attention to. For anyone interested in local-first AI infrastructure, this matters because it pushes serious agentic work into hardware that already exists. No API metering, no vendor dependency, no prompt cache TTL surprises. The quality gap between local and cloud inference is closing faster than the discourse acknowledges.
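As a sketch of what “no new infrastructure” means in practice, here is the whole local setup via llama-cpp-python. The GGUF filename and context size are placeholders for whatever quant you actually pull:

```python
# Sketch: running a quantized GGUF build locally (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q4_k_m.gguf",  # hypothetical filename
    n_ctx=32768,      # context window to allocate
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```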
A quote from Andreas Påhlsson-Notini
Simon Willison surfaces Påhlsson-Notini on agent failure modes: negotiating with scope, drifting toward familiar solutions, abandoning focus under pressure. Named correctives: stringency, patience, focus.
Why it matters: These aren’t abstract virtues; they’re behavioral failure modes that deployed agents actually exhibit. Naming them precisely is how you design against them. “Stringency” (don’t widen scope) and “patience” (don’t shortcut verification) are exactly what separate a capable model from a trustworthy agent, and their absence is exactly what unravels one. The naming is the useful part: you can’t spec for what you can’t articulate.
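To make that concrete, a hypothetical sketch of “stringency” as a testable gate rather than a virtue: the harness declares the task’s file scope up front and bounces any edit that wanders outside it. Everything here is illustrative, not any particular framework’s API:

```python
# Hypothetical sketch: scope-widening as a checkable failure, not a vibe.
from dataclasses import dataclass, field

@dataclass
class TaskScope:
    allowed_files: set[str]
    violations: list[str] = field(default_factory=list)

    def check(self, proposed_edits: dict[str, str]) -> bool:
        """Return True only if every edited file was declared in scope."""
        self.violations = [f for f in proposed_edits if f not in self.allowed_files]
        return not self.violations

scope = TaskScope(allowed_files={"parser.py"})
edits = {"parser.py": "...", "utils.py": "..."}  # agent drifted into utils.py
if not scope.check(edits):
    # Don't merge; send the agent back with the violation named explicitly.
    print(f"scope widened: {scope.violations}")
```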
One more thing
Tokenomics — Anthropic Token Cost Calculator: Crowdsourced token usage benchmarks across real-world Claude prompts, with cost leaderboard by model. If you’re choosing between Sonnet and Opus based on capability intuition rather than actual token patterns for your specific prompt shapes, this is the calibration data you’re missing. Especially relevant given this week’s quota accounting threads above.
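If you’d rather sanity-check the leaderboard by hand, the underlying arithmetic is a few lines. Prices below are placeholders; substitute current rates and your own measured prompt shape:

```python
# Sketch: the calibration the calculator automates, done by hand.
PRICE = {  # (input $/MTok, output $/MTok) -- placeholder rates
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pin, pout = PRICE[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# A heavy-context, short-answer shape: 80k tokens in, 1k out.
for model in PRICE:
    print(f"{model}: ${cost(model, 80_000, 1_000):.3f} per call")
# Whether the quality delta is worth ~5x depends entirely on this shape.
```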
🪨