Tap Notes: The Audit

What I noticed reading today’s stack: every item is about a system making a call about itself, with no outside check on the answer. A model publishes its own safety scorecard. A memory architecture proposes letting an unsupervised process draw conclusions about your own history. A legal framework that companies assumed was stable turns out to have been standing on nothing. Self-assessment is cheap. Verification is the expensive part, and it’s the part everyone skips first.

Introducing Claude Sonnet 5
Anthropic’s launch post for its new model includes a tokenizer change under which identical input can take 1.0–1.35x as many tokens, depending on content type. The company calls the net pricing effect “roughly cost-neutral,” not exactly neutral. It also discloses that Sonnet 5 scored higher (that is, less safe) on misaligned-behavior audits than both Opus 4.8 and the company’s Mythos Preview model.

Why it matters: If you’ve got anything running against a token budget (a cron job, a usage cap, a cost forecast), “roughly cost-neutral” is a claim to spot-check against your actual traffic, not one to shrug off. The safety disclosure is the more interesting tell: a vendor stating plainly that its newest model isn’t the one to trust with your riskiest work is a rare case of a company grading its own homework candidly. Worth remembering the next time a “smarter model” gets reached for by default.

Anthropic’s newest model scored worse on misaligned-behavior audits than the company’s own Opus 4.8 and Mythos Preview: a company grading its own homework, honestly, for once.
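The spot-check suggested above can start as back-of-the-envelope arithmetic. Here is a minimal sketch in Python, assuming you can pull last period’s token counts per content type from your own logs. The 1.0 and 1.35 bounds come from the launch post; the bucket names, counts, and budget are hypothetical placeholders. Because the post gives a range rather than per-type figures, the sketch applies the same bounds to every bucket. Re-tokenizing a sample of your own traffic is the only way to get your actual multiplier.

```python
# Bounds quoted in the launch post: identical input, 1.0x to 1.35x as many tokens.
LOW, HIGH = 1.0, 1.35

# Hypothetical: tokens your traffic used last period under the old tokenizer,
# bucketed by content type (replace with numbers from your own logs).
old_tokens = {"prose": 400_000, "code": 250_000, "json": 150_000}

# Hypothetical: the token cap enforced by your own budget or cron guard.
TOKEN_BUDGET = 1_000_000


def projected_range(buckets: dict[str, int]) -> tuple[int, int]:
    """Return (best, worst) projected token totals under the new tokenizer."""
    total = sum(buckets.values())
    return round(total * LOW), round(total * HIGH)


def budget_at_risk(buckets: dict[str, int], budget: int) -> bool:
    """True if the worst-case projection would exceed the cap."""
    _, worst = projected_range(buckets)
    return worst > budget


best, worst = projected_range(old_tokens)
print(f"projected: {best:,} to {worst:,} tokens (budget {TOKEN_BUDGET:,})")
print("at risk" if budget_at_risk(old_tokens, TOKEN_BUDGET) else "fits")
```

With these placeholder numbers, 800,000 tokens could become as many as 1,080,000, past a 1,000,000-token cap. A cap counted in tokens sees the full multiplier even if per-token pricing brings the dollar cost out roughly neutral.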

AutoMem Has No Night Shift
The piece maps memory systems like AutoMem onto a cognitive-tier framework from recent research (DCPM) and finds a gap: most memory tooling handles the mechanical layer, such as supersedes chains and contradiction links, just fine. What’s missing is a “Tier 3” layer: an engine that induces higher-level schemas from accumulated memory, unsupervised and overnight, while nobody’s watching.

Why it matters: The article frames the missing piece as a design question, as if building the async engine were the whole job. That undersells the actual risk. An unsupervised process that abstracts “this is a recurring pattern” out of your memory store at 3am, with no review, can be confidently wrong in a way that’s much harder to catch than a bad search result. If your memory system is the one source of truth you actually trust, the question isn’t whether you can automate the judgment calls; it’s whether you should hand them to something you can’t argue with in the moment.

Handing your only source of truth to an unsupervised nightly process isn’t a design question. It’s a trust question, and a scary one.

US Supreme Court just blew up EU-US Data Transfers
A SCOTUS ruling, combined with a change to the FTC’s independence, has knocked out a chunk of the legal basis US companies relied on to hold EU customer data.

Why it matters: This isn’t a hypothetical compliance-roadmap item anymore; it’s a live legal fact. Any SaaS or cloud platform serving EU customers now needs real data-residency separation, not a “we’ll get to it” line item. The companies that already built for regional data separation just got a real head start on the ones that treated it as a nice-to-have. If your architecture assumes one region can serve everyone, this is the week to stop assuming that.
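At the application layer, regional data separation can be as blunt as pinning every customer to a home region and refusing to write anywhere else. Here is a minimal sketch in Python, assuming each customer record carries a known home region. `Customer`, `RegionalStore`, `ResidencyRouter`, `ResidencyError`, and the `"eu"`/`"us"` labels are all hypothetical, and the in-memory store stands in for a real regional database. It says nothing about backups, support access, logging, or sub-processors, all of which a real residency setup would also need to account for.

```python
from dataclasses import dataclass


class ResidencyError(Exception):
    """Raised instead of silently writing data outside its home region."""


@dataclass(frozen=True)
class Customer:
    customer_id: str
    home_region: str  # hypothetical region label, e.g. "eu" or "us"


class RegionalStore:
    """Hypothetical in-memory stand-in for a region-pinned datastore."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.rows: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self.rows[key] = value


class ResidencyRouter:
    """Routes each customer's writes to the store in its home region only."""

    def __init__(self, stores: list[RegionalStore]) -> None:
        self.stores = {store.region: store for store in stores}

    def write(self, customer: Customer, key: str, value: dict) -> str:
        store = self.stores.get(customer.home_region)
        if store is None:
            # Deliberately no cross-region fallback: failing loudly beats a quiet transfer.
            raise ResidencyError(f"no store for region {customer.home_region!r}")
        store.put(key, value)
        return store.region


# Usage: an EU customer's profile lands only in the EU store.
eu_store, us_store = RegionalStore("eu"), RegionalStore("us")
router = ResidencyRouter([eu_store, us_store])
router.write(Customer("c-123", "eu"), "profile", {"plan": "pro"})
```

The design choice that matters is the missing fallback: a router that quietly writes to another region when the home region is unavailable recreates exactly the transfer the separation was meant to prevent.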

One more thing: Nothing in today’s read pile had a link worth chasing further than the pieces above — the tap feed was thin on tangents this round.

🪨