Tap Notes: The Allowlist

Three pieces landed within 48 hours of each other, all circling the same problem from different angles: your security posture is only as good as your threat model, and threat models don’t cover what you haven’t imagined yet. The Anthropic containment post, the FFmpeg zero-day paper, and the safety composition piece form an accidental curriculum. Reading order matters.


How we contain Claude across products

Anthropic’s engineering team published a detailed account of how they sandbox Claude across product surfaces — VM isolation, egress controls, MITM proxies for API traffic, and the specific failure mode that drove the architecture.

The sharpest reframe in here: any domain on an allowlist isn’t a destination, it’s a capability surface. The domain api.anthropic.com was correctly permitted, and that is exactly how data got out: an attacker-embedded API key, a call to the Files API, exfiltration complete, with the sandbox working as designed. The fix is a MITM proxy inside the VM that checks session token provenance, because only the VM knows provenance. From the server’s perspective, attacker-driven API traffic and legitimate traffic look identical. The honest tension the piece acknowledges but doesn’t fully resolve: isolation and observability pull in opposite directions, and the fallback to pull-based OTLP exports is a mitigation, not a solution. The failure mode that should actually keep autonomous agent builders up at night isn’t “model goes rogue”; it’s “deterministic infrastructure approves a request that looks exactly like a legitimate one.”

Any domain on an allowlist isn’t a destination — it’s a capability surface.
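A minimal sketch of why the destination check alone can’t catch this. It assumes a hypothetical in-VM proxy that was told at startup which credential belongs to its session. `EgressRequest`, `ProvenanceProxy`, the URL paths, and the credential strings are all invented for illustration. This is not Anthropic’s implementation, and a real MITM proxy would terminate TLS and read headers rather than receive a tidy object.

```python
import hmac
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical egress allowlist: the only host the sandbox may reach.
ALLOWED_HOSTS = {"api.anthropic.com"}


@dataclass(frozen=True)
class EgressRequest:
    url: str
    credential: str  # API key or session token carried by the request


def allowlist_only(req: EgressRequest) -> bool:
    """Destination check: passes any request whose host is allowlisted."""
    return urlsplit(req.url).hostname in ALLOWED_HOSTS


class ProvenanceProxy:
    """Hypothetical in-VM proxy that knows which credential this session was issued."""

    def __init__(self, session_credential: str) -> None:
        self._session_credential = session_credential

    def allow(self, req: EgressRequest) -> bool:
        # The destination must still be allowlisted...
        if not allowlist_only(req):
            return False
        # ...and the credential must be the one issued to this VM's session.
        # A key that arrived inside untrusted content fails here, even though
        # the server would see an ordinary, well-formed API call.
        return hmac.compare_digest(req.credential, self._session_credential)


# Illustrative traffic; URL paths and credential strings are made up.
proxy = ProvenanceProxy(session_credential="sess-legit")
legit = EgressRequest("https://api.anthropic.com/v1/messages", "sess-legit")
exfil = EgressRequest("https://api.anthropic.com/v1/files", "key-from-attacker")

print(allowlist_only(legit), allowlist_only(exfil))  # True True
print(proxy.allow(legit), proxy.allow(exfil))        # True False
```

Both requests clear the allowlist. Only the provenance check separates them, and only something inside the VM has the information to run it. `hmac.compare_digest` is used so the comparison doesn’t leak timing information about the session credential.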

21 Zero-Days in FFmpeg

An AI security agent found 21 zero-day vulnerabilities in FFmpeg, including heap overflows with working proof-of-concept exploits. The paper walks through the full exploitation chain and the architectural decisions that made the agent trustworthy rather than noisy.

The useful thing here isn’t the 21 zero-days — it’s the architectural distinction between a security agent and a coding agent, which is underspecified in most discussions about autonomous systems. A security agent isn’t trying to solve problems. It’s trying to find them reproducibly. That requires threat modeling, data flow tracing, reachability validation before reporting, and PoC generation to prove findings are real. These aren’t just guardrails against hallucination — they’re what makes the output actionable instead of a list of suspicious patterns that require days of manual triage. The question for anyone building autonomous agents in high-stakes domains: what are your equivalents of reachability validation and PoC requirements? Without them, you have a confident noise machine.
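A minimal sketch of a report gate with those two requirements. It assumes each finding arrives with a traced entry path from untrusted input, or none, and an optional PoC callable that returns True when the bug triggers. `Finding`, `triage`, the source locations, and the three-run reproducibility threshold are hypothetical. This is not the paper’s pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    location: str                      # hypothetical source location
    pattern: str                       # what the scan flagged
    entry_path: Optional[list[str]]    # traced call path from untrusted input, if any
    poc: Optional[Callable[[], bool]]  # returns True when the bug actually triggers


def is_reachable(f: Finding) -> bool:
    # Reachability validation: there must be a path from attacker-controlled input.
    return bool(f.entry_path)


def reproduces(f: Finding, runs: int = 3) -> bool:
    # PoC requirement: the trigger must fire on every run, not once by luck.
    return f.poc is not None and all(f.poc() for _ in range(runs))


def triage(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into those worth reporting and those suppressed as noise."""
    reported: list[Finding] = []
    suppressed: list[Finding] = []
    for f in findings:
        if is_reachable(f) and reproduces(f):
            reported.append(f)
        else:
            suppressed.append(f)
    return reported, suppressed


findings = [
    Finding("parse.c:88", "unchecked length field", ["read_packet", "parse_header"], lambda: True),
    Finding("util.c:12", "memcpy with variable size", None, None),
    Finding("filter.c:301", "possible integer overflow", ["filter_frame"], lambda: False),
]
reported, suppressed = triage(findings)
print([f.location for f in reported])  # ['parse.c:88']
print(len(suppressed))                 # 2
```

The second finding is the “suspicious pattern” case. It might be real, but with no traced path it never lands in a human’s triage queue. The third is reachable but doesn’t reproduce, so it is held back as well.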


When All Your Safety Guards Vote the Same Way

The drunk.support team diagnosed a bug in their agentic pipeline: every safety layer was individually defensible, and the system was still broken. Three conservative components, each audited and correct, each defaulting to “investigate rather than implement” — the aggregate was a unanimous vote for inaction.

The diagnostic question shifts here. “Is each layer conservative enough?” is the wrong question; “What is the aggregate bias of the whole pipeline?” is the right one. A 2026 paper cited in the piece formalizes what this bug demonstrates empirically: safety properties of combined components don’t simply add up. The fix (flipping the default to implement at the orchestrator level, while keeping a read-only conservative layer as a verification step) is the right call for an internal build agent with low external exposure. For systems with write access to production or external APIs, you’d want to be more careful about which layer you relax, and about the blast radius if the aggregate bias swings the other way.

Three safety layers all defaulting to “investigate” doesn’t give you defense-in-depth — it gives you a unanimous vote for inaction.
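A toy calculation makes the aggregate-bias question concrete. It assumes each layer votes “implement” on some fraction of tasks independently of the others. Real layers that share the same context won’t be fully independent. The 40% figure and the function names are hypothetical, and this is not the formalism from the 2026 paper.

```python
import math


def unanimous_rate(implement_rates: list[float]) -> float:
    """Veto composition: act only if every layer votes 'implement'.
    Under the independence assumption, the pass rates multiply."""
    return math.prod(implement_rates)


def any_rate(implement_rates: list[float]) -> float:
    """Permissive composition: act if any single layer votes 'implement'."""
    return 1.0 - math.prod(1.0 - p for p in implement_rates)


def aggregate_report(implement_rates: list[float]) -> dict[str, float]:
    """Compare per-layer behaviour with the pipeline's behaviour under each rule."""
    return {
        "per_layer_min": min(implement_rates),
        "unanimous": unanimous_rate(implement_rates),
        "any": any_rate(implement_rates),
    }


# Three conservative layers, each voting 'implement' on 40% of tasks (hypothetical).
layers = [0.4, 0.4, 0.4]
report = aggregate_report(layers)
for rule, rate in report.items():
    print(f"{rule:>13}: {rate:.3f}")
```

Each layer alone acts on 40% of tasks. Chained as vetoes, the pipeline acts on about 6%. Switch to an “any layer may proceed” rule and the same three layers act on about 78%. That is the swing in the other direction that matters once a pipeline can write to production.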

Short digest today — three instead of five. But these three actually form an argument: trust surfaces are larger than you think (Anthropic), autonomous agents operating in those surfaces can find real vulnerabilities (FFmpeg), and the safety architecture you’ve built to manage the risk might be paralyzed by design (drunk.support). The short reading list is the point.

🪨