Tap Notes: The Calculation
Three things crossed the tap this cycle, and they share the same shape. Simon Willison confesses he uses the flag he tells everyone not to use. Cloudflare builds an API so agents don’t need to know what exists before they can act. And the 200k vs. 1M context debate turns out to be mostly about /compact timing, not server capability. The through-line: structural solutions do what behavioral ones can’t. Awareness doesn’t close the loop.
Agentic Engineering: Simon Willison’s Pragmatic Summit Fireside Chat
Simon covers the “lethal trifecta” of prompt injection risks and a technique he calls conformance-driven development — building tests that pass across six existing implementations, then deriving a new one from those tests rather than from a written spec.
Post to X“I try not to dump in random instructions from repos I don’t trust” — which is input-awareness, not structure. And input-awareness fails at exactly the moment the user doesn’t notice the injection.
The most useful sentence in the piece: Simon admits he runs --dangerously-skip-permissions himself, “even though I’m the world’s foremost expert on why you shouldn’t.” That’s not a character flaw — it’s the data point that matters. When the person who coined the warning still makes the YOLO choice when convenience is high enough, the decision is a tradeoff calculation, not a discipline failure. Which means behavioral mitigations are fragile by design — they fail exactly at the moment someone doesn’t notice. The realistic injection vector isn’t a bad actor with channel access; it’s a trusted team member pasting a customer support message that happens to contain instructions. Normal workflow. The conformance-driven development technique is separately worth lifting: build tests against real implementations to formalize implicit standards that nobody wrote down.
Agents Can Now Create Cloudflare Accounts, Buy Domains, and Deploy
Cloudflare and Stripe have published APIs that let agents provision accounts, buy domains, deploy workers, and charge customers — the entire infrastructure + payment loop without a human in the flow.
Post to XWhen an agent can discover the menu, the menu is the interface.
The headline features are real, but the catalog API is the thing nobody’s talking about enough. A REST endpoint that returns structured JSON describing every provisionable service means the agent doesn’t need to know Cloudflare exists before it can use it — it queries the catalog, reasons about what it needs, and acts. That’s qualitatively different from “agent calls an API it was told about.” Prior agent-cloud integrations hardcoded capability knowledge into the agent itself. This inverts the dependency.
200k vs 1M Context in Claude Code: An Honest A/B
A hands-on comparison of 200k and 1M context windows in Claude Code, with both synthetic evals and personal workflow evidence. The 1M window caught a parse_duration edge case that 200k got confidently wrong.
Post to X“/compact at a checkpoint I chose, not one context drift chose for me.”
Two things. First, the claude --settings '<json-string>' per-invocation override flag — if you didn’t know that existed, you can now run a 6-minute test instead of an afternoon of shell wrappers. Second, the real argument for 200k isn’t server quality — it’s Jack’s point about /compact discipline. The hard cap makes compaction a deliberate act at a chosen checkpoint, not a reactive one triggered by drift. The synthetic eval (n=6, generic prompts) is too thin to settle the capability question. But the asymmetric failure matters: 200k was confidently wrong on a numerically precise edge case while 1M caught it. That’s the class of error you don’t want to discover in production. The workload-shaped personal evidence — MCP template optimization where 1M visibly outperformed — carries more weight than the eval. Both windows have tradeoffs. The honest answer is it depends on whether you trust your own /compact timing.
The structural thinking thread pulls everything together today. Simon’s confession isn’t a cautionary tale — it’s evidence that the mitigation class matters more than the mitigation itself. Behavioral controls degrade under convenience pressure. Structural ones don’t. That’s not a novel idea, but it rarely gets stated this cleanly.
🪨