Tap Notes: Before the Build

The pattern in this week’s reading wasn’t hard to spot. Every piece worth forwarding is, in some way, about structure — specs before delegation, test infrastructure before iteration, data models before backend work, fresh judges before trusting scores. The people getting real results aren’t prompting harder. They’re scaffolding better.


A Bug on the Dark Side of the Moon — JUXT

The JUXT team used behavioral specifications and Claude to find a 55-year-old bug in the Apollo Guidance Computer codebase. The approach: encode what the code is supposed to do as machine-checkable contracts, then trace every execution path for violations.

This isn’t a story about AI being good at code review. It’s about what happens when you stop asking “what does this code do?” and start asking “what is this code for?” Once the obligations are encoded as machine-checkable contracts, you don’t need a human to imagine every edge case. You can just check. That’s the lever. The article frames it as bug-finding; the real insight is that it scales verification.

Specifications are a forcing function — they make obligations explicit in a way code review never could.
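The idea reduces to a toy sketch (this is not JUXT’s tooling, and the bug and ranges here are invented for illustration): state what the code is for as a checkable predicate, then sweep the input space instead of eyeballing the implementation.

```python
def clamp_angle(deg: int) -> int:
    """Deliberately buggy: meant to normalize an angle to [0, 360)."""
    return deg % 361  # off-by-one: should be % 360

def spec_holds(deg: int) -> bool:
    """Contract: result lies in [0, 360) and is congruent to deg mod 360."""
    out = clamp_angle(deg)
    return 0 <= out < 360 and (out - deg) % 360 == 0

def find_violations(lo: int, hi: int) -> list[int]:
    """Check the contract over every input in [lo, hi) instead of reviewing code."""
    return [d for d in range(lo, hi) if not spec_holds(d)]

violations = find_violations(-720, 720)  # the contract, not a reviewer, finds the bug
```

Nobody had to imagine the edge case at 360; the contract surfaced it mechanically. That is the “scales verification” point in miniature.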

Vibing a Non-Trivial Ghostty Feature — Mitchell Hashimoto

Mitchell documents building a real terminal feature with AI: plan first (via a slow “oracle” model), prototype the UI as a probe rather than a commitment, hit a wall, restructure the data model, then build the backend. Explicit cleanup cycles throughout.

The sequencing is the whole point. You can’t ask AI to build something coherent if the data model is wrong — but you also can’t know the data model is wrong until you’ve tried something. Mitchell’s staged approach handles that tension deliberately: use prototypes as information-gathering tools, not deliverables. The “oracle pattern” (using a more capable model for planning before delegating execution) is worth stealing.
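The oracle pattern fits in a few lines. In this sketch, `call_oracle` and `call_executor` are hypothetical stand-ins for real model calls, and the plan contents are invented; the point is the shape: plan once with the capable model, then delegate each step.

```python
def call_oracle(task: str) -> list[str]:
    # Stand-in for a slower, more capable planning model returning an ordered plan.
    return [f"design data model for: {task}",
            f"prototype UI probe for: {task}",
            f"implement backend for: {task}"]

def call_executor(step: str) -> str:
    # Stand-in for a faster model that executes one step at a time.
    return f"done: {step}"

def oracle_run(task: str) -> list[str]:
    plan = call_oracle(task)                 # plan first, once
    return [call_executor(s) for s in plan]  # then delegate execution step by step

results = oracle_run("scrollback search")
```

Note the data model comes first in the plan; per Mitchell’s sequencing, everything downstream of a wrong data model is wasted execution.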


Prompt Engineering vs. Blind Prompting — Mitchell Hashimoto

Hashimoto draws a hard line between blind prompting (iterating on instinct) and actual prompt engineering (demonstration sets, accuracy baselines, systematic cost-accuracy tradeoffs). The key move: build a small test set first, measure zero-shot baseline, then improve from there.

The framing that landed was Unix philosophy applied to prompting — keep output simple, let deterministic code handle formatting. It’s the same principle as keeping functions small and composable. Once you have a test suite for your prompt, you’re doing engineering. Before that, you’re hoping. Most people building AI features are still hoping.

Once you have a test suite for your prompt, you’re doing engineering. Before that, you’re hoping.
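The “measure first” move is small enough to sketch. The demonstration set and the always-guess-one-label stub below are invented stand-ins for a real labeled set and a real zero-shot prompt, but the structure is the point: a fixed test set, an accuracy function, and a baseline number to beat.

```python
# Tiny demonstration set: (input, expected label).
DEMOS = [
    ("schedule a meeting for friday", "calendar"),
    ("what's 12 * 7?", "math"),
    ("remind me to call mom", "reminder"),
    ("add eggs to my shopping list", "list"),
]

def classify(text: str) -> str:
    # Stand-in for a zero-shot prompted model; here it just guesses one label.
    return "calendar"

def accuracy(model, demos) -> float:
    """Fraction of demos the model labels correctly."""
    hits = sum(model(text) == label for text, label in demos)
    return hits / len(demos)

baseline = accuracy(classify, DEMOS)  # the zero-shot number every prompt change must beat
```

Every subsequent prompt tweak gets the same `accuracy` call; if the number doesn’t move, the tweak didn’t work, whatever your instinct says.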

The 4-Part Loop That Eliminates AI Slop — Chris Lema

A framework for evaluating AI output quality across multiple dimensions using a rubric, fresh judges per iteration, and a regression trigger (revert if any dimension drops 2+ points). The fresh-judge-per-iteration detail is the linchpin.

The revert-on-regression rule is taste, codified. Without it, you end up rewarding trajectory (it’s improving!) instead of actual quality (is it good?). The fresh judge matters for the same reason: a judge with memory of prior iterations scores on momentum, not merit. This maps cleanly onto any autonomous workflow where you’re spawning subagents to evaluate output — carryover context is what lets “good enough” drift become “this is fine.”
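The trigger itself is tiny. A minimal sketch, assuming integer rubric scores and the article’s drop-of-2-points threshold; the dimensions and scores are invented, and in practice each candidate’s dict would come from a fresh judge with no memory of prior rounds:

```python
REGRESSION = 2  # revert if any rubric dimension drops by this many points

def should_revert(current: dict, candidate: dict) -> bool:
    """True if any dimension regressed by REGRESSION or more points."""
    return any(candidate[d] - current[d] <= -REGRESSION for d in current)

accepted = {"clarity": 7, "accuracy": 8, "voice": 6}
candidates = [
    {"clarity": 8, "accuracy": 8, "voice": 6},  # kept: nothing regresses
    {"clarity": 9, "accuracy": 6, "voice": 7},  # rejected: accuracy drops 2
]
for cand in candidates:
    if not should_revert(accepted, cand):
        accepted = cand  # only scores from a fresh judge ever update this
```

The second candidate is “better on average” and still gets thrown out — that’s the whole rule. Averaging is how slop sneaks in.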


Headless Everything for Personal AI — Simon Willison

Simon Willison covers the shift toward headless APIs as the native interface for AI agents — and what it means that Salesforce is betting on agent-first design over GUI-first design.

The economics here are clear: GUI automation is fragile, slow, and hard to scale; APIs are reliable and composable. Salesforce’s move signals something bigger — SaaS platforms that build for agent consumption win over platforms that treat agents as power users of human interfaces. APIs aren’t a convenience anymore; they’re an architectural bet about who your primary user is. Services without good APIs are legacy now. That happened quietly.


The Demo That Worked a Little Too Well — drunk.support

Jack saved a reference ID and later asked a fresh instance to recall it — on a different phone, in a session it had never opened. It did. The piece is nominally about the “aha” moment for cross-device memory. It’s actually about the infrastructure that made the demo possible.

The buried detail: Streamable HTTP transport is what lets AutoMem run as a remote MCP server, which is what lets Claude on mobile access it natively. Without that plumbing, the demo doesn’t work — you’d need a local server and half the magic disappears.

The onboarding problem for memory systems isn’t documentation; it’s that you can’t explain what cross-device recall feels like until someone experiences it. One specific, clean proof point does more than a thousand words of explanation. The honest gap the piece skips: as personal instances proliferate, data isolation and privacy between instances start mattering in ways templates don’t account for yet.

The onboarding problem for memory systems isn’t documentation — it’s that you can’t explain what cross-device recall feels like until someone experiences it.

All six pieces this week share the same uncomfortable implication: the bottleneck isn’t AI capability. It’s whether you’ve done the pre-work that makes capability useful. Specs, test sets, data models, evaluation loops, good APIs on both ends of the pipe. The scaffolding is the work now.

🪨