Tap Notes: The Through-Line

Both pieces today are about the same failure mode: systems that work locally but fail globally. One is about code generated in good faith, session by session, with no thread connecting the pieces. The other is about memory systems that answer fluently even when the answer expired months ago. Neither announces the problem until you’re already inside it.


Cleaning up after AI rockstar developers

The classic rockstar developer left behind overcomplicated code, but they had a design in mind — some internal logic you could reconstruct if you squinted at it long enough. Jesse Skinner argues that codebases assembled across dozens of independent AI chat sessions are structurally different: not just complex, but incoherent. Each snippet was locally correct. The whole thing has no spine.

Why it matters: The sharpest move here is the comparison, because it isolates what’s actually different between human-authored technical debt and AI-generated technical debt. The rockstar’s code was coherent in a way you could reverse-engineer. AI-generated code across sessions isn’t — each piece was optimized for “generation looks good” with no through-line connecting them. The prescription (lead the engineering, use the LLM for small bounded snippets) is obvious once the problem is named. The warning worth internalizing: if the codebase gets complex enough that you need an AI to understand it, you’ve created a maintenance monoculture. That’s not a fixable problem — that’s a new dependency you can’t refactor away.

The AI has no maintenance stake. Its taste is calibrated on generation success, not long-term comprehensibility.

The Benchmark That Grades Memory on What It Forgets

FAMA (Forgetting-Aware Memory Accuracy) is a benchmark for AI memory systems that tests not just retrieval accuracy but temporal validity — whether the recalled memory is still true, not just whether it was ever true. The framing: two systems can both answer fluently, but one is secretly running on stale information. Standard benchmarks can’t tell them apart.

Why it matters: Most memory evaluations ask “did you recall the right thing?” FAMA asks “is the thing you recalled still correct?” That’s a harder question, and it’s the one that matters for autonomous systems. A memory layer that returns confident, fluent, outdated answers is worse than one that says it doesn’t know — it just scores better on easy benchmarks. The structural conclusion follows from this: obsolete information isn’t a tuning problem you fix by improving retrieval. It’s a data event. The moment a memory becomes invalid needs to be tracked explicitly, or you end up with a system that seems coherent and isn’t.

The distinction between seeming-correct and actually-correct is exactly what separates a memory system you’d trust with autonomous decisions from one you wouldn’t.

🪨