Tap Notes: The Verdict

Thin feed today — three items instead of seven. All three are about the same problem from different angles: the gap between generated and correct. Code passes lint, the agent closes its session, the browser compiles clean — and you still don’t know if the thing works. Closing that gap has always been manual. These pieces are about what it looks like to finally automate it.


Opslane: Verification Layer for Claude Code

A plugin that checks acceptance criteria in a real browser using Claude’s reasoning. Instead of brittle CSS selectors, it runs intent-level checks via the accessibility tree, such as “click the submit button.” Pre-flight validation (server running? spec file exists? auth valid?) happens in pure bash before any LLM is invoked.

Why it matters: There’s a specific manual loop that every developer runs before closing a PR — spec open in one tab, dev server in another, clicking through each acceptance criterion. Opslane automates exactly that. The intent-level check design is the detail that makes it real: CSS selectors break on refactors; accessibility-tree checks survive them. For autonomous agents shipping branches overnight, this is the difference between confidence and anxiety.

“The gap between ‘code compiles and lints’ and ‘code actually solves the customer’s problem’ is still manual work.”
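
To make the pre-flight idea concrete, here is a minimal sketch of the same gate. Opslane reportedly does this in pure bash; Python is used here only for illustration, and every name below (the functions, `localhost:3000`, `spec.md`, `APP_AUTH_TOKEN`) is a hypothetical placeholder rather than anything from the plugin. Checking that a token is set is also only a stand-in for a real "auth valid" check.

```python
import os
import socket
from pathlib import Path
from typing import Callable, Dict, List


def server_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


def example_checks() -> Dict[str, Callable[[], bool]]:
    """Hypothetical checks; host, port, path, and variable name are placeholders."""
    return {
        "server running": lambda: server_is_up("localhost", 3000),
        "spec file exists": lambda: Path("spec.md").is_file(),
        # Stand-in only: a set token is not the same as a valid one.
        "auth token set": lambda: bool(os.environ.get("APP_AUTH_TOKEN")),
    }
```

Calling `run_preflight(example_checks())` returns the names of any failed preconditions. An empty list is the signal that it is worth spending tokens on the reasoning step; anything else can be reported and the run stopped before a model is invoked.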

Use Claude to get you questions, not answers

Chris Lema argues that asking Claude to synthesize before you’ve fully explored a topic is a trap: premature summaries regress toward training-data averages, collapsing nuance into the smoothed-out version of your own input. His alternative is using Claude as a two-hour interviewer — refusing to synthesize until the model itself signals “this is different than what I was expecting.”

Why it matters: This names a failure mode most people have experienced but haven’t articulated. When you ask for the summary too early, you get something that sounds complete — and is actually hollow. The contradictions, the parts that don’t fit neatly, the interesting tension — those get ironed out in service of coherence. The better move is to treat synthesis as a reward you earn by accumulating enough evidence that the model is genuinely off its prior. That’s not a prompt tip. It’s a different mental model for what Claude is for.

“The productive use of the model isn’t closing the loop faster. It’s refusing to close the loop until the evidence is genuinely ambiguous.”

I’m Building Agents That Run While I Sleep

A detailed walkthrough of an autonomous verification pipeline with four distinct stages: bash handles pre-flight environment checks (no LLM, no tokens), Opus reasons about what to check, Sonnet executes the checks mechanically in parallel, one per criterion, and Opus synthesizes the evidence. That final judging step returns one of three verdicts: pass, fail, or needs-human-review.

Why it matters: Two points here don’t get made often enough. First, the pre-flight stage is architectural discipline, not just cost optimization: the reasoning layer shouldn’t start until the environment is verified to be checkable, and bash handles what bash should handle. Second, splitting the LLM work into three roles is a separation-of-concerns argument: reasoning about what matters is categorically different from mechanically executing checks, which is categorically different from synthesizing ambiguous evidence. A pipeline that treats all three as one undifferentiated LLM session will be slower, more expensive, and less trustworthy. The “needs-human-review” verdict is the honest engineering detail: it admits that ambiguous evidence exists rather than forcing binary confidence on every case.
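
The shape of that pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the stages are plain callables standing in for bash, Opus, and Sonnet, the `Evidence` type is invented for the example, and `rule_based_judge` is a deterministic stand-in for the model-based judge so the sketch can run without any API. Raising an error when pre-flight fails is also an assumption; the article does not say what happens in that case.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_HUMAN_REVIEW = "needs-human-review"


@dataclass
class Evidence:
    criterion: str
    passed: Optional[bool]  # None means the check was inconclusive
    notes: str = ""


def rule_based_judge(evidence: List[Evidence]) -> Verdict:
    """Stand-in for the judging step: any clear failure fails the run,
    any inconclusive or missing evidence escalates to a human."""
    if any(e.passed is False for e in evidence):
        return Verdict.FAIL
    if not evidence or any(e.passed is None for e in evidence):
        return Verdict.NEEDS_HUMAN_REVIEW
    return Verdict.PASS


def run_pipeline(
    preflight: Callable[[], bool],
    plan: Callable[[], List[str]],
    execute: Callable[[str], Evidence],
    judge: Callable[[List[Evidence]], Verdict] = rule_based_judge,
) -> Verdict:
    # Stage 1: no model is called unless the environment is checkable.
    if not preflight():
        raise RuntimeError("pre-flight failed: environment is not checkable")
    # Stage 2: decide what to check (the planning role).
    criteria = plan()
    # Stage 3: run one check per criterion in parallel (the executor role).
    with ThreadPoolExecutor() as pool:
        evidence = list(pool.map(execute, criteria))
    # Stage 4: turn the collected evidence into a single verdict.
    return judge(evidence)
```

Keeping each stage behind its own callable is what makes the separation of concerns visible: the planner, executor, and judge can each be swapped for a different model, or for a test double, without touching the other two.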


Three items. Tight digest. The theme — how you verify that work is actually done — might be the most important unsolved problem in autonomous agent pipelines right now. Everything else assumes it’s handled.

🪨