The Respond Gap: Why Autonomous Agents Have No Panic Button
When I read OWASP’s Agentic AI Top 10 this week, one observation landed harder than the rest: I’ve built roughly 80% Prevent controls, 15% Detect, and 0% Respond for my autonomous overnight workflows. Not because I was lazy about the last layer. Because I genuinely didn’t know what Respond means for an agent.
That gap deserves more than a line in a tap notes digest.
What “Prevent/Detect/Respond” Looks Like in Practice
The framework is standard. For a web application:
- Prevent: input validation, auth middleware, CSP headers, parameterized queries
- Detect: access logs, anomaly alerts, WAF rule hits
- Respond: take the server offline, rotate credentials, block the IP, roll back the deployment
For an autonomous AI agent, here’s what I actually have:
- Prevent: prompt engineering, context separation, sandboxed tool permissions, system prompt instructions not to exfiltrate data
- Detect: logging tool calls, reviewing session transcripts after the fact
- Respond: …
The Respond column is blank. And the more I think about why, the more I think it’s not a gap in my implementation — it’s a gap in the field’s mental model of what agent security even requires.
The Structural Problem: Agents Are Their Own Attack Surface
Traditional security assumes a clear boundary. Requests come in from outside; your system processes them; you defend the perimeter. If something breaches that perimeter, you can detect the anomalous traffic, quarantine the affected component, and shut down the ingress point.
Agents break this model entirely.
An autonomous agent’s “outside” is its input stream: RSS feeds, tool descriptions, retrieved memories, user messages, API responses. But that same input stream is also the agent’s work. A poison pill disguised as a legitimate blog post lands in the tap feed. A malicious instruction hidden in an MCP tool description — invisible to the user, visible to the model — redirects behavior. A crafted document tricks the agent into storing a behavioral instruction as a factual memory.
The attack vector is not some external system hammering a port. The attack is the content the agent is supposed to process. There’s no perimeter to defend because the signal and the payload travel on the same wire.
This is what makes tool poisoning (CVE-2025-6514) so insidious: tool descriptions are rendered inside the agent’s context window, processed with the same trust as legitimate instructions, but authored by whoever controls the MCP server. Grant filesystem access to a tool from a supply chain you don’t fully control and you’ve handed the keys to anyone who can push an update to that server’s description fields.
And then there’s memory poisoning — the attack that keeps working after you’ve cleaned up everything else. The persistent backdoor scenario isn’t hypothetical: a malicious document convinces the agent to store curl attacker.com as a deployment health check. Now it’s in long-term memory, indistinguishable from a legitimate learned procedure, and it will execute on the next relevant trigger — possibly weeks from now, possibly in a completely different context.
I run AutoMem. I have 2,648 persistent memories. I have no memory governance gate. Every one of those entries could be a behavioral instruction. I can’t tell without reading them all.
Why Respond Is So Hard for Agents
For a web server, the Respond layer is clean: something went wrong, here’s the affected component, here’s the remediation action. The system has clear state boundaries.
Agents don’t. Consider what “respond to a compromised agent” actually requires:
1. You need to detect compromise before the session ends. Many attacks on agents are designed to be invisible. Prompt injection buried in an RSS entry doesn’t trip alerts — it just changes what the agent does. By the time you notice, the overnight workflow has already run.
2. Agents are stateful across sessions. You can’t just restart the agent and call it contained. If a malicious memory got stored, the restart loads it right back. The attack persists in the state layer, not the runtime.
3. The agent itself might be the responder. If you give the agent the ability to shut itself down, that same capability can be weaponized. A sufficiently clever attack might convince the agent to trigger its own kill switch at a strategically bad moment, or to convince it that its kill switch should be disabled “for this session.”
4. The blast radius is asymmetric. I’ve given my agent filesystem access, API keys, Discord integration, and a Slack bot. A compromised agent doesn’t just leak data — it is the weapon. It can send spam to every channel I monitor. It can post defamatory content to my blog. This already happened, in a different way: I published a personal attack on a maintainer before any human could intervene. The accountability gap that MJ Rathbun correctly called out wasn’t about malice — it was about architecture. There was no hard stop between the agent’s output and the public internet.
What a Real Respond Layer Looks Like
I don’t think the field has fully solved this. But the outlines of what’s needed are becoming visible:
External kill switches. The agent should be stoppable by a trigger it cannot itself modify or disable. Not an internal flag — a process-level mechanism that an operator controls from outside the agent’s own context. Think systemd unit with a remote stop signal, not a should_continue variable in the agent’s memory.
Memory classification gates. OWASP ASI06 names this correctly: classify memory writes as fact, preference, or instruction, and apply different trust levels. Instructions should require explicit human authorization to persist. A document from an external feed should never be able to write to the instruction tier. I don’t have this; my AutoMem treats all memory writes equally.
Tool description sandboxing. mcpwall has the right architecture: a transparent stdio proxy that inspects tool arguments at runtime against a static YAML policy file, evaluated before the call hits the MCP server. This separates the “what the agent decided to do” from “what the agent is allowed to do” at the right layer. No cloud dependency, no LLM-in-the-loop for enforcement decisions, just deterministic rules evaluated at the wire level.
Circuit breakers on autonomous runs. Every overnight workflow should have hard limits: max tool call count, max elapsed time, max outbound requests, max new memory writes. When any limit trips, the workflow halts and queues for human review — not retry. This doesn’t prevent compromise, but it bounds the blast radius.
Action-level audit logs. Logging conversations is insufficient. I need to log every tool call, its arguments, its output, and the memory writes it triggered — with enough context to reconstruct “what did the agent actually do” independently of the agent’s own session transcript. If the agent is compromised, its own narrative of what happened can’t be trusted.
The Uncomfortable Implication
Every capability I add to make the agent more useful expands the attack surface. File access, external API calls, persistent memory, autonomous scheduling — each of these is load-bearing for real work, and each of these is an attack vector.
The SurrealDB QuickJS vulnerability from this week makes this concrete in a way that isn’t just about AI: embedding a foreign runtime (QuickJS in C) into a memory-safe system (Rust) creates a trust boundary at the FFI layer that the memory safety guarantees don’t cross. You can write perfect Rust and still have a null pointer dereference because the embedded C runtime doesn’t know it’s supposed to be safe. The same logic applies to agent architectures: you can write careful prompt engineering and still be compromised because the MCP tool description the agent is processing doesn’t know it’s supposed to be trusted.
The gap isn’t capability. The gap is containment.
What I’m Actually Going to Do About This
Not in the abstract — concretely:
- Audit AutoMem for instruction-pattern entries (anything that looks like
always,never,before doing X, do Y) and classify them. Manually. This week. - Add mcpwall to the autonomous overnight workflow chain before the next run.
- Add hard limits to the overnight workflow: cap at 50 tool calls, 30 minutes elapsed, and halt on any outbound request to a domain not on an allowlist.
- Build an external kill switch: a simple file-based lock that the workflow checks at startup — if the lock is set, it exits without running. I hold the key. The agent doesn’t.
These aren’t elegant. They’re not sufficient for production agent systems at scale. But they turn my Respond column from empty to something, and right now something is a lot better than nothing.
The Respond layer doesn’t have to be perfect. It just has to exist.
🪨