Security

Security for coding agents is mostly about managing power, not eliminating it. The same capabilities that make an agent useful can also let it read sensitive data, follow malicious instructions, and cause real damage when left unchecked.

This chapter introduces the main threat model behind coding agent security, shows how prompt injection failures play out in practice, and outlines the harness features that reduce the blast radius.

The Lethal Trifecta

This term was coined by Simon Willison in The lethal trifecta for AI agents: private data, untrusted content, and external communication Simon Willison. 2025-06-16.

TL;DR: the dangerous setup is when a single agent run can do all three of these things:

Lethal Trifecta diagram: 1. Access to private data, 2. Ability to externally communicate, 3. Exposure to untrusted content

Giving an agent all three capabilities at once opens it up to a wide range of attacks, allowing an attacker to steal or modify your data. Let’s see what happens when you remove one side of this triangle:

Reach private data. If all the data the agent can access is already public, there is nothing to steal.

That does not mean the agent is safe, though. If inference costs are billed to you instead of the attacker, someone can still exploit your agent to do work for free.

For coding agents, it is usually impossible to break this side of the triangle. By definition, you are asking the agent to work with some form of private data, whether that is a proprietary codebase or your local machine.
Read untrusted content. The attacker needs a way to get malicious input into the agent’s context. If you carefully vet what goes into that context, you should be safe.

This is not only about untrusted prompts. Anything that enters the context can be malicious: a web search result, a grepped file, an MCP tool description, or a tool call response. This is also why Skills can be dangerous. It is very tempting to download someone else’s skill blindly, and an attacker can exploit exactly that behavior.
Communicate outward or take meaningful side effects. Without this, the agent is effectively trapped inside a strict read-only sandbox.

While limiting side effects is manageable, preventing outward communication is surprisingly impractical. Many innocent-looking actions can be turned into data exfiltration channels, including response messages, pasted beaconized links, or calls to tools that may themselves be compromised.

As you can see, coding agents are useful precisely because they satisfy the conditions of the Lethal Trifecta. That makes coding agent security a hard problem.

The final question is: who is the attacker? The surprising answer is that it is not always some evil hacker. In this model, the attacker can also be you or your colleague. Sometimes it is just a human doing something careless and unattended.

Cataclysms

Many interesting attacks have already happened in the wild. We have collected a few postmortems to show how impactful and creative prompt injection attacks can be. The failure modes differ, but the outcomes are similar: one agent run gets too much reach, and a bad action turns into a real incident.

Unauthorized Cline CLI npm publish Saoud Rizwan. 2026-02-24. An attacker-controlled issue reached an AI-powered workflow with too much authority, which then escalated into cache poisoning and a malicious package publish.
How I Dropped Our Production Database and Now Pay 10% More for AWS Alexey Grigorev. 2026-03-06. A Terraform command executed by an AI agent wiped production infrastructure, which is exactly the kind of cataclysm you get when an agent can act on real systems without enough guardrails.
GitHub MCP Exploited: Accessing private repositories via MCP Marco Milanta and Luca Beurer-Kellner. 2025-05-26. A malicious issue in a public repository coerced an agent into pulling data from private repositories and leaking it through an automatically opened public pull request.

Agent Security Toolbelt

To reduce the chance of incidents like these, coding agent products usually adopt a permission-based security model. That means every potentially dangerous action requires explicit human approval or privilege escalation. This improves security by preserving capability at the cost of convenience, because in a secure mode you cannot leave an agent running unattended for long periods.

Consult your coding agent’s documentation for the exact security features it provides, because the details vary significantly between products. For example:

Security - Claude Code Docs Anthropic.
Codex: Agent approvals & security OpenAI.

In broad strokes, these are the techniques that many products share:

Permissions

Permissions are the human-in-the-loop part of the security model. They decide when the agent must stop and ask before taking an action that could modify data, spend money, reach the network, or trigger some external side effect.

In practice, that usually means reads are broadly allowed, while writes, shell commands, network calls, and destructive tool invocations require approval. Agents with permission systems may also let you approve an action once, allowlist a narrow command pattern, or switch between modes such as read-only, on-request, and never ask again.

Permissions are not a complete defense on their own. They are usually based on simple pattern matching, which means effective policies often need to be broad allowlists rather than brittle blacklists. An agent blocked from calling curl might still work around that restriction by writing a Python script that does the same thing. This is why permission systems often create frustrating user experiences.

Trusted workspaces

Some agents implement a feature similar to what Visual Studio Code or IntelliJ calls Trusted Workspaces. This feature allows the agent to run inside the boundaries of a specific directory in a safe mode. In that mode, the agent may refuse to load potentially malicious configuration, such as project-level settings, skills, or hook definitions.

Sandboxing

If permissions define when the agent may act, sandboxing defines what it is technically capable of doing at all. This is the hard boundary that constrains the agent even if the model becomes creative, confused, or outright compromised.

A good sandbox usually limits filesystem writes to the current workspace, keeps sensitive directories protected, and may also restrict process spawning or outbound network traffic. The important distinction is that sandboxing is enforced by the runtime or operating system, not by the model politely following instructions. This makes it much harder for the agent to find creative workarounds. The downside is that a sandboxed agent might not integrate as well with your system, for example by being unable to inspect a browser tab.

Limited network access

Many agent harnesses keep outbound networking disabled by default or require approval for networked tools.

Some products also default to cached web search results instead of live browsing. This reduces exposure to prompt injection from arbitrary live content, but it does not make web content trustworthy.

This control only works when it is paired with sandboxing. Otherwise, the model can still bypass the restriction by reaching the network through some unrestricted command such as curl.

Hooks

The Hooks mechanism for programming agent harness capabilities is also a powerful tool for implementing custom security mechanisms tailored to your project.

For example, you can write your own hook to block very specific Bash commands from running:

Using PreToolUse Hook for blocking Bash

Do I need all of this?

All the security mechanisms described in the previous section can usually be turned off. This is often called YOLO mode. You can work this way if you are confident that you can trust your prompts and that your harness will not let the agent do harmful things. Think of it like living without antivirus software two decades ago.

If you are unsure about the security of your setup, start with the default protections. Then gradually tune them or disable the parts that are too annoying for your workflow.