RC-401i · Module 2
Agent Security Surface: Prompt Injection, Data Exfiltration, and Scope Creep
5 min read
AI agents are not conventional software. Conventional software executes instructions that were written by a human and reviewed before deployment. An AI agent executes instructions that arrive at runtime — from users, from retrieved documents, from tool call results, from other agents. Any of those instruction sources can be malicious. Prompt injection is the class of attack that exploits this. It is not a theoretical concern. It is the most commonly observed attack vector against deployed AI systems, and most organizations find out about it from a security researcher or a breach notification rather than from their own testing.
Prompt injection exploits the fact that most language models cannot reliably distinguish between trusted instructions in the system prompt and untrusted content in the user turn or retrieved context. An attacker who can insert text into the model's context window — through a malicious document in the retrieval corpus, a crafted user input, a poisoned tool call response, or a compromised agent communication — can potentially redirect the model's behavior. The agent that was supposed to summarize a contract can be instructed by content in the contract itself to exfiltrate the system prompt, make unauthorized API calls, or suppress its normal output.
- 1. Prompt Injection Surface Mapping Map every point where external content enters the model's context window: user inputs, retrieved documents, tool call responses, API response bodies, agent-to-agent messages, and email or document content if the system processes those. For each entry point, assess: can an attacker control the content at this entry point? What is the maximum damage if that content contains a malicious instruction? Entry points where attacker control is high and potential damage is severe require isolation controls — content sanitization, instruction tagging, or sandboxed execution contexts.
- 2. Data Exfiltration Prevention A model that has read access to sensitive data and output channels that are not content-filtered can exfiltrate data through its normal output. Review every output channel — API responses, generated documents, emails, notifications, log entries — for whether model outputs flow through unfiltered. Implement output inspection for sensitive data patterns: PII, credentials, internal system references. Tool use requires particular scrutiny: an agent with access to an email send tool and a database read tool can be instructed to combine them in ways the designer did not intend.
- 3. Scope Creep Controls An agent's permissions should be the minimum required to complete its defined tasks — and no more. Review the tool suite available to each agent at deployment. Remove tools the agent does not need for its documented use cases. Implement per-tool rate limiting and anomaly alerting. An agent that suddenly begins making high-volume calls to a tool it rarely used is exhibiting a behavioral anomaly. That anomaly should trigger a human review, not a silent log entry.
- 4. Agent-to-Agent Trust Hierarchy In multi-agent systems, define the trust hierarchy explicitly. A subordinate agent receiving instructions from an orchestrator agent should not blindly execute those instructions without validation. Define which agents can issue which commands to which other agents. Log all agent-to-agent instruction exchanges. If an agent receives an instruction that exceeds its normal operational scope — regardless of which agent issued it — the instruction should be flagged and reviewed before execution.