AS-201b · Module 2

Defense in Depth

4 min read

There is no single defense against prompt injection. Full stop. If anyone sells you a product that "solves prompt injection," they are either lying or confused. The nature of the vulnerability — untrusted text sharing a context window with trusted instructions in a system that processes both identically — means no individual mitigation is complete. The correct strategy is defense in depth: multiple independent layers, each of which catches what the others miss.

Layer 1: Input Sanitization Filter known injection patterns from user input before it reaches the model. Strip instruction-like language, escape special tokens, limit input length. This catches the obvious attacks — "ignore previous instructions," role-play requests, system prompt extraction attempts. It will not catch everything. That is why it is layer one, not the only layer.
Layer 2: System Prompt Hardening Design system prompts that explicitly instruct the model to resist override attempts. Include specific refusal instructions for common attack patterns. Frame the model's role narrowly and reinforce boundaries. Hardening is not foolproof — a sufficiently creative attacker can bypass any instruction — but it raises the bar from trivial to difficult.
Layer 3: Output Validation Check the model's output before it reaches the user or any downstream system. Does the response contain information that should be confidential? Does it deviate from the expected format? Does it include instructions or URLs that the model should not be generating? Automated classifiers can flag suspicious outputs for human review.
Layer 4: Architectural Isolation Limit what the model can access and do. If the model does not have access to the customer database, a successful injection cannot exfiltrate customer data. If the model cannot send emails, it cannot be tricked into forwarding sensitive information. Least privilege is the defense that works even when every other layer fails.

Can you explain why no single layer is sufficient? Not that it is insufficient — why. If you can answer that, you understand prompt injection at a deeper level than most security professionals. The answer: input sanitization fails because you cannot enumerate every possible injection phrasing. Prompt hardening fails because the model cannot fundamentally distinguish instructions from overrides. Output validation fails because not every malicious output looks abnormal. Architectural isolation fails because some agents need access to sensitive systems to do their job. Each layer has a known failure mode. The combination of all four layers means an attacker must bypass all four failure modes simultaneously.