AS-301d · Module 2

Canary Tokens and Tripwires

3 min read

A canary token is a piece of information planted in the system prompt that has no legitimate reason to appear in any output. If the model outputs the canary, it has been manipulated — either by injection or by a system prompt extraction attack. The canary does not prevent the attack. It detects it. And detection at the output layer catches attacks that every input filter missed.

System Prompt Canaries Embed a unique, random string in the system prompt with an instruction never to output it. If the string appears in any model output, the output is flagged and blocked. The canary should be long enough to avoid false positives and random enough that the model has no reason to generate it independently.
Data Canaries Plant synthetic data records in databases and documents that the agent can access. If a canary record appears in model output, the model has accessed and leaked data it should not have surfaced. Data canaries detect exfiltration attacks that use the model as a data extraction channel.
Behavioral Tripwires Define behavioral boundaries that, if crossed, trigger an alert. If the model attempts to access a tool it has never used before, generates an output in an unexpected format, or references information from a previous user's session, the tripwire fires. Tripwires are cheap to implement and catch novel attack patterns that signature-based detection misses.