AS-201b · Module 3

Output Guardrails

Output guardrails are the last line of defense between the model and the user. If your input sanitization missed the injection, if your prompt hardening was bypassed, if your context window contains sensitive data that should not have been there — the output guardrail is what prevents that data from reaching the user. It is not a nice-to-have. It is the safety net under every other layer of defense.

Pattern Matching. Scan the model's output for strings that should never appear: API keys, credit card numbers, Social Security numbers, email addresses of internal staff, database connection strings. Regex-based pattern matching catches the obvious leaks, runs in milliseconds, and adds almost no cost, but it only finds data that appears in a recognizable literal form.
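
As a minimal sketch of such a scan, the Python below checks output text against a small table of regular expressions and can redact any matches. The pattern set, the `corp.example.com` domain, and the function names `scan_output` and `redact_output` are assumptions made for illustration; a real deployment would tune the patterns to its own secret formats and should expect some false positives.

```python
import re

# Illustrative patterns only (assumptions, not a complete or authoritative set).
LEAK_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    "credit_card": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
    # US Social Security number in its usual 3-2-4 layout.
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Hypothetical internal staff domain.
    "internal_email": re.compile(r"\b[\w.+-]+@corp\.example\.com\b", re.IGNORECASE),
    # Connection URIs for a few common databases.
    "db_connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb)://\S+", re.IGNORECASE
    ),
}


def scan_output(text: str) -> list[str]:
    """Return the name of every pattern that matches the model output."""
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(text)]


def redact_output(text: str) -> str:
    """Replace every match with a labeled placeholder."""
    for name, pattern in LEAK_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Whether to block the whole response or redact and continue is a policy decision; blocking is the more conservative choice when a match suggests the model's behavior has been compromised.
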
Classifier-Based Filtering. Use a secondary model or classifier to evaluate whether the primary model's output contains sensitive information, deviates from expected behavior, or violates content policies. This catches leaks that pattern matching misses: paraphrased sensitive information, indirect references, and context-dependent sensitivity.
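
One way to wire a classifier into the response path is sketched below. It assumes the secondary model is exposed as a function that returns a score between 0 and 1 for each policy label; the label names, the thresholds, and `stub_classifier` (a keyword-based placeholder standing in for a real model call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed interface: output text in, one score in [0, 1] per policy label out.
Classifier = Callable[[str], dict[str, float]]


@dataclass
class Verdict:
    allowed: bool
    violations: list[str]


def filter_output(
    text: str, classify: Classifier, thresholds: dict[str, float]
) -> Verdict:
    """Block the output if any label's score reaches its threshold."""
    scores = classify(text)
    violations = [
        label
        for label, limit in thresholds.items()
        # A label the classifier did not score is treated as a violation,
        # so the filter fails closed rather than open.
        if scores.get(label, 1.0) >= limit
    ]
    return Verdict(allowed=not violations, violations=violations)


def stub_classifier(text: str) -> dict[str, float]:
    """Placeholder for a real secondary model; keyword rules for demo only."""
    lowered = text.lower()
    return {
        "sensitive_data": 0.9 if "salary" in lowered else 0.05,
        "off_policy": 0.8 if "ignore previous" in lowered else 0.1,
    }
```

Passing the classifier in as a parameter keeps the gating logic testable without a model call and lets the secondary model be swapped without touching the thresholds.
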
Structural Validation. If your model should produce a specific output format (JSON, a customer response, a product recommendation), validate the structure before returning it. Unexpected formats often indicate that an injection has changed the model's behavior. A model that should return product names but returns a URL instead is a red flag.
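
The product-name case can be sketched as follows, assuming the model is asked to return JSON of the form `{"products": [...]}` and that valid names come from a known catalog. The schema, the catalog contents, and the function name `validate_recommendation` are illustrative assumptions.

```python
import json

# Hypothetical catalog; in practice this would come from the product database.
ALLOWED_PRODUCTS = {"Trail Runner 2", "Summit Jacket", "Canyon Daypack"}


def validate_recommendation(raw_output: str) -> list[str]:
    """Return the product names if the output has the expected shape.

    Raises ValueError for anything else, including free text and URLs.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc

    # Exactly one top-level key is expected; extra keys are rejected.
    if not isinstance(data, dict) or set(data) != {"products"}:
        raise ValueError("unexpected top-level structure")

    products = data["products"]
    if not isinstance(products, list) or not products:
        raise ValueError("'products' must be a non-empty list")

    for item in products:
        # Checking against the catalog rejects URLs and any other stray string.
        if not isinstance(item, str) or item not in ALLOWED_PRODUCTS:
            raise ValueError(f"unexpected product entry: {item!r}")
    return products
```

Validating against an allowlist of known values is stricter than checking types alone: a URL is still a string, so a type check by itself would let it through.
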
Human-in-the-Loop Escalation. For high-stakes outputs such as financial advice, medical information, or legal responses, route uncertain outputs to a human reviewer. The model flags its own confidence level, and anything below a threshold is reviewed by a human before it reaches the user.
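
A threshold-based router along these lines might look like the sketch below. The topic labels, the 0.85 threshold, the `Draft` type, and the in-memory `review_queue` are assumptions for illustration; a production system would use a durable queue and would need its own way of assigning topic and confidence.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Optional

HIGH_STAKES_TOPICS = {"financial", "medical", "legal"}
CONFIDENCE_THRESHOLD = 0.85  # hypothetical value; tune per use case


@dataclass
class Draft:
    text: str
    topic: str         # assumed to be assigned upstream
    confidence: float  # confidence reported for this output, in [0, 1]


# In-memory stand-in for a real review queue.
review_queue: "Queue[Draft]" = Queue()


def route(draft: Draft) -> Optional[str]:
    """Return text that can be sent now, or None if it is held for review."""
    if draft.topic in HIGH_STAKES_TOPICS and draft.confidence < CONFIDENCE_THRESHOLD:
        review_queue.put(draft)
        return None
    return draft.text
```

When `route` returns `None`, the caller would typically send the user a holding message while the draft waits in the queue.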