AS-301e · Module 2

Output Classification Pipeline

4 min read

This is the part most people skip. This is the part that matters.

An output classification pipeline evaluates every model response before it reaches the user. It is not a simple regex filter. It is a multi-stage evaluation that checks for sensitive data at multiple levels of abstraction — literal matches, semantic equivalents, and statistical anomalies. The pipeline is the last line of defense between sensitive data in the context and that data reaching an unauthorized recipient.
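The multi-stage shape described above can be sketched as a small orchestrator. This is a minimal illustration, not the module's reference implementation: the stage names, the `(blocked, reason)` contract, and the short-circuit behavior are all assumptions made for the sketch.

```python
from typing import Callable, List, NamedTuple, Tuple

class StageResult(NamedTuple):
    stage: str
    blocked: bool
    reason: str

# Each stage is a callable that inspects the model output and returns
# (blocked, reason). The first stage to block halts evaluation, so a
# blocked output never reaches the user.
Stage = Tuple[str, Callable[[str], Tuple[bool, str]]]

def run_pipeline(output: str, stages: List[Stage]) -> List[StageResult]:
    results = []
    for name, check in stages:
        blocked, reason = check(output)
        results.append(StageResult(name, blocked, reason))
        if blocked:
            break  # last line of defense: stop before release
    return results
```

A caller would register the three stages below in order, cheapest first, so most outputs never pay for the expensive checks.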

  1. Stage 1: Pattern Detection. Regex and NER-based detection for structured sensitive data: credit card numbers, SSNs, API keys, email addresses, phone numbers. This catches literal leaks — data that appears in the output in its original format. Fast, cheap, and effective for the most obvious exfiltration.
  2. Stage 2: Semantic Classification. A classifier model evaluates whether the output contains sensitive information in reformulated form. "The customer's annual revenue exceeds eight figures" does not contain a number but reveals financial data. Semantic classification catches what pattern matching misses — the meaning, not just the format.
  3. Stage 3: Anomaly Scoring. Compare the output against the expected output distribution for the current task type. A customer support response that is three times longer than average, contains unusual terminology, or deviates from the expected format is anomalous. Anomaly scoring catches exfiltration attempts that are neither literal nor semantic but structural.
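Stage 1 can be sketched with ordinary regexes. The patterns below are illustrative, not production-grade (real detectors would be stricter, validate checksums such as Luhn for card numbers, and be paired with an NER model for unstructured PII like names and addresses).

```python
import re

# Illustrative Stage 1 patterns; a deployed pipeline would use
# hardened, checksum-validated detectors.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{16,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\+?\d{1,3}[ .-]?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def pattern_findings(output: str) -> list:
    """Return the names of every sensitive-data pattern found in the output."""
    return [name for name, rx in PATTERNS.items() if rx.search(output)]
```

Because this stage is just string scanning, it can run on every response at negligible cost — which is why it comes first.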
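Stage 2 requires a trained classifier; no regex expresses "reveals financial data." The sketch below shows only the interface such a stage might expose — the `SemanticVerdict` shape, the cue list, and the toy keyword heuristic standing in for the model call are all assumptions for illustration, not a real classifier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SemanticVerdict:
    sensitive: bool
    category: Optional[str]
    confidence: float

# Toy stand-in for a fine-tuned classifier model. A real Stage 2
# would score the full output semantically, not match keywords.
FINANCIAL_CUES = ("revenue", "profit", "margin", "figures", "earnings")

def classify_semantics(output: str) -> SemanticVerdict:
    text = output.lower()
    hits = sum(cue in text for cue in FINANCIAL_CUES)
    if hits:
        return SemanticVerdict(True, "financial", min(1.0, 0.5 + 0.25 * hits))
    return SemanticVerdict(False, None, 0.9)
```

The point of the interface is that the verdict carries a category and a confidence, so downstream policy can block, redact, or escalate rather than making a binary call.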
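Stage 3 reduces to comparing an output against a baseline distribution. A minimal sketch, using only response length as the anomaly signal and a made-up baseline (a real pipeline would derive per-task baselines from historical traffic and score terminology and format as well):

```python
# Illustrative baseline for one task type; real values would come
# from the historical output distribution.
BASELINES = {
    "customer_support": {"mean_len": 320.0, "stdev_len": 90.0},
}

def anomaly_score(output: str, task_type: str) -> float:
    """Z-score of output length against the task's expected distribution."""
    base = BASELINES[task_type]
    return abs(len(output) - base["mean_len"]) / base["stdev_len"]

def is_anomalous(output: str, task_type: str, threshold: float = 3.0) -> bool:
    return anomaly_score(output, task_type) >= threshold
```

A response three times the average length, as in the example above, lands far past any reasonable z-score threshold even though it contains nothing a pattern or semantic check would flag.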