PM-301i · Module 2
Quality Drift Detection
4 min read
Quality drift is the gradual degradation of prompt output quality over time without any explicit change to the prompt. It happens because the world around the prompt changes: the model is updated by the provider (model drift), the upstream data that populates the prompt variables changes in distribution (input distribution drift), or user behavior shifts in ways that move the inputs outside the prompt's tested range.
Drift is insidious because it does not trigger error alerts. The system is functioning. The prompt is running. Outputs are being generated. But the outputs are slowly getting worse, and without quality monitoring, that degradation is invisible until it causes a visible problem — a user complaint, a downstream system failure, or a manual audit that reveals the drift.
Leading indicators for quality drift are signals that appear before the drift becomes severe:
- Format compliance drift: the format compliance rate was consistently 97% and has dropped to 91% over three weeks, with no prompt change and no model change you are aware of. Investigate.
- User correction rate increase: the correction rate was 8% and has risen steadily to 14% over two months. The prompt has not changed; something else has.
- Downstream error rate increase: the downstream system is rejecting a higher percentage of outputs. The outputs pass format compliance but fail richer validation.
- Output length drift: the average output token count is increasing. The prompt has not changed; the model may be generating more verbose outputs following a provider update.
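A leading indicator like format compliance can be tracked with a simple rolling window compared against a baseline. The sketch below is illustrative, not a prescribed implementation; the class name, window size, and 5-point tolerance are assumptions chosen to match the 97%-to-91% example above.

```python
from collections import deque

class LeadingIndicatorTracker:
    """Rolling tracker for a binary quality signal (hypothetical sketch).

    Record True when an output passes the check (e.g. parses into the
    expected format), False when it does not, and compare the recent
    rate to the established baseline.
    """

    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, ok: bool) -> None:
        self.events.append(ok)

    def rate(self) -> float:
        if not self.events:
            return 1.0  # no data yet: assume compliant
        return sum(self.events) / len(self.events)

    def drifted(self, baseline: float, tolerance: float = 0.05) -> bool:
        # Flag when the rolling rate falls more than `tolerance`
        # below baseline (e.g. a 97% baseline dropping to 91%).
        return self.rate() < baseline - tolerance

# Simulate 200 recent outputs: 180 compliant, 20 non-compliant.
tracker = LeadingIndicatorTracker(window=200)
for _ in range(180):
    tracker.record(True)
for _ in range(20):
    tracker.record(False)
# Rolling rate is 0.90; against a 0.97 baseline this flags drift.
```

The point of the window is that the signal reflects recent behavior only; a metric averaged over all time would dilute a three-week slide into noise.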
QUALITY DRIFT DETECTION — SIGNAL HIERARCHY
TIER 1 (Automatable, Real-Time)
- Format compliance rate [alert if drops > 5% from 30-day baseline]
- Output token count (avg) [alert if changes > 15% from 30-day baseline]
- API error rate [alert if > 2x 30-day baseline]
- Fallback trigger rate [alert if > 3x 30-day baseline]
TIER 2 (Automatable, 24-Hour Lag)
- Downstream error rate [alert if increases > 3% from 30-day baseline]
- Response latency p95 [alert if increases > 20% from 30-day baseline]
TIER 3 (Requires Instrumentation, 48-Hour Lag)
- User correction rate [alert if increases > 25% from 30-day baseline]
- User regeneration rate [alert if increases > 25% from 30-day baseline]
- Task completion rate [alert if drops > 5% from 30-day baseline]
TIER 4 (Manual, Monthly)
- Golden dataset regression run [re-run monthly against production model]
- Sample output quality review [human review of 25 random production outputs]
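The automatable tiers above reduce to a set of comparisons against 30-day baselines. A minimal sketch, assuming metrics arrive as plain dictionaries (the metric names here are illustrative, not a fixed schema), and treating the rate thresholds as percentage-point differences and the count/latency thresholds as relative changes:

```python
def check_drift_alerts(current: dict, baseline: dict) -> list[str]:
    """Evaluate the Tier 1-3 alert rules against 30-day baseline values."""
    alerts = []

    def pct_change(name: str) -> float:
        # Relative change vs. baseline (assumes nonzero baseline values).
        return (current[name] - baseline[name]) / baseline[name]

    # Tier 1: automatable, real-time
    if baseline["format_compliance"] - current["format_compliance"] > 0.05:
        alerts.append("format compliance dropped > 5% from baseline")
    if abs(pct_change("avg_output_tokens")) > 0.15:
        alerts.append("avg output tokens changed > 15% from baseline")
    if current["api_error_rate"] > 2 * baseline["api_error_rate"]:
        alerts.append("API error rate > 2x baseline")
    if current["fallback_rate"] > 3 * baseline["fallback_rate"]:
        alerts.append("fallback trigger rate > 3x baseline")

    # Tier 2: automatable, 24-hour lag
    if current["downstream_error_rate"] - baseline["downstream_error_rate"] > 0.03:
        alerts.append("downstream error rate increased > 3% from baseline")
    if pct_change("latency_p95") > 0.20:
        alerts.append("p95 latency increased > 20% from baseline")

    # Tier 3: requires instrumentation, 48-hour lag
    if pct_change("user_correction_rate") > 0.25:
        alerts.append("user correction rate increased > 25% from baseline")
    if pct_change("regeneration_rate") > 0.25:
        alerts.append("user regeneration rate increased > 25% from baseline")
    if baseline["task_completion"] - current["task_completion"] > 0.05:
        alerts.append("task completion rate dropped > 5% from baseline")

    return alerts
```

Whether "> 5%" means percentage points or a relative change is a team decision; the sketch picks points for rates and relative change for counts, and either choice is fine as long as it is applied consistently.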
ROOT CAUSE INVESTIGATION SEQUENCE
When drift detected:
1. Check: was there a model provider update? (check provider changelog)
2. Check: has input distribution shifted? (compare current input stats to baseline)
3. Check: was there an upstream data pipeline change? (check data team)
4. Check: does the drift correlate with a specific user segment or input type?
5. If none of the above: re-run golden dataset eval to confirm scope of drift
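Step 5 can be sketched as a comparison of fresh scores against the scores each golden case received when the baseline was captured. The case schema, `score_fn` callable, and 10-point regression margin below are all assumptions for illustration; `score_fn(case)` stands in for running the production model plus your evaluator on one case.

```python
def golden_regression(golden: list[dict], score_fn) -> dict:
    """Re-run a golden dataset and report which cases regressed.

    Each case carries an `id` and the `baseline_score` recorded when the
    golden set was established. `score_fn(case)` returns the case's score
    against the current production model (model call + eval, assumed here).
    """
    regressed = []
    for case in golden:
        now = score_fn(case)
        # Flag cases that fell more than 10 points below their baseline score.
        if now < case["baseline_score"] - 0.10:
            regressed.append((case["id"], case["baseline_score"], now))
    return {
        "total": len(golden),
        "regressed": regressed,
        "regression_rate": len(regressed) / len(golden) if golden else 0.0,
    }
```

The per-case breakdown matters more than the aggregate: if regressions cluster in one input type or user segment, that points back to step 4 rather than a model-wide change.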