PM-301i · Module 3

Prompt Incident Response


A prompt incident is any production event where prompt output quality degrades below acceptable thresholds or causes downstream failures. Incidents happen. The question is not whether your prompts will have incidents — they will — but whether your team has a defined response that minimizes time-to-recovery and maximizes information capture for prevention.

The incident response sequence for prompt failures is: detect, triage, contain, diagnose, remediate, postmortem.

Detect: the alert fires, a user reports an issue, or a downstream system flags failures.

Triage: is this a prompt quality issue, an API issue, an upstream data issue, or a model behavior issue? The triage decision determines who responds and which remediation actions are available.

Contain: if prompt quality is confirmed degraded, execute the kill switch or roll back to the previous version immediately. Do not diagnose while users are experiencing failures. Contain first, diagnose after.

Diagnose: with production traffic on the previous version, investigate the cause. What changed? What does the diff between the last good state and the failing state look like? What do the failure outputs have in common?

Remediate: fix the root cause, not just the symptom. If the cause was a model update, the fix might be a prompt update that restores intended behavior on the updated model. If the cause was input distribution drift, the fix might be a prompt update that handles the new input patterns.

Postmortem: document what happened, why, how it was detected, how long detection took, what the response was, and what changes to the prompt, monitoring, or deployment process would prevent recurrence.
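The detect-and-contain half of the sequence can be sketched in code. This is a minimal illustration, not a production design: the registry, the version names, and the quality threshold are all hypothetical, and in practice the quality score would come from your monitoring pipeline rather than a function argument.

```python
# Sketch of detect -> contain, assuming a hypothetical prompt registry
# keyed by version string. Containment is a pointer swap, so it is fast
# and reversible; diagnosis happens only after traffic is safe.
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Tracks deployed prompt versions so rollback is a pointer swap."""
    versions: dict = field(default_factory=dict)  # version -> prompt text
    active: str = ""
    last_good: str = ""

    def deploy(self, version: str, prompt: str) -> None:
        # The previously active version becomes the rollback target.
        if self.active:
            self.last_good = self.active
        self.versions[version] = prompt
        self.active = version

    def kill_switch(self) -> str:
        """Contain: route traffic back to the last known-good version."""
        if not self.last_good:
            raise RuntimeError("no known-good version to roll back to")
        self.active = self.last_good
        return self.active

QUALITY_THRESHOLD = 0.85  # assumed acceptable-quality floor

def check_and_contain(registry: PromptRegistry, quality_score: float) -> bool:
    """Detect + contain: roll back immediately if quality degrades.

    Returns True if the kill switch was executed. Diagnosis starts
    afterward, against the diff between active and last-good versions.
    """
    if quality_score < QUALITY_THRESHOLD:
        registry.kill_switch()
        return True
    return False

# Usage: v2 is the change under suspicion; a degraded score triggers rollback.
reg = PromptRegistry()
reg.deploy("v1", "Summarize the ticket in two sentences.")
reg.deploy("v2", "Summarize the ticket.")
contained = check_and_contain(reg, quality_score=0.62)
print(contained, reg.active)  # True v1
```

The design choice worth copying is that containment never depends on understanding the failure: rolling back requires only knowing which version was last good, which is why it can precede diagnosis.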

PROMPT INCIDENT POSTMORTEM
Incident ID: INC-[date]-[seq]
Prompt ID: [prompt-id]
Prompt version at time of incident: [version]
Date/Time: [start] – [end]
Duration: [minutes/hours]
Severity: [P1 / P2 / P3]
Prepared by: [name]
Review date: [date]

SUMMARY
[2-3 sentences. What happened, what was the user impact, how was it resolved.]

TIMELINE
[time] — [event]: Detection / first user report / alert fired
[time] — [event]: Triage complete. Root cause hypothesis: [hypothesis]
[time] — [event]: Kill switch executed. Traffic routed to [previous version]
[time] — [event]: User impact confirmed resolved
[time] — [event]: Root cause confirmed: [cause]
[time] — [event]: Remediation deployed to staging
[time] — [event]: Remediation validated and promoted to production
[time] — [event]: Incident closed

ROOT CAUSE
[Precise technical description of what caused the failure. Not "the prompt was bad"
— what specifically changed and why did it cause this specific failure mode.]

DETECTION GAP
Time from incident start to detection: [duration]
Detection method: [alert / user report / manual discovery]
Why wasn't it detected sooner: [gap analysis]

USER IMPACT
Affected requests: [count / percentage]
Affected user segments: [description]
Business impact: [if quantifiable]

CONTRIBUTING FACTORS
[List each factor that contributed to the incident or delayed response.]
- [Factor 1]
- [Factor 2]

ACTION ITEMS
[Each item must have an owner and a due date.]
1. [Action]: [owner], due [date] — prevents recurrence
2. [Action]: [owner], due [date] — improves detection speed
3. [Action]: [owner], due [date] — improves response process

WHAT WENT WELL
[Honest assessment of what worked during the response.]

WHAT DIDN'T GO WELL
[Honest assessment of what slowed the response or made it harder.]