OC-301h · Module 3

Postmortem & Systemic Prevention

A postmortem that identifies the root cause but does not implement prevention is a documentation exercise. A postmortem that implements prevention is a system improvement. Every incident should make the system more resilient — not by adding monitoring for the specific failure that just occurred (that is fighting the last war) but by strengthening the systemic defense that should have caught it.

The postmortem structure for AI incidents has six parts:

  • Timeline: minute-by-minute from detection to resolution
  • Root Cause: the chain of events that produced the failure
  • Detection Gap: how long the system operated in a failed state before detection
  • Blast Radius: what was affected and what was corrected
  • Contributing Factors: conditions that enabled the root cause
  • Preventive Actions: systemic changes that prevent this class of failure, not just this specific failure

Each preventive action has an owner, a deadline, and a verification method. Preventive actions without deadlines are aspirations. Aspirations do not prevent recurrence.
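The structure above can be sketched as a small data model. This is a minimal illustration, not a prescribed schema: the class names `PreventiveAction` and `Postmortem`, the field names, and the `aspirations` helper are all assumptions made for this example. The only rule it encodes is the one stated above: a preventive action that lacks an owner, a deadline, or a verification method is flagged as incomplete.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class PreventiveAction:
    """One systemic change. All names here are hypothetical."""
    description: str
    owner: Optional[str] = None
    deadline: Optional[date] = None
    verification: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "owner": self.owner,
            "deadline": self.deadline,
            "verification": self.verification,
        }
        return [name for name, value in required.items() if not value]


@dataclass
class Postmortem:
    """The six postmortem sections as plain fields (illustrative layout)."""
    timeline: list[tuple[datetime, str]]
    root_cause: str
    detection_gap_minutes: float
    blast_radius: str
    contributing_factors: list[str] = field(default_factory=list)
    preventive_actions: list[PreventiveAction] = field(default_factory=list)

    def aspirations(self) -> list[PreventiveAction]:
        """Preventive actions missing an owner, deadline, or verification."""
        return [a for a in self.preventive_actions if a.missing_fields()]


# Hypothetical usage: one complete action, one that is only an aspiration.
pm = Postmortem(
    timeline=[],
    root_cause="example root cause",
    detection_gap_minutes=270.0,
    blast_radius="example blast radius",
    preventive_actions=[
        PreventiveAction("Add output validation", owner="alice",
                         deadline=date(2030, 1, 15),
                         verification="replay test passes"),
        PreventiveAction("Improve drift checks", owner="bob"),
    ],
)
print(pm.aspirations()[0].missing_fields())  # ['deadline', 'verification']
```

A check like `aspirations()` could run before a postmortem is marked closed, so that incomplete actions are caught at review time rather than discovered after the next incident.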

Do This

  • Focus postmortems on systemic prevention, not specific recurrence — prevent the class of failure, not just this instance
  • Assign every preventive action an owner, a deadline, and a verification method — unowned actions do not happen
  • Track detection gap as a primary postmortem metric — the gap determines the blast radius
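As a rough sketch of the detection-gap metric named in the list above: the gap is the time between when the system began operating in a failed state and when the failure was detected. The function name `detection_gap` and the timestamps below are hypothetical, chosen only to illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def detection_gap(failure_started: datetime, detected: datetime) -> timedelta:
    """Time the system ran in a failed state before anyone noticed."""
    if detected < failure_started:
        raise ValueError("detection cannot precede the start of the failure")
    return detected - failure_started


# Hypothetical incident: failure began at 09:00, was detected at 13:30.
gap = detection_gap(
    failure_started=datetime(2024, 1, 1, 9, 0),
    detected=datetime(2024, 1, 1, 13, 30),
)
print(gap)  # 4:30:00
```

Note that the failure start time usually has to be reconstructed during the postmortem from logs or outputs, since the timeline itself begins at detection.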

Avoid This

  • Conduct blame-focused postmortems — blame creates cover-up culture, which increases incident severity
  • Add a monitor for the exact failure that just occurred — this prevents that failure but not the next variation
  • Skip the postmortem when the incident was minor — minor incidents are rehearsals for major ones