RC-401h · Module 3

Rollback and Incident Response for Prompt Failures

5 min read

Every production prompt will eventually fail. The failure may be a model update that silently changes behavior, a schema change in the consuming system that renders the prompt's output format invalid, a context distribution shift that pushes the prompt into edge-case territory it was never tested for, or a deliberate adversarial input that exposes a guard-rail gap. Production readiness is not absence of failures — it is the ability to detect, contain, and resolve failures before they cause irreversible harm.

Prompt incident response follows the same runbook structure as application incident response. Detection comes from the instrumentation layer (Lesson 8): an alert fires when a threshold is breached. Triage identifies the scope of impact: how many invocations are affected, which downstream systems are receiving degraded output, and whether the failure is isolated to a single prompt variant or affecting multiple agents. Containment is the rollback.

Step 1: Detect (Instrumentation Alert) The instrumentation layer from Lesson 8 fires an alert: schema valid rate for forge/prod/system/1.4.0 dropped below the 95% threshold over the last 2 hours. This is the detection event. Record the time, the prompt reference, the metric, the threshold, and the current value. This becomes the incident record header.
Step 2: Triage (Scope of Impact) Query the telemetry store for all invocations of the affected prompt in the alert window. How many calls failed schema validation? What are the common failure patterns? Are any downstream consumers reporting application errors as a result? Which agents in the fleet depend on this prompt? Triage takes 5–15 minutes. Do not skip it to rush to rollback — rolling back the wrong thing will not fix the problem.
Step 3: Contain (Rollback or Feature Flag) Rollback is the primary containment action: pin the affected prompt reference to the last known-good version in the library. The runtime prompt resolver immediately returns the previous version for all new invocations. No code deployment required — the rollback is a library metadata update. If rollback is not available (the previous version also has issues), apply a feature flag to route affected agent calls to a human escalation path until a patched version is ready.
Step 4: Resolve (Root Cause and Regression) Identify the root cause: model update? Input distribution shift? Schema drift? For each root cause, write a new regression test case that would have caught the failure before it reached production. Run the full regression suite against the patched prompt version. Promote through the standard dev → staging → prod pipeline. Do not skip the pipeline because of urgency — a rushed hotfix without validation is the second incident waiting to happen.
Step 5: Document (Incident Postmortem) Write a one-page incident postmortem: timeline, root cause, impact scope, resolution steps, and prevention measures added to the regression suite. The postmortem is attached to the prompt library entry as a historical record. Prompt systems that accumulate postmortems without pattern-matching across them will repeat the same failures. Review postmortems quarterly and update library standards accordingly.