PM-301i · Module 3
The Continuous Improvement Loop
4 min read
Production prompt operations is not a project with a finish line. It is a maintenance discipline — an ongoing cadence of monitoring, evaluation, iteration, and deployment that keeps prompt systems performing reliably as the world around them changes. Teams that treat deployment as a one-time launch accumulate operational debt: drifting prompts, stale evaluations, outdated golden datasets, and monitoring thresholds calibrated to a production environment that no longer exists.
The continuous improvement loop has four phases that repeat indefinitely:
- Monitor: real-time and trend-based quality signals across all production prompts.
- Evaluate: monthly golden dataset re-runs to detect model drift, plus quarterly reviews of golden dataset composition to ensure it still represents production inputs.
- Iterate: prompt updates, evaluation updates, and monitoring threshold recalibrations triggered by the evaluation findings.
- Deploy: changes go through the full deployment pipeline — staging validation, regression test gate, feature flag rollout, A/B test where warranted.
Then the loop restarts.
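The phase sequence can be sketched as a trivial state machine. This is an illustrative sketch only — the phase names come from this lesson, but the function is hypothetical, not part of any real tooling:

```python
# The four phases of the continuous improvement loop, in order.
# After "deploy" the loop restarts at "monitor" (hypothetical helper).
PHASES = ["monitor", "evaluate", "iterate", "deploy"]

def next_phase(current: str) -> str:
    """Advance to the next phase; the loop has no terminal state."""
    i = PHASES.index(current)
    return PHASES[(i + 1) % len(PHASES)]
```

The modulo wrap-around is the point: there is no "done" state, which mirrors the lesson's claim that prompt ops has no finish line.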
The cadence that operationalizes this loop:
- Daily: monitoring review (automated dashboards, escalate alerts).
- Weekly: quality signal review (format compliance trends, correction rate trends, token usage trends).
- Monthly: golden dataset re-run (re-run all active prompts against their golden datasets, flag regressions).
- Quarterly: library audit (prompt ownership, staleness, taxonomy consistency, coverage gaps).
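The monthly re-run's core comparison — this month's golden dataset scores against last month's — can be sketched in a few lines. Everything here is an assumption for illustration: the prompt IDs, the pass-rate scores, and the `flag_regressions` helper are hypothetical, and a real regression threshold should be tuned per prompt:

```python
# Hypothetical sketch of the monthly golden-dataset comparison.
# `previous` and `current` map prompt IDs to pass rates (0.0-1.0)
# from last month's and this month's eval runs.

def flag_regressions(previous: dict, current: dict, tolerance: float = 0.02):
    """Return prompts whose pass rate dropped by more than `tolerance`."""
    flagged = []
    for prompt_id, prev_score in previous.items():
        cur_score = current.get(prompt_id)
        if cur_score is not None and prev_score - cur_score > tolerance:
            flagged.append((prompt_id, prev_score, cur_score))
    return flagged

# Illustrative run: the summarizer dropped from 0.95 to 0.88, so it is
# flagged for root-cause investigation; the router held steady.
regressions = flag_regressions(
    {"summarizer-v3": 0.95, "router-v2": 0.91},
    {"summarizer-v3": 0.88, "router-v2": 0.91},
)
```

Flagging is only the first half of the monthly task; the cadence also requires investigating the root cause of each flagged regression before the next iteration.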
Do This
- Treat prompt ops as ongoing maintenance, not a launch milestone
- Schedule the monthly eval run before it needs to happen, not reactively
- Recalibrate monitoring baselines when the production environment changes significantly
- Close the loop: every monitoring signal that triggered an investigation should result in either an action item or documented confirmation that it was a false positive
Avoid This
- Declare a prompt "done" after the initial deploy
- Run golden dataset evals only when there is a suspected problem
- Let monitoring baselines drift — a baseline that reflects degraded performance will not alert on continued degradation
- Treat the quarterly audit as optional when the team is busy — it is maintenance, not optional enrichment
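The baseline-drift pitfall above has a concrete guard: recalibrate from recent data, but refuse to silently absorb a large drop. The sketch below is an assumption, not a prescribed implementation — the metric (format compliance rate), the `max_drop` guard value, and the function itself are all hypothetical:

```python
from statistics import mean

# Hypothetical sketch of monthly baseline recalibration with a drift
# guard. The new baseline comes from the last 30 days of data, but a
# markedly worse baseline is rejected rather than adopted, because a
# baseline that reflects degraded performance will not alert on
# continued degradation.

def recalibrate_baseline(current_baseline: float,
                         last_30_days: list,
                         max_drop: float = 0.05) -> float:
    """Return an updated baseline, refusing to bake in degradation."""
    candidate = mean(last_30_days)
    if current_baseline - candidate > max_drop:
        raise ValueError(
            "Baseline would absorb a significant drop; "
            "investigate the degradation before recalibrating."
        )
    return candidate
```

Raising instead of returning forces the "close the loop" behavior from the Do This list: the drop becomes an investigation, not a quietly lowered bar.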
prompt_ops_cadence:
  daily:
    - task: "Review monitoring dashboards"
      owner: "on-call rotation"
      duration: "15 minutes"
      scope: "All production prompts"
      actions: "Escalate any Warning alerts that have not resolved. Document Critical alerts in incident log."
    - task: "Review open alert queue"
      owner: "prompt ops lead"
      duration: "10 minutes"
      scope: "All unresolved alerts"
  weekly:
    - task: "Quality trend review"
      owner: "prompt ops team"
      duration: "30 minutes"
      scope: "All production prompts"
      review:
        - "Format compliance rate (7-day trend)"
        - "User correction rate (7-day trend)"
        - "Output token count drift (7-day trend)"
        - "Downstream error rate (7-day trend)"
      output: "Written summary in ops log. Action items if trends are negative."
  monthly:
    - task: "Golden dataset re-run"
      owner: "prompt owners"
      duration: "2-4 hours"
      scope: "All active prompts with golden datasets"
      actions: "Compare results against previous month. Flag regressions. Investigate root cause."
    - task: "Monitoring baseline recalibration"
      owner: "prompt ops lead"
      duration: "1 hour"
      scope: "All alert configurations"
      actions: "Update 30-day baselines. Adjust thresholds where false positive rate is high."
  quarterly:
    - task: "Prompt library health audit"
      owner: "prompt ops team + library owners"
      duration: "1 day"
      scope: "Full prompt library"
      checklist: "See pm-prompt-libraries PM-301g Lesson 9 checklist"
    - task: "Golden dataset refresh review"
      owner: "prompt owners"
      duration: "4 hours"
      scope: "All golden datasets"
      actions: "Verify dataset still represents production input distribution. Update if input patterns have shifted."
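A cadence config like the one above is only useful if every task stays owned and scoped. A minimal completeness check can be sketched in Python — the required keys mirror the config above, but the `validate_cadence` function and its schema are assumptions, not a real tool:

```python
# Hypothetical completeness check for a cadence config loaded as a dict
# (e.g. parsed from the YAML above). Every task should name an owner,
# a duration, and a scope so nothing in the cadence is unowned.
REQUIRED = ("task", "owner", "duration", "scope")

def validate_cadence(cadence: dict) -> list:
    """Return a list of problems; an empty list means the config is complete."""
    problems = []
    for frequency, tasks in cadence.items():
        for entry in tasks:
            for key in REQUIRED:
                if not entry.get(key):
                    name = entry.get("task", "?")
                    problems.append(f"{frequency}: '{name}' missing '{key}'")
    return problems
```

Running this in CI whenever the cadence file changes keeps the quarterly audit honest: a task cannot quietly lose its owner.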