PM-301i · Module 3

The Continuous Improvement Loop

4 min read

Production prompt operations is not a project with a finish line. It is a maintenance discipline: an ongoing cadence of monitoring, evaluation, iteration, and deployment that keeps prompt systems performing reliably as the world around them changes. Teams that treat deployment as the end of the work accumulate operational debt: drifting prompts, stale evaluations, outdated golden datasets, and monitoring thresholds calibrated to a production environment that no longer exists.

The continuous improvement loop has four phases that repeat indefinitely:

  • Monitor: real-time and trend-based quality signals across all production prompts
  • Evaluate: monthly golden dataset re-runs to detect model drift, plus quarterly reviews of golden dataset composition to ensure it still represents production inputs
  • Iterate: prompt updates, evaluation updates, and monitoring threshold recalibrations triggered by the evaluation findings
  • Deploy: changes go through the full deployment pipeline (staging validation, regression test gate, feature flag rollout, A/B test where warranted)

Then the loop restarts.
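
The four-phase loop can be sketched as a repeating cycle; the handler functions below are hypothetical placeholders for real monitoring, evaluation, and deployment tooling:

```python
from itertools import cycle

# Hypothetical phase handlers -- stand-ins for real monitoring,
# evaluation, iteration, and deployment tooling.
def monitor():  return "quality signals collected"
def evaluate(): return "golden dataset re-run complete"
def iterate():  return "prompt and threshold updates drafted"
def deploy():   return "changes shipped via staged rollout"

PHASES = [("monitor", monitor), ("evaluate", evaluate),
          ("iterate", iterate), ("deploy", deploy)]

def run_loop(steps):
    """Cycle through the four phases; bounded here only for the demo."""
    log = []
    for name, handler in cycle(PHASES):
        if len(log) == steps:
            break
        log.append((name, handler()))
    return log

# Two full turns of the loop: the phase order simply repeats.
names = [name for name, _ in run_loop(8)]
# names == ["monitor", "evaluate", "iterate", "deploy"] * 2
```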

The cadence that operationalizes this loop:

  • Daily monitoring review: automated dashboards, escalate alerts
  • Weekly quality signal review: format compliance trends, correction rate trends, token usage trends
  • Monthly golden dataset re-run: re-run all active prompts against their golden datasets, flag regressions
  • Quarterly library audit: prompt ownership, staleness, taxonomy consistency, coverage gaps
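
A minimal sketch of the weekly trend review, assuming each quality signal arrives as a series of daily rates; the signal names, the 2% tolerance, and the higher-is-better assumption are illustrative, not prescribed here:

```python
def trend_delta(daily_values):
    """Slope proxy: mean of the last 7 days minus mean of the prior 7."""
    recent, prior = daily_values[-7:], daily_values[-14:-7]
    return sum(recent) / len(recent) - sum(prior) / len(prior)

def weekly_review(signals, tolerance=0.02):
    """Return action items for signals whose 7-day trend moved down.

    Assumes higher is better for every signal; a correction-rate style
    metric (lower is better) would need the sign inverted.
    """
    action_items = []
    for name, values in signals.items():
        delta = trend_delta(values)
        if delta < -tolerance:
            action_items.append(f"{name}: down {abs(delta):.1%} week over week")
    return action_items

# Illustrative data: format compliance slips, downstream success holds.
signals = {
    "format_compliance_rate": [0.98] * 7 + [0.93] * 7,
    "downstream_success_rate": [0.96] * 14,
}
items = weekly_review(signals)
# items == ["format_compliance_rate: down 5.0% week over week"]
```

The review's written summary and action items then feed the ops log, per the weekly cadence entry below.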

Do This

  • Treat prompt ops as ongoing maintenance, not a launch milestone
  • Schedule the monthly eval run before it needs to happen, not reactively
  • Recalibrate monitoring baselines when the production environment changes significantly
  • Close the loop: every monitoring signal that triggered an investigation should result in either an action item or documented confirmation that it was a false positive

Avoid This

  • Declare a prompt "done" after the initial deploy
  • Run golden dataset evals only when there is a suspected problem
  • Let monitoring baselines drift — a baseline that reflects degraded performance will not alert on continued degradation
  • Treat the quarterly audit as optional when the team is busy — it is maintenance, not optional enrichment
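
The "close the loop" rule from the Do This list can be enforced mechanically; the alert fields and signal names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """One monitoring signal that triggered an investigation."""
    signal: str
    resolution: str = ""  # "action_item" or "false_positive" once closed
    note: str = ""

def open_loops(alerts):
    """Alerts that were investigated but never closed out either way."""
    return [a.signal for a in alerts if not a.resolution]

# Hypothetical investigation log.
log = [
    Alert("format_compliance_drop", "action_item", "tighten output schema"),
    Alert("token_count_spike", "false_positive", "expected traffic mix shift"),
    Alert("correction_rate_rise"),  # investigated, never resolved
]
dangling = open_loops(log)
# dangling == ["correction_rate_rise"]
```

A non-empty `dangling` list at the daily alert-queue review is the signal that the loop was not closed.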

The full cadence, expressed as a reference schedule:

```yaml
prompt_ops_cadence:
  daily:
    - task: "Review monitoring dashboards"
      owner: "on-call rotation"
      duration: "15 minutes"
      scope: "All production prompts"
      actions: "Escalate any Warning alerts that have not resolved. Document Critical alerts in incident log."

    - task: "Review open alert queue"
      owner: "prompt ops lead"
      duration: "10 minutes"
      scope: "All unresolved alerts"

  weekly:
    - task: "Quality trend review"
      owner: "prompt ops team"
      duration: "30 minutes"
      scope: "All production prompts"
      review:
        - "Format compliance rate (7-day trend)"
        - "User correction rate (7-day trend)"
        - "Output token count drift (7-day trend)"
        - "Downstream error rate (7-day trend)"
      output: "Written summary in ops log. Action items if trends are negative."

  monthly:
    - task: "Golden dataset re-run"
      owner: "prompt owners"
      duration: "2-4 hours"
      scope: "All active prompts with golden datasets"
      actions: "Compare results against previous month. Flag regressions. Investigate root cause."

    - task: "Monitoring baseline recalibration"
      owner: "prompt ops lead"
      duration: "1 hour"
      scope: "All alert configurations"
      actions: "Update 30-day baselines. Adjust thresholds where false positive rate is high."

  quarterly:
    - task: "Prompt library health audit"
      owner: "prompt ops team + library owners"
      duration: "1 day"
      scope: "Full prompt library"
      checklist: "See pm-prompt-libraries PM-301g Lesson 9 checklist"

    - task: "Golden dataset refresh review"
      owner: "prompt owners"
      duration: "4 hours"
      scope: "All golden datasets"
      actions: "Verify dataset still represents production input distribution. Update if input patterns have shifted."
```
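
The monthly comparison step ("compare results against previous month, flag regressions") might look like this in practice; the prompt IDs, pass rates, and regression margin are hypothetical:

```python
def flag_regressions(previous, current, margin=0.02):
    """Compare per-prompt golden-dataset pass rates month over month.

    Flags any prompt whose pass rate dropped by more than `margin`;
    per the cadence, each flag should trigger a root-cause investigation.
    """
    flagged = {}
    for prompt_id, prev_rate in previous.items():
        curr_rate = current.get(prompt_id)
        if curr_rate is not None and prev_rate - curr_rate > margin:
            flagged[prompt_id] = (prev_rate, curr_rate)
    return flagged

# Hypothetical pass rates from two consecutive monthly re-runs.
last_month = {"summarizer-v3": 0.97, "router-v1": 0.99, "extractor-v2": 0.95}
this_month = {"summarizer-v3": 0.96, "router-v1": 0.91, "extractor-v2": 0.95}

regressions = flag_regressions(last_month, this_month)
# regressions == {"router-v1": (0.99, 0.91)}; summarizer's one-point
# dip stays within the margin.
```

Choosing the margin is itself a recalibration decision: too tight and the monthly review drowns in noise, too loose and slow drift goes unflagged.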