PM-201c · Module 2
Regression Testing
3 min read
A regression is a change that degrades performance on a task the previous version handled correctly. In software, regressions are caught by test suites run before deployment. In prompt systems, regressions are caught by running the updated prompt against the golden dataset before promoting it to production. The test suite is the golden dataset. The test runner is whoever made the change. The gate is the approval threshold.
- 1. Run before any promotion No prompt version moves from staging to production without running against the golden dataset. This applies to MAJOR, MINOR, and PATCH changes — even "trivial" wording changes can produce unexpected output changes on edge-case inputs.
- 2. Compare against the previous version Run the previous version and the new version against the same dataset. Compare results on every sample. Note any samples where the new version performs worse than the previous, even if both pass the approval threshold. Regression in one area may be masked by improvement in another.
- 3. Document the results Record the test results in the changelog: how many samples passed, any failures and their classification, and whether the threshold was met. This is the evidence that the change was tested and what it found.
- 4. Automate where possible For format compliance and schema validation checks, automate the evaluation — a script can verify JSON schema compliance, word count, required field presence, and other objective criteria without human review. Reserve human review for quality dimensions that require judgment.
Do This
- Run regression tests for every version change before production promotion
- Compare new version against previous version on the same golden dataset
- Automate objective criteria (format, schema, length) — human-review subjective quality
- Treat a failed regression test as a stop sign, not a suggestion
Avoid This
- Do not promote prompt changes based on one or two spot-checks
- Do not rely on production users to discover regressions
- Do not skip regression testing for PATCH changes — "trivial" changes cause regressions
- Do not accept a regression on one dimension because a different dimension improved