PM-201c · Module 2

Regression Testing

A regression is a change that degrades performance on a task the previous version handled correctly. In software, regressions are caught by test suites run before deployment. In prompt systems, regressions are caught by running the updated prompt against the golden dataset before promoting it to production. The test suite is the golden dataset. The test runner is whoever made the change. The gate is the approval threshold.

1. Run before any promotion. No prompt version moves from staging to production without running against the golden dataset. This applies to MAJOR, MINOR, and PATCH changes: even "trivial" wording changes can produce unexpected output changes on edge-case inputs.
2. Compare against the previous version. Run the previous version and the new version against the same dataset and compare results on every sample. Note any sample where the new version performs worse than the previous one, even if both versions pass the approval threshold, because a regression in one area may be masked by improvement in another.
3. Document the results. Record the test results in the changelog: how many samples passed, any failures and their classification, and whether the threshold was met. This record is the evidence that the change was tested and of what the testing found.
4. Automate where possible. For format compliance and schema validation checks, automate the evaluation: a script can verify JSON schema compliance, word count, required field presence, and other objective criteria without human review. Reserve human review for quality dimensions that require judgment.
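Steps 2 and 4 can be sketched together in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the field names in REQUIRED_FIELDS, the MAX_WORDS limit, the sample IDs, and the sample outputs are all hypothetical, and the check covers only valid JSON, required fields, and word count rather than a full JSON Schema validation.

```python
import json

# Hypothetical objective criteria; substitute the fields and limits your prompt defines.
REQUIRED_FIELDS = ("summary", "category")
MAX_WORDS = 50


def check_output(raw: str) -> list[str]:
    """Return the objective failures for one model output (empty list means pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]
    failures = [f"missing_field:{name}" for name in REQUIRED_FIELDS if name not in data]
    summary = data.get("summary", "")
    if isinstance(summary, str) and len(summary.split()) > MAX_WORDS:
        failures.append("too_long")
    return failures


def find_regressions(previous: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Return IDs of samples the previous version passed and the candidate fails."""
    regressed = []
    for sample_id, old_raw in previous.items():
        old_ok = not check_output(old_raw)
        # A sample missing from the candidate run counts as a failure.
        new_ok = not check_output(candidate.get(sample_id, ""))
        if old_ok and not new_ok:
            regressed.append(sample_id)
    return regressed


# Outputs keyed by golden-dataset sample ID (illustrative data only).
previous = {
    "s1": '{"summary": "Refund issued", "category": "billing"}',
    "s2": '{"summary": "Password reset", "category": "account"}',
}
candidate = {
    "s1": '{"summary": "Refund issued", "category": "billing"}',
    "s2": '{"summary": "Password reset"}',
}

print(find_regressions(previous, candidate))  # ['s2']
```

The comparison is per sample rather than an aggregate pass rate, so a sample that got worse is reported even when the overall run still clears the approval threshold.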

Do This

Run regression tests for every version change before production promotion
Compare new version against previous version on the same golden dataset
Automate objective criteria (format, schema, length); reserve human review for subjective quality
Treat a failed regression test as a stop sign, not a suggestion

Avoid This

Do not promote prompt changes based on one or two spot-checks
Do not rely on production users to discover regressions
Do not skip regression testing for PATCH changes; even "trivial" changes can cause regressions
Do not accept a regression on one dimension because a different dimension improved
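One way to make the gate a stop sign rather than a suggestion is to encode both conditions in the promotion decision: the approval threshold must be met and no dimension may score lower than it did in the previous version. The sketch below assumes each version has a numeric score per quality dimension; the dimension names, scores, version string, and threshold are hypothetical, as is the changelog line format.

```python
from dataclasses import dataclass


@dataclass
class RegressionReport:
    version: str
    total: int
    passed: int
    threshold: float
    regressed_dimensions: list[str]

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

    @property
    def promote(self) -> bool:
        # Both conditions must hold: threshold met AND no dimension got worse.
        return self.pass_rate >= self.threshold and not self.regressed_dimensions


def regressed_dimensions(previous: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Dimensions where the candidate scores lower than the previous version."""
    return sorted(d for d, old in previous.items() if candidate.get(d, 0.0) < old)


def changelog_entry(report: RegressionReport) -> str:
    """One-line record of the run: counts, threshold, regressions, decision."""
    status = "PROMOTE" if report.promote else "BLOCKED"
    regressed = ", ".join(report.regressed_dimensions) or "none"
    return (
        f"{report.version}: {report.passed}/{report.total} passed "
        f"(threshold {report.threshold:.0%}); regressed: {regressed}; {status}"
    )


# Illustrative scores only: accuracy improved, tone got worse.
previous_scores = {"format": 1.0, "accuracy": 0.90, "tone": 0.85}
candidate_scores = {"format": 1.0, "accuracy": 0.95, "tone": 0.80}

report = RegressionReport(
    version="1.4.1",
    total=40,
    passed=38,
    threshold=0.9,
    regressed_dimensions=regressed_dimensions(previous_scores, candidate_scores),
)

print(changelog_entry(report))
# 1.4.1: 38/40 passed (threshold 90%); regressed: tone; BLOCKED
```

In this example the candidate clears the threshold and improves accuracy, yet it is still blocked because tone regressed, which is the behavior the last rule above asks for.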