PM-201c · Module 3
A/B Testing Prompts
4 min read
The engineering discipline of prompt improvement is not "I think this version is better." It is "here is the experiment design, here is the success metric, here is the data." A/B testing applies to prompts exactly as it applies to any other system component: define a control, define a variant, split traffic between them, measure against a pre-specified success metric, and let the data determine which version wins.
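The mechanical core of this setup is a stable traffic split. A minimal sketch in Python (the function name and hashing scheme are illustrative, not part of this course): hashing the request or user ID, rather than rolling a random number per request, keeps the same user on the same arm for the life of the experiment.

```python
import hashlib

def assign_arm(request_id: str, variant_fraction: float = 0.2) -> str:
    """Deterministically assign a request to control or variant.

    Hashing the ID (instead of random.random()) means repeated calls
    with the same ID always land on the same arm, so a user's
    experience stays consistent across the experiment.
    """
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "variant" if bucket < variant_fraction else "control"
```

To move from a 90/10 to an 80/20 split, only `variant_fraction` changes; existing variant users stay on the variant, and a stable slice of control users migrates over.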
- **1. Define the success metric.** Before starting the experiment, define what "better" means in a measurable way. Format compliance rate? User satisfaction score? Error rate reduction? Output length reduction with equivalent quality? A success metric defined after seeing results is not a success metric; it is a post-hoc justification.
- **2. Define the traffic split.** What percentage of production traffic runs on the control (current version) and what percentage on the variant (new version)? Start conservative: 90/10 or 80/20. If early results look problematic, cut the variant traffic. Expand the variant only when early data is positive.
- **3. Run for sufficient volume.** How many samples do you need before results are statistically meaningful? For binary outcomes (pass/fail), 200 samples per variant is a reasonable minimum for detecting a 10-percentage-point difference with 95% confidence. Lower-frequency failures require more volume. Do not stop the experiment early because early results look good.
- **4. Evaluate and decide.** At the end of the experiment period, compare control and variant on the pre-specified success metric. If the variant meets the success threshold, promote it. If not, investigate why and iterate. Document the experiment and its results regardless of outcome; negative results are as valuable as positive ones.
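For a binary pass/fail metric, the evaluate step reduces to a standard two-proportion z-test. A sketch under that assumption (function names are illustrative; this is textbook statistics, not something specific to prompt work):

```python
import math

def two_proportion_z(control_pass: int, control_n: int,
                     variant_pass: int, variant_n: int) -> float:
    """z statistic for the difference in pass rates (variant minus control)."""
    p1 = control_pass / control_n
    p2 = variant_pass / variant_n
    pooled = (control_pass + variant_pass) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    return (p2 - p1) / se

def variant_wins(control_pass: int, control_n: int,
                 variant_pass: int, variant_n: int,
                 z_threshold: float = 1.96) -> bool:
    """True if the variant's pass rate is higher at ~95% confidence."""
    z = two_proportion_z(control_pass, control_n, variant_pass, variant_n)
    return z > z_threshold
```

With 200 samples per arm, an 80% control pass rate against a 90% variant pass rate clears the 1.96 threshold, which is roughly the scenario the volume guideline in step 3 is sized for.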
```
A/B Test: sales-followup-discovery v3.0 vs v3.1
Start date:    2026-03-01
End date:      2026-03-15
Traffic split: 80% control (v3.0), 20% variant (v3.1)

Hypothesis:
  Adding explicit tone constraints (no contractions, no enthusiasm inflation)
  will improve client satisfaction scores and reduce revision requests.

Success metric:
  Primary:   Client satisfaction score (CSAT) ≥ 4.0/5.0 on variant
             (vs. current 3.7/5.0 baseline on control)
  Secondary: Revision request rate ≤ 8% on variant
             (vs. current 12% baseline on control)

Disqualifying condition:
  Format compliance < 95% on variant → immediate rollback

Results (at 14 days, 312 variant samples, 1,248 control samples):
  Variant CSAT:          4.2/5.0  ✓
  Variant revision rate: 7.4%     ✓
  Format compliance:     98.4%    ✓

Decision: Promote v3.1 to 100% production traffic.
Effective date: 2026-03-16

Result logged in changelog: v3.1 promoted after A/B test. See test-20260301.
```
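Because every threshold in a record like this is pre-specified, the final decision can be mechanical. A sketch assuming the three metrics above (the function name and the "promote / iterate / rollback" labels are illustrative):

```python
def decide(csat: float, revision_rate: float, format_compliance: float) -> str:
    """Apply the pre-specified thresholds from the experiment record.

    The disqualifying condition is checked first: a format-compliance
    regression forces rollback regardless of the primary metric.
    """
    if format_compliance < 0.95:
        return "rollback"
    if csat >= 4.0 and revision_rate <= 0.08:
        return "promote"
    return "iterate"
```

Feeding in the logged numbers (CSAT 4.2, revision rate 7.4%, compliance 98.4%) returns "promote", matching the recorded decision; a compliance reading below 95% would return "rollback" even with a winning CSAT.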