PM-301h · Module 3

Statistical Significance in Prompt Testing

4 min read

Prompt A passes 91% of the golden dataset. Prompt B passes 94%. Is B better? Maybe. It depends on the size of the dataset and the variance of the outputs. With a 50-case dataset, the difference between 91% and 94% is 1.5 cases — effectively one or two cases. That difference is not statistically significant. You cannot conclude B is better from that data.

This matters because teams routinely make promotion decisions based on pass-rate differences that are within the noise floor. Prompt A fails on 4.5 cases per 50; Prompt B fails on 3. The improvement is real in the sample but may not reflect real-world performance, because the sample is too small. At 300 cases, 91% vs 94% is a 9-case gap — suggestive, but still short of statistical significance at 95% confidence; reliably confirming a 3-point improvement takes a substantially larger dataset. At 50 cases, it is noise.

Practical guidance: evaluate statistical significance using a two-proportion z-test. The null hypothesis is that both prompts have the same underlying pass rate. The question is whether the observed difference is large enough, given the dataset size, to reject that null hypothesis at a 95% confidence threshold.
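The test itself is short enough to run by hand. A minimal Python sketch using only the standard library (the pass counts below are illustrative, not from a real evaluation):

```python
import math

def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Two-proportion z-test for a difference in pass rates.

    Returns (z, p_value) for the two-sided test of H0: both prompts
    have the same underlying pass rate.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)        # pooled rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    return z, p_value

# Prompt A: 273/300 passes (91%), Prompt B: 282/300 passes (94%)
z, p = two_proportion_z_test(273, 300, 282, 300)
print(f"z = {z:.2f}, p = {p:.3f}")                  # z ~ 1.4, p ~ 0.16
print("significant at 95%" if p < 0.05 else "not significant at 95%")
```

Note that even at 300 cases per variant, a 91% vs 94% gap does not clear the 95% threshold.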

MINIMUM DATASET SIZES FOR RELIABLE COMPARISON

To detect a 5% improvement (e.g., 85% → 90%) with 80% power at 95% confidence:
  Required cases: ~700 per variant

To detect a 10% improvement (e.g., 80% → 90%) with 80% power at 95% confidence:
  Required cases: ~200 per variant

To detect a 3% improvement (e.g., 87% → 90%) with 80% power at 95% confidence:
  Required cases: ~1,800 per variant
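As a sanity check, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two independent proportions (two-sided 95% confidence, 80% power, quantiles hard-coded). A paired design, where both prompts run on the same cases, can often get away with fewer:

```python
import math

# Standard normal quantiles for two-sided 95% confidence and 80% power.
Z_ALPHA = 1.959964   # z for alpha/2 = 0.025
Z_BETA = 0.841621    # z for beta = 0.20 (80% power)

def cases_per_variant(p1: float, p2: float) -> int:
    """Approximate cases per variant needed to detect a pass-rate change
    from p1 to p2 with a two-proportion z-test on independent samples."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    effect = (p2 - p1) ** 2
    return math.ceil((Z_ALPHA + Z_BETA) ** 2 * variance / effect)

for p1, p2 in [(0.85, 0.90), (0.80, 0.90), (0.87, 0.90)]:
    print(f"{p1:.0%} -> {p2:.0%}: ~{cases_per_variant(p1, p2)} cases per variant")
```

The smaller the improvement you want to detect, the faster the required dataset size grows — it scales with the inverse square of the effect size.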

PRACTICAL THRESHOLDS FOR PM TEAMS (simplified)

Dataset size | Minimum meaningful difference
   50 cases  | ≥ 10% (5 cases)
  100 cases  | ≥ 7%  (7 cases)
  200 cases  | ≥ 5%  (10 cases)
  500 cases  | ≥ 3%  (15 cases)

DECISION RULE
If the observed improvement falls below the minimum meaningful difference
for your dataset size: do not make a promotion decision based on that data.
Run more cases, or accept that the difference cannot be confirmed.
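The decision rule can be encoded directly from the simplified thresholds table. A sketch — the `can_promote` helper is hypothetical, and it conservatively rounds dataset size down to the nearest table bucket:

```python
def can_promote(n_cases: int, observed_diff: float) -> bool:
    """Return True only if the observed pass-rate improvement clears the
    minimum meaningful difference for this dataset size."""
    if n_cases < 50:
        return False  # too small to support any promotion decision
    # (bucket size, minimum meaningful difference), largest bucket first
    for bucket, min_diff in [(500, 0.03), (200, 0.05), (100, 0.07), (50, 0.10)]:
        if n_cases >= bucket:
            return observed_diff >= min_diff
    return False

print(can_promote(50, 0.03))    # False: 3% on 50 cases is noise
print(can_promote(500, 0.03))   # True: 3% on 500 cases clears the bar
```

Rounding down to the nearest bucket means an in-between size like 300 cases is held to the 200-case threshold, which errs on the side of not promoting.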

WITHIN-CATEGORY SIGNIFICANCE
Report significance within each category (standard, edge, adversarial)
separately. A 10% improvement in standard cases with no change in edge
cases is a different decision than a 10% improvement across all categories.
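A sketch of per-category reporting, repeating the z-test inline so the snippet stands alone; the category results below are hypothetical:

```python
import math

def z_and_p(pass_a: int, pass_b: int, n: int):
    """Two-proportion z-test with the same number of cases per variant."""
    pooled = (pass_a + pass_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (pass_b - pass_a) / (n * se)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-category results: (cases, prompt A passes, prompt B passes)
results = {
    "standard":    (200, 170, 190),   # 85% -> 95%
    "edge":        (60, 45, 45),      # no change
    "adversarial": (40, 28, 30),      # 70% -> 75%
}
for cat, (n, a, b) in results.items():
    z, p = z_and_p(a, b, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{cat}: {a/n:.0%} -> {b/n:.0%} ({verdict}, p = {p:.3f})")
```

In this example only the standard category clears significance — exactly the situation where a single aggregate pass rate would overstate the improvement.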

NOTE: These are approximations for non-statisticians. For high-stakes
prompt changes (medical, legal, financial), consult a statistician and
use the full z-test calculation.