PM-301i · Module 3
A/B Testing in Production
An A/B test for prompts is a controlled experiment: a percentage of production traffic receives prompt version A (the control), the rest receives prompt version B (the treatment), and both variants are evaluated against the same success metrics over a sufficient duration to reach statistical significance. The experiment concludes when you have data that meets the statistical threshold, not when you have opinions about which version "feels better."
A/B testing answers questions that regression testing cannot. Regression testing tells you whether the new prompt passes the golden dataset. A/B testing tells you whether the new prompt performs better with real users on real tasks in production conditions. These are different questions. A prompt can pass regression testing and still underperform in an A/B test, because the golden dataset does not fully represent production inputs, or because the eval's success metric does not fully capture what makes outputs useful to users.
Traffic allocation in a prompt A/B test uses the same feature flag infrastructure as graduated rollouts. The flag routes X% of traffic to variant B, the remainder stays on variant A. This allocation must be sticky: the same user (or the same session, or the same request origin) should consistently receive the same variant. Non-sticky allocation confounds the experiment because users may receive different variants across a single workflow, making attribution of success or failure to either variant impossible.
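Sticky allocation is usually implemented by hashing a stable identifier rather than storing per-user assignments, so the same user lands in the same bucket on every request. A minimal sketch in Python (the function name and the experiment-name salt are illustrative, not from any particular feature-flag library):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int) -> str:
    """Deterministically bucket a user into variant A or B.

    Hashing experiment:user_id means the same user always gets the same
    variant for a given experiment, and different experiments bucket
    independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in 0..99
    return "B" if bucket < treatment_pct else "A"
```

Salting the hash with the experiment name matters: without it, the same users would land in the treatment bucket of every experiment, correlating your tests.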
A disciplined prompt A/B test proceeds in five steps:
- 1. Define the Hypothesis. Write the hypothesis before running the experiment: "We believe that [change to prompt] will [increase/decrease] [metric] by [magnitude] because [reason]." A hypothesis without a magnitude is not falsifiable; "we think it will be better" is not a hypothesis.
- 2. Define Success Metrics and Minimum Effect Size. Which metric will decide the experiment? What minimum improvement constitutes a meaningful win? Decide both before the experiment runs. Metrics chosen after seeing results are p-hacking.
- 3. Calculate the Required Sample Size. From the baseline metric value, the minimum detectable effect, and the desired statistical power, calculate how many requests each variant needs before the experiment can conclude. Do not stop early.
- 4. Run the Experiment. Allocate traffic using sticky feature flags and collect data. Do not peek at interim results with the intent of stopping early if they look good: peeking and early stopping inflate the false-positive rate.
- 5. Analyze and Decide. When the required sample size is reached, analyze the data. Did variant B outperform A on the primary metric? Was the difference statistically significant? Did any secondary metric regress? The decision follows from the data.
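For a pass/fail (proportion) metric, step 3's calculation can be sketched with the standard two-proportion normal approximation. The baseline and target rates below are illustrative, not from the text:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(p_baseline: float, p_target: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Requests needed per variant to detect p_baseline -> p_target."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = z.inv_cdf(power)            # desired statistical power
    p_bar = (p_baseline + p_target) / 2  # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_baseline * (1 - p_baseline)
                                 + p_target * (1 - p_target))) ** 2
    return ceil(numerator / (p_target - p_baseline) ** 2)

# Detecting a lift from a 30% task-success rate to 33% requires a few
# thousand requests per variant -- small effects are expensive to detect.
n = required_sample_size(0.30, 0.33)
```

The formula makes the trade-off in step 2 concrete: halving the minimum detectable effect roughly quadruples the required sample size.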
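For the same kind of metric, the step-5 significance check commonly reduces to a two-proportion z-test. A minimal sketch using the pooled-variance form (the counts in the usage note are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both variants perform equally."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, 300/1000 successes on variant A versus 360/1000 on variant B gives z ≈ 2.85 and p ≈ 0.004: significant at α = 0.05, so the data supports shipping B, provided no secondary metric regressed.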