Marketing Analytics
RGM° · Training
A/B Test Statistics
Non-negotiable literacy. Hypothesis testing, p-values, sample size, sequential testing, CUPED, multiple comparisons, SRM, interpretation.
Why this literacy matters
Most marketing teams run A/B tests. Few do them with statistical rigor. The result: false positives shipped as wins, true wins dismissed as noise, and decisions made on misleading data.
The good news: the math is learnable. Marketers don't need to be statisticians. They need enough literacy to interpret tool outputs and recognize when methodology is failing them.
Hypothesis testing fundamentals
- Null hypothesis (H0): Variant performs same as control.
- Alternative hypothesis (H1): Variant differs.
- Test statistic: Quantifies the difference.
- P-value: Probability of observing this difference (or more extreme) under H0.
- Significance level (α): Threshold for rejecting H0; conventionally 0.05.
- Power (1-β): Probability of detecting a true effect of given size; conventionally 0.80.
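The terms above can be tied together in a minimal sketch of a two-proportion z-test. The function name and the conversion counts are illustrative assumptions, not output from any particular platform, and real tools may use a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the common rate pools both arms.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se  # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value


# Hypothetical data: control 500/10,000, variant 570/10,000.
alpha = 0.05
z, p_value = two_proportion_z_test(500, 10_000, 570, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0 at alpha={alpha}: {p_value < alpha}")
```

With these made-up counts z is about 2.2 and p is about 0.03, so H0 would be rejected at α = 0.05.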
P-values
- P < 0.05 means: If H0 were true, a difference at least this extreme would occur less than 5% of the time.
- P-value isn't: Probability that H0 is true. Probability the variant is "real." Effect size.
- P = 0.04 doesn't mean strong effect. Just that data is unlikely under H0.
- Common misuse: Treating p < 0.05 as proof of business significance; ignoring effect size.
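A small sketch of why a p-value is not an effect size: the same test applied to a tiny lift on a huge sample and a large lift on a small sample. The helper name and all counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 5.00% vs 5.05% (1% relative lift) with 10 million users per arm.
p_tiny_effect = two_sided_p_value(500_000, 10_000_000, 505_000, 10_000_000)

# 5% vs 6% (20% relative lift) with 1,000 users per arm.
p_large_effect = two_sided_p_value(50, 1_000, 60, 1_000)

print(f"tiny lift, huge sample:   p = {p_tiny_effect:.2e}")
print(f"large lift, small sample: p = {p_large_effect:.2f}")
```

The tiny lift comes out highly significant and the large lift does not, which is why effect size and sample size have to be read alongside p.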
Sample size and power
Two-proportion test sample size formula:
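$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2}$$
where n is the sample size per variant, p1 is the baseline conversion rate, and p2 is the conversion rate you want to detect.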
Practical guidance
- Figures assume two-sided α = 0.05 and 80% power, using the formula above.
- 5% baseline CVR, detecting 10% relative lift: ~31,000 per variant.
- 2% baseline CVR, detecting 20% relative lift: ~21,000 per variant.
- 1% baseline CVR, detecting 10% relative lift: ~163,000 per variant.
- Tools: Evan Miller's sample size calculator; statsig/Optimizely calculators.
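The two-proportion formula can be sketched directly in code. This is a minimal version assuming a two-sided α and equal allocation; the function name is illustrative, and calculators such as Evan Miller's use slightly different variance terms, so their numbers will differ by a few percent.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-variant sample size from the two-proportion formula (two-sided alpha)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


n_5pct = sample_size_per_variant(0.05, 0.10)
n_2pct = sample_size_per_variant(0.02, 0.20)
n_1pct = sample_size_per_variant(0.01, 0.10)
for label, n in [("5% / +10%", n_5pct), ("2% / +20%", n_2pct), ("1% / +10%", n_1pct)]:
    print(f"{label}: {n:,} per variant")
```

Note how halving the relative lift roughly quadruples the sample, because the difference is squared in the denominator.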
Minimum detectable effect (MDE)
- Given sample size and baseline, what's the smallest effect you can reliably detect?
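- If MDE is 25% and you expect a 10% lift, the test is underpowered: a real 10% lift will usually fail to reach significance, and any "win" you do see is likely noise or an exaggerated estimate.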
- Calculate MDE before launching tests.
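A rough sketch of the MDE calculation, inverting the sample size formula. It assumes both arms have approximately the baseline variance, which is a simplification; the function name and the traffic figure are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with the given sample."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Approximation: treat both arms as having the baseline variance.
    absolute = (z_alpha + z_beta) * sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return absolute / baseline


# Hypothetical test: 5% baseline CVR, 5,000 users per variant.
mde_small_test = relative_mde(0.05, 5_000)
print(f"Relative MDE: {mde_small_test:.1%}")
```

With 5,000 users per variant at a 5% baseline, the MDE is roughly 24%, so a test expecting a 10% lift should not launch at that size.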
Sequential testing and peeking
- The peeking problem: Checking results daily and stopping when significant inflates false positives.
- Why: Random fluctuations look significant temporarily; stopping then captures noise.
- Sequential testing methods: Always-valid p-values (Optimizely Stats Engine), group sequential, mSPRT (Statsig, Eppo), Bayesian sequential.
- Practical: Either commit to a fixed-horizon test and analyze once at the planned sample size, or use a sequential-valid method. Never peek and stop early with a fixed-horizon frequentist test.
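The peeking problem can be illustrated with a small A/A simulation. This is an idealized sketch: each look adds one equally sized batch whose standardized difference is drawn from a standard normal under H0, so the cumulative z after k looks is the running sum divided by sqrt(k). The function name, number of looks, and seed are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_tests=2000, n_looks=20, alpha=0.05, seed=7):
    """Simulate A/A tests; return (rate if stopping at any significant look,
    rate if analyzing only once at the final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one batch under H0
            z = running_sum / sqrt(look)        # cumulative z-statistic
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit           # final look only
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = false_positive_rates()
print(f"Peek-and-stop false positive rate: {peeking_rate:.1%}")
print(f"Fixed-horizon false positive rate: {fixed_rate:.1%}")
```

There is no true effect in any simulated test, yet stopping at the first significant look flags far more than 5% of them, while the single final analysis stays near the nominal rate.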
Variance reduction (CUPED)
- CUPED: Controlled-experiment Using Pre-Experiment Data.
- Method: Use pre-experiment metrics as covariates to reduce variance.
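- Effect: Often a 20–40% reduction in required sample size. The gain depends on how strongly the pre-experiment covariate correlates with the metric: variance shrinks by a factor of 1 − ρ², where ρ is that correlation.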
- Particularly useful for: High-variance metrics (revenue per visitor, time on site).
- Implementation: Modern experimentation platforms (Statsig, Eppo) include CUPED.
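A minimal sketch of the CUPED adjustment on simulated data. The adjusted metric is y − θ(x − mean(x)) with θ = cov(x, y) / var(x), where x is the pre-experiment value of the metric. The data-generating numbers below are made up, and in a real test θ would typically be estimated on both arms pooled; platforms may implement this differently.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Simulated users: pre-period spend partly predicts in-experiment spend.
rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.5 * x + rng.gauss(50, 15) for x in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"Variance reduction: {reduction:.0%}")
print(f"Mean before/after: {mean(post):.2f} / {mean(adjusted):.2f}")
```

The adjustment leaves the mean unchanged and removes the share of variance explained by the pre-period covariate, here roughly 30% by construction.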
Multiple comparisons
- Running 20 tests at α=0.05 when no variant has a real effect: expect ~1 false positive by chance, with roughly a 64% chance of at least one.
- Bonferroni correction: Divide α by number of comparisons. Conservative.
- Benjamini-Hochberg (FDR): Less conservative; controls expected proportion of false discoveries.
- A/B/n tests: Apply a correction across all variant-vs-control comparisons.
- Subgroup analysis: Treat as exploratory unless pre-registered.
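Both corrections are short enough to sketch from their definitions. The p-values below are hypothetical; the function names are illustrative rather than taken from any library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for p-values at or below alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from five comparisons.
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni rejects:", bonf)
print("BH (FDR) rejects:  ", bh)

# Chance of at least one false positive across 20 independent null tests.
family_wise_error = 1 - (1 - 0.05) ** 20
print(f"P(at least one false positive in 20 tests): {family_wise_error:.0%}")
```

On this example Bonferroni keeps one result and Benjamini-Hochberg keeps three, which shows the trade-off between strict error control and sensitivity.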
SRM and validity checks
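- SRM (Sample Ratio Mismatch): Actual variant ratio differs significantly from expected (e.g., 48/52 when 50/50 designed). Whether a given split counts as a mismatch depends on sample size, so test it (typically with a chi-square test) rather than eyeballing it.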
- Signals: Randomization issue, tracking issue, audience filter problem.
- Always check before trusting results. Modern platforms automate this.
- Pre-experiment A/A tests: Verify no spurious effects in control vs control.
- Distribution comparisons: Treatment and control should have similar pre-test metrics.
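An SRM check for a two-arm test can be sketched as a chi-square goodness-of-fit test with one degree of freedom. The function name and user counts are hypothetical, and the alert threshold is a team choice rather than a fixed rule.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# The same 48/52 split at two sample sizes.
p_small = srm_p_value(480, 520)
p_large = srm_p_value(48_000, 52_000)
print(f"48/52 on 1,000 users:   p = {p_small:.3f}")
print(f"48/52 on 100,000 users: p = {p_large:.2e}")
```

The same 48/52 ratio is unremarkable on 1,000 users but is overwhelming evidence of a broken split on 100,000.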
Interpretation and communication
- Report effect size, not just significance. "5% lift, 95% CI 2–8%" better than "significant."
- Distinguish statistical and business significance. A statistically significant 0.5% lift may not be worth shipping.
- Confidence intervals over point estimates.
- Acknowledge uncertainty. A test is a sample; the true effect can differ from the estimate, and it may not hold for other audiences or time periods.
- Communicate methodology. Stakeholders should know what was tested and how.
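A minimal sketch of reporting an effect with its confidence interval, using a normal-approximation (Wald) interval for the absolute difference in conversion rates. The function name and counts are hypothetical; platforms may use other interval methods.

```python
from math import sqrt
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute lift (variant minus control) with a normal-approximation CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error: each arm keeps its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical data: control 2,000/40,000, variant 2,200/40,000.
diff, ci_low, ci_high = lift_with_ci(2_000, 40_000, 2_200, 40_000)
print(f"Lift: {diff * 100:.2f} pp (95% CI {ci_low * 100:.2f} to {ci_high * 100:.2f} pp)")
```

The interval here runs from about 0.2 to 0.8 percentage points. Reporting it makes clear that the lift is probably positive but could be less than half the point estimate.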
Advanced playbook
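- Modern sequential-valid platforms. Statsig, Eppo, Optimizely Stats Engine keep false-positive rates controlled under continuous monitoring, which removes the peeking problem when their stopping rules are followed.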
- CUPED for high-variance metrics. Revenue per visitor, time on site benefit most.
- SRM automation. Block test results if SRM detected; investigate first.
- Power calculations pre-launch. Don't launch underpowered.
- Effect size and CI in every report. Not just "significant".
- Pre-registration discipline. Primary metric, MDE, duration, analysis approach documented before launch.
- Statistical literacy training. Marketing team understands p-values, CI, sample size.
- Multiple comparison handling. Document approach (Bonferroni, FDR, or pre-registered single primary).
- Heterogeneous treatment effects. Use conditional average treatment effect (CATE) methodology for subgroups, not naive per-subgroup p-values.
- External statistical review. High-stakes tests reviewed by data scientist or statistician.
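Pre-registration can be as light as an immutable record written before launch. The sketch below is one hypothetical way to capture it; the class, field names, and values are all assumptions for illustration, not a platform schema.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: fields cannot be changed after creation
class PreRegistration:
    primary_metric: str
    mde_relative: float
    alpha: float
    power: float
    duration_days: int
    analysis_method: str
    multiple_comparison_policy: str


# Hypothetical plan, recorded before the test starts.
plan = PreRegistration(
    primary_metric="checkout_conversion_rate",
    mde_relative=0.10,
    alpha=0.05,
    power=0.80,
    duration_days=28,
    analysis_method="fixed-horizon two-proportion z-test",
    multiple_comparison_policy="single pre-registered primary metric",
)


def try_to_change_metric(registration):
    """Return True if the primary metric could be swapped after the fact."""
    try:
        registration.primary_metric = "revenue_per_visitor"
    except FrozenInstanceError:
        return False
    return True


print("Primary metric changed post hoc:", try_to_change_metric(plan))
```

Freezing the record is only a nudge. The real control is that the plan is stored somewhere reviewers can compare against the final report.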
Common mistakes
- Peeking and stopping; false positives.
- P-value treated as probability variant is real.
- Significance reported without effect size.
- Sample size not calculated pre-launch.
- SRM ignored; randomization issues unnoticed.
- Multiple comparisons without correction.
- Post-hoc subgroup analysis as primary finding.
- Confidence intervals omitted.
- Statistical significance confused with business significance.
- CUPED ignored on high-variance metrics; tests need larger samples and longer runtimes than necessary.
- Sequential-valid methodology not used; frequentist peeking.
- Pre-registration skipped; post-hoc rationalization.
Operating checklist
- Sequential-valid testing methodology if peeking enabled
- Sample size calculated pre-launch
- MDE documented before test launch
- SRM check automated
- CUPED on high-variance metrics
- Multiple comparison correction documented
- Pre-registration: primary metric, MDE, duration, analysis
- Effect size and CI in reports
- Distinction between statistical and business significance
- Subgroup analyses flagged as exploratory
- Statistical literacy training for stakeholders
- Annual methodology review
Sources and further reading
- Ron Kohavi, Diane Tang, and Ya Xu, "Trustworthy Online Controlled Experiments": the textbook
- Microsoft Experimentation Platform research
- Booking.com data science publications
- Statsig and Eppo blog — methodology articles
- Optimizely Stats Engine documentation
- Andrew Gelman blog
- Evan Miller — sample size calculators
- Frank Harrell — statistical methodology
- CXL Institute statistics for CRO courses
- Microsoft Experimentation Platform CUPED paper
- RGM CRO Experimentation statistical-methods-and-sample-size module
- Reforge analytics curriculum
Part of the Marketing Analytics series.