---
title: A/B Test Statistics — RGM Training
url: https://realgrowthmatters.com/training/marketing-analytics/ab-test-statistics/
updated: 2026-06-10
source_html: https://realgrowthmatters.com/training/marketing-analytics/ab-test-statistics/
---

# A/B Test Statistics

Non-negotiable literacy. Hypothesis testing, p-values, sample size, sequential testing, CUPED, multiple comparisons, SRM, interpretation.

### What you will learn

1. [Why A/B test statistics literacy is non-negotiable](#why)
2. [Hypothesis testing fundamentals](#fundamentals)
3. [P-values: what they are and aren't](#p-values)
4. [Sample size and power](#sample-size)
5. [Sequential testing and peeking](#sequential)
6. [Variance reduction (CUPED)](#variance)
7. [Multiple comparisons](#multiple)
8. [SRM and validity checks](#srm)
9. [Interpretation and communication](#interpretation)
10. [Advanced playbook](#advanced)
11. [Common mistakes](#mistakes)
12. [Operating checklist](#checklist)

## Why this literacy matters

Most marketing teams run A/B tests. Few do them with statistical rigor. The result: false positives shipped as wins, true wins dismissed as noise, and decisions made on misleading data.

The good news: the math is learnable. Marketers don't need to be statisticians. They need enough literacy to interpret tool outputs and recognize when methodology is failing them.

## Hypothesis testing fundamentals

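- **Null hypothesis (H0):** The variant performs the same as the control.
- **Alternative hypothesis (H1):** The variant performs differently from the control.
- **Test statistic:** Quantifies the observed difference relative to its sampling noise (for example, a z-score).
- **P-value:** Probability of observing a difference at least as extreme as the one measured, assuming H0 is true.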
- **Significance level (α):** Threshold for rejecting H0; conventionally 0.05.
- **Power (1-β):** Probability of detecting a true effect of given size; conventionally 0.80.
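
To make these terms concrete, here is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library. The function name and the conversion counts are illustrative assumptions, not part of any platform's API.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Uses the pooled standard error, which is the usual choice under H0
    (both arms share one true conversion rate).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                            # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p_value


# Hypothetical data: control converts 500/10,000, variant 560/10,000.
alpha = 0.05
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}, reject H0: {p < alpha}")
```

With these made-up counts the variant shows a 12% relative lift, yet the p-value lands slightly above 0.05, so H0 is not rejected at the conventional threshold.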

## P-values

- **P < 0.05 means:** If H0 were true, a difference at least this extreme would occur less than 5% of the time.
- **P-value isn't:** The probability that H0 is true, the probability that the variant's effect is "real," or a measure of effect size.
- **P = 0.04 doesn't mean a strong effect.** It only means the data would be unlikely under H0.
- **Common misuse:** Treating p < 0.05 as proof of business significance; ignoring effect size.
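
The gap between a small p-value and a large effect can be illustrated with two hypothetical tests. The sketch below reuses a plain two-proportion z-test; the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, huge sample: 5.00% vs 5.05% (a 1% relative lift).
tiny_effect_p = p_value(100_000, 2_000_000, 101_000, 2_000_000)

# Large effect, small sample: 5.0% vs 6.5% (a 30% relative lift).
large_effect_p = p_value(50, 1_000, 65, 1_000)

print(f"1% relative lift, 2M users per arm: p = {tiny_effect_p:.3f}")
print(f"30% relative lift, 1k users per arm: p = {large_effect_p:.3f}")
```

In this example the 1% lift clears p < 0.05 while the 30% lift does not, which is why the p-value has to be read alongside the effect size and sample size.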

## Sample size and power

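Sample size per variant for a two-proportion test, where $p_1$ is the baseline conversion rate, $p_2$ is the rate you want to detect, α is the significance level, and 1-β is the power:

$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2}$$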
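
The formula above translates directly into a few lines of Python. This is a sketch under the stated assumptions (two-sided test, equal allocation, normal approximation); the function and parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)            # z_{1-beta}
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


n = sample_size_per_variant(baseline=0.05, relative_lift=0.10)
print(f"5% baseline, 10% relative lift: {n:,} users per variant")
```

Halving the detectable lift roughly quadruples the required sample, because the difference enters the denominator squared.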

### Practical guidance

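At α = 0.05 (two-sided) and 80% power, the formula above gives approximately:

- 5% baseline CVR, detecting 10% relative lift: ~31,000 per variant.
- 2% baseline CVR, detecting 20% relative lift: ~21,000 per variant.
- 1% baseline CVR, detecting 10% relative lift: ~163,000 per variant.
- Tools: Evan Miller's sample size calculator; Statsig and Optimizely calculators. Different tools can give slightly different numbers depending on the variance approximation they use.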

### Minimum detectable effect (MDE)

- Given sample size and baseline, what's the smallest effect you can reliably detect?
- If the MDE is 25% and you expect a 10% lift, the test is underpowered: a real 10% lift will most likely be indistinguishable from noise.
- Calculate MDE before launching tests.
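
A rough MDE can be computed by inverting the sample size formula. The sketch below assumes both arms have approximately the baseline variance, which is a common simplification and slightly understates the MDE for large lifts; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Approximate relative MDE for a two-sided two-proportion test.

    Assumes both arms have variance baseline * (1 - baseline).
    """
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute_mde = z_total * sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return absolute_mde / baseline


for n in (5_000, 30_000, 150_000):
    print(f"{n:>7,} per variant -> MDE of about {relative_mde(0.05, n):.1%} relative")
```

At a 5% baseline, 5,000 users per variant yields an MDE of roughly 24% relative, so a test expecting a 10% lift would need a far larger sample.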

## Sequential testing and peeking

- **The peeking problem:** Checking results daily and stopping when significant inflates false positives.
- **Why:** Random fluctuations look significant temporarily; stopping then captures noise.
- **Sequential testing methods:** Always-valid p-values (Optimizely Stats Engine), group sequential, mSPRT (Statsig, Eppo), Bayesian sequential.
- **Practical:** Either commit to a fixed-horizon test and analyze once at the planned end, or use a sequential-valid method. Never peek and stop early in a fixed-horizon frequentist test.
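
The inflation from peeking can be seen in a small A/A simulation. This sketch makes a simplifying assumption: each day contributes an equal-sized batch, so under H0 the standardized daily difference is a standard normal draw and the cumulative z-score after k days is the running sum divided by √k. All names and parameters are illustrative.

```python
import random
from math import sqrt


def false_positive_rates(n_experiments=2000, n_looks=20, z_crit=1.96, seed=7):
    """Simulate A/A tests (no true effect) and compare two stopping rules.

    Returns (rate when stopping at the first significant daily look,
             rate when testing once at the final look).
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        crossed_at_any_look = False
        final_z = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)   # one day's standardized difference
            final_z = cumulative / sqrt(look)   # z-score on all data so far
            if abs(final_z) > z_crit:
                crossed_at_any_look = True      # a peeker would have stopped here
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(final_z) > z_crit
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"Stop at first significant look: {peeking_rate:.1%} false positives")
print(f"Single look at the end:         {fixed_rate:.1%} false positives")
```

Even though there is no real effect in any simulated test, checking at every one of 20 looks declares a winner several times more often than the nominal 5%.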

## Variance reduction (CUPED)

- **CUPED:** Controlled-experiment Using Pre-Experiment Data.
- **Method:** Use pre-experiment metrics as covariates to reduce variance.
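- **Effect:** 20–40% sample size reduction for many metrics. The variance reduction equals the squared correlation between the pre-experiment covariate and the in-experiment metric.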
- **Particularly useful for:** High-variance metrics (revenue per visitor, time on site).
- **Implementation:** Modern experimentation platforms (Statsig, Eppo) include CUPED.
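
The core CUPED adjustment is short enough to sketch by hand. The example below uses simulated per-user revenue (hypothetical data) and the standard adjustment y − θ(x − x̄) with θ = cov(x, y) / var(x). In a real test θ would typically be estimated on pooled data from both arms, and x must be measured before exposure so the treatment cannot affect it.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted values: y_i - theta * (x_i - mean(x))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Simulated data: in-experiment revenue is correlated with pre-experiment revenue.
rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(5000)]      # pre-experiment revenue per user
post = [0.6 * x + rng.gauss(40, 16) for x in pre]    # in-experiment revenue per user

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"Mean before: {mean(post):.2f}, after: {mean(adjusted):.2f}")
print(f"Variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean unchanged and removes the share of variance explained by the pre-experiment covariate, which is what shrinks the required sample size.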

## Multiple comparisons

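- Running 20 independent tests at α=0.05 when no variant has a real effect: expect ~1 false positive by chance, and a roughly 64% chance (1 − 0.95²⁰) of at least one.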
- **Bonferroni correction:** Divide α by number of comparisons. Conservative.
- **Benjamini-Hochberg (FDR):** Less conservative; controls expected proportion of false discoveries.
- **A/B/n tests:** Each variant-vs-control comparison counts as a separate test; apply a correction.
- **Subgroup analysis:** Treat as exploratory unless pre-registered.
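
Both corrections are simple to apply by hand. The sketch below implements Bonferroni and the Benjamini-Hochberg step-up procedure on a hypothetical list of p-values; the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


# Hypothetical p-values from five variant-vs-control comparisons.
p_values = [0.001, 0.012, 0.014, 0.03, 0.20]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these values Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps four of the five, which shows the trade-off between strictness and discoveries.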

## SRM and validity checks

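- **SRM (Sample Ratio Mismatch):** Actual variant ratio differs significantly from expected (e.g., 48/52 when 50/50 was designed). Whether a given split counts as a mismatch depends on sample size: 48/52 is unremarkable at 1,000 users and a clear red flag at 100,000.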
- **Signals:** Randomization issue, tracking issue, audience filter problem.
- **Always check before trusting results.** Modern platforms automate this.
- **Pre-experiment A/A tests:** Verify no spurious effects in control vs control.
- **Distribution comparisons:** Treatment and control should have similar pre-test metrics.
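
An SRM check is a chi-square goodness-of-fit test on the assignment counts. The sketch below handles the two-arm case with the standard library, using the identity that the chi-square tail with one degree of freedom equals erfc(√(χ²/2)). The 0.001 alert threshold is an assumption chosen for illustration; use whatever your platform or team standard specifies.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm sample ratio check."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total * (1 - expected_control_share)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    return erfc(sqrt(chi2 / 2))   # chi-square survival function, 1 degree of freedom


SRM_THRESHOLD = 0.001  # assumed alert threshold, stricter than 0.05 on purpose

for control, treatment in [(480, 520), (48_000, 52_000)]:
    p = srm_p_value(control, treatment)
    status = "SRM detected, investigate" if p < SRM_THRESHOLD else "ok"
    print(f"{control:,} vs {treatment:,}: p = {p:.3g} ({status})")
```

The same 48/52 split passes at 1,000 users and fails decisively at 100,000, so the check has to be run on counts rather than judged from percentages.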

## Interpretation and communication

- **Report effect size, not just significance.** "5% lift, 95% CI 2–8%" is better than "significant."
- **Distinguish statistical and business significance.** A statistically significant 0.5% lift may not be worth shipping.
- **Confidence intervals over point estimates.**
- **Acknowledge uncertainty.** Tests are samples; truth is broader.
- **Communicate methodology.** Stakeholders should know what was tested and how.
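
A report line with effect size and interval can be produced from raw counts. This sketch computes a normal-approximation (Wald) confidence interval for the absolute difference in conversion rates; the counts and function name are hypothetical, and other interval methods may suit small samples better.

```python
from math import sqrt
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute difference in conversion rate with a Wald confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error: each arm keeps its own variance estimate.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical result: control 1,500/30,000 (5.0%), variant 1,650/30,000 (5.5%).
diff, lower, upper = lift_with_ci(1_500, 30_000, 1_650, 30_000)
print(f"Lift: {diff * 100:+.2f} pp (95% CI {lower * 100:+.2f} to {upper * 100:+.2f} pp)")
```

Reporting the interval shows stakeholders both the best estimate and how small the true lift could plausibly be, which is the number that matters for a ship decision.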

## Advanced playbook

- **Modern sequential-valid platforms.** Statsig, Eppo, and Optimizely Stats Engine are designed to keep the false-positive rate controlled when results are monitored continuously.
- **CUPED for high-variance metrics.** Revenue per visitor, time on site benefit most.
- **SRM automation.** Block test results if SRM detected; investigate first.
- **Power calculations pre-launch.** Don't launch underpowered.
- **Effect size and CI in every report.** Not just "significant".
- **Pre-registration discipline.** Primary metric, MDE, duration, analysis approach documented before launch.
- **Statistical literacy training.** Marketing team understands p-values, CI, sample size.
- **Multiple comparison handling.** Document approach (Bonferroni, FDR, or pre-registered single primary).
- **Heterogeneous treatment effects.** Use conditional average treatment effect (CATE) methodology for subgroups, not naive per-subgroup p-values.
- **External statistical review.** High-stakes tests reviewed by data scientist or statistician.

## Common mistakes

- Peeking and stopping; false positives.
- P-value treated as probability variant is real.
- Significance reported without effect size.
- Sample size not calculated pre-launch.
- SRM ignored; randomization issues unnoticed.
- Multiple comparisons without correction.
- Post-hoc subgroup analysis as primary finding.
- Confidence intervals omitted.
- Statistical significance confused with business significance.
- CUPED ignored on high-variance metrics; larger samples and longer tests required than necessary.
- Sequential-valid methodology not used; frequentist peeking.
- Pre-registration skipped; post-hoc rationalization.

## Operating checklist

- Sequential-valid testing methodology if peeking enabled
- Sample size calculated pre-launch
- MDE documented before test launch
- SRM check automated
- CUPED on high-variance metrics
- Multiple comparison correction documented
- Pre-registration: primary metric, MDE, duration, analysis
- Effect size and CI in reports
- Distinction between statistical and business significance
- Subgroup analyses flagged as exploratory
- Statistical literacy training for stakeholders
- Annual methodology review

## Sources and further reading

- Ron Kohavi, "Trustworthy Online Controlled Experiments" — the textbook
- Microsoft Experimentation Platform research
- Booking.com data science publications
- Statsig and Eppo blog — methodology articles
- Optimizely Stats Engine documentation
- Andrew Gelman blog
- Evan Miller — sample size calculators
- Frank Harrell — statistical methodology
- CXL Institute statistics for CRO courses
- Microsoft Experimentation Platform CUPED paper
- RGM CRO Experimentation statistical-methods-and-sample-size module
- Reforge analytics curriculum

---

Part of the [Marketing Analytics](../index.html) series.
