A/B Test Statistics

Non-negotiable literacy. Hypothesis testing, p-values, sample size, sequential testing, CUPED, multiple comparisons, SRM, interpretation.

What you will learn

Why A/B test statistics literacy is non-negotiable
Hypothesis testing fundamentals
P-values: what they are and aren't
Sample size and power
Sequential testing and peeking
Variance reduction (CUPED)
Multiple comparisons
SRM and validity checks
Interpretation and communication
Advanced playbook
Common mistakes
Operating checklist

Why this literacy matters

Most marketing teams run A/B tests. Few do them with statistical rigor. The result: false positives shipped as wins, true wins dismissed as noise, and decisions made on misleading data.

The good news: the math is learnable. Marketers don't need to be statisticians. They need enough literacy to interpret tool outputs and recognize when methodology is failing them.

Hypothesis testing fundamentals

Null hypothesis (H0): Variant performs the same as control.
Alternative hypothesis (H1): Variant differs from control.
Test statistic: Quantifies the difference.
P-value: Probability of observing this difference (or more extreme) under H0.
Significance level (α): Threshold for rejecting H0; conventionally 0.05.
Power (1-β): Probability of detecting a true effect of given size; conventionally 0.80.
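
As a rough illustration of how these pieces fit together, the sketch below runs a pooled two-proportion z-test using only the Python standard library. The function name and the conversion counts are made up for illustration, and experimentation platforms may use a different test statistic.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    Returns (z, p_value). Hypothetical helper, for illustration only.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both arms share one conversion rate, so pool the data.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se  # test statistic
    # Two-sided p-value: chance of |Z| >= |z| if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


alpha = 0.05  # significance level
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

With these invented counts (5.0% vs 5.7% conversion) the p-value falls below 0.05, so H0 would be rejected at the conventional threshold.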

P-values

P < 0.05 means: If H0 were true, data at least this extreme would occur less than 5% of the time.
P-value isn't: Probability that H0 is true. Probability the variant's effect is "real." Effect size.
P = 0.04 doesn't mean a strong effect. Just that data this extreme would be unlikely under H0.
Common misuse: Treating p < 0.05 as proof of business significance; ignoring effect size.
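
To make the gap between p-value and effect size concrete, here is a small sketch with two invented tests that both clear p < 0.05 while their lifts differ by more than an order of magnitude. The helper name and all counts are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def p_value_and_lift(conv_a, n_a, conv_b, n_b):
    """Return (two-sided p-value, relative lift) for a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value, (p_b - p_a) / p_a


# Small test, large effect: 10.0% vs 12.0% on 2,000 users per arm.
small = p_value_and_lift(200, 2_000, 240, 2_000)
# Huge test, tiny effect: 10.0% vs 10.065% on 2,000,000 users per arm.
large = p_value_and_lift(200_000, 2_000_000, 201_300, 2_000_000)

for name, (p, lift) in [("small test", small), ("huge test", large)]:
    print(f"{name}: p = {p:.3f}, relative lift = {lift:.2%}")
```

Both results are "significant", yet one is a roughly 20% relative lift and the other is under 1%. The p-value alone cannot tell them apart.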

Sample size and power

Two-proportion test sample size formula:

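$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_1 - p_2)^2}$$

Here n is the sample size per variant, p₁ is the baseline conversion rate, p₂ is the conversion rate you want to detect, and z_{1-α/2} and z_{1-β} are standard normal quantiles (1.96 and 0.84 for α = 0.05 two-sided and 80% power).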

Practical guidance

Figures assume α = 0.05 (two-sided) and 80% power, using the formula above.
5% baseline CVR, detecting 10% relative lift: ~31,000 per variant.
2% baseline CVR, detecting 20% relative lift: ~21,000 per variant.
1% baseline CVR, detecting 10% relative lift: ~163,000 per variant.
Tools: Evan Miller's sample size calculator; statsig/Optimizely calculators.
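
The formula above can be sketched in a few lines of standard-library Python. The function name is hypothetical, and the defaults (two-sided α = 0.05, 80% power) are assumptions. Calculators that use a pooled variance or a continuity correction will differ by a few percent, so treat the output as close to, not identical with, any tool's figures.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


for baseline, lift in [(0.05, 0.10), (0.02, 0.20), (0.01, 0.10)]:
    n = sample_size_per_variant(baseline, lift)
    print(f"baseline {baseline:.0%}, lift {lift:.0%}: {n:,} per variant")
```

Note how halving the lift you want to detect roughly quadruples the required sample, because the difference is squared in the denominator.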

Minimum detectable effect (MDE)

Given sample size and baseline, what's the smallest effect you can reliably detect?
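If MDE is 25% and you expect a 10% lift, the test is underpowered: a real 10% lift will usually go undetected, and any significant result is likely noise or an exaggerated estimate.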
Calculate MDE before launching tests.
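
One way to get an MDE before launch is to invert the sample size formula numerically. The sketch below does this by bisection on the relative lift. The function names are hypothetical and the defaults (two-sided α = 0.05, 80% power) are assumptions.

```python
from statistics import NormalDist


def required_n(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2


def minimum_detectable_lift(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the available traffic.

    Bisection works because required_n shrinks as the lift grows.
    """
    lo, hi = 1e-6, (1 - baseline) / baseline  # upper bound keeps p2 <= 1
    for _ in range(100):
        mid = (lo + hi) / 2
        p2 = baseline * (1 + mid)
        if required_n(baseline, p2, alpha, power) > n_per_variant:
            lo = mid  # lift too small to detect with this traffic
        else:
            hi = mid
    return hi


mde = minimum_detectable_lift(baseline=0.05, n_per_variant=5_000)
print(f"MDE: {mde:.1%} relative lift")
```

With a 5% baseline and 5,000 users per variant, the MDE comes out at roughly 26% relative lift, so a test expecting a 10% lift would need far more traffic.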

Sequential testing and peeking

The peeking problem: Checking results daily and stopping when significant inflates false positives.
Why: Random fluctuations look significant temporarily; stopping then captures noise.
Sequential testing methods: Always-valid p-values (Optimizely Stats Engine), group sequential, mSPRT (Statsig, Eppo), Bayesian sequential.
Practical: Either commit to a fixed-horizon test and analyze only at the planned sample size, or use a sequential-valid method. Never peek and stop early in a fixed-horizon frequentist test.
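
The inflation from peeking is easy to see in simulation. The sketch below runs A/A tests (no true effect) and compares "stop the first time any look is significant" with "look once at the end". The number of tests, looks, users per look, conversion rate, and seed are all arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Pooled two-sided z-test with n users in each arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(n_tests=500, looks=10, users_per_look=200, rate=0.10, seed=7):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _look in range(looks):
            # Both arms draw from the same conversion rate: H0 is true.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            ever_significant = ever_significant or significant_now
        peeking_hits += ever_significant  # would have stopped at some look
        fixed_hits += significant_now     # only the final look counts
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_fpr, fixed_fpr = simulate_aa_tests()
print(f"false positive rate with peeking: {peeking_fpr:.1%}")
print(f"false positive rate, fixed horizon: {fixed_fpr:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out several times higher even though nothing was changed between the arms.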

Variance reduction (CUPED)

CUPED: Controlled-experiment Using Pre-Experiment Data.
Method: Use pre-experiment metrics as covariates to reduce variance.
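Effect: 20–40% sample size reduction for many metrics. The gain depends on how strongly the pre-experiment covariate correlates with the metric: variance shrinks by a factor of 1 − ρ², where ρ is that correlation.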
Particularly useful for: High-variance metrics (revenue per visitor, time on site).
Implementation: Modern experimentation platforms (Statsig, Eppo) include CUPED.
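
For intuition, a minimal sketch of the CUPED adjustment on synthetic data follows. The data-generating numbers are invented and deliberately give a strong pre/post correlation, so the reduction shown is larger than the typical range quoted above. In a real experiment the coefficient theta would usually be estimated on data pooled across both arms and the same adjustment applied to each.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted values y - theta * (x - mean(x)).

    y: in-experiment metric per user; x: the same user's pre-experiment metric.
    theta = cov(x, y) / var(x) minimises the variance of the adjusted metric.
    """
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Synthetic users whose in-experiment spend tracks their past spend.
rng = random.Random(42)
pre = [rng.gauss(50, 20) for _ in range(5_000)]
post = [0.8 * x + rng.gauss(10, 10) for x in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean of the metric unchanged and only removes the part of its variance that the pre-experiment covariate already explains.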

Multiple comparisons

Running 20 independent tests at α=0.05 with no true effects: expect ~1 false positive by chance, and a ~64% chance of at least one.
Bonferroni correction: Divide α by number of comparisons. Conservative.
Benjamini-Hochberg (FDR): Less conservative; controls expected proportion of false discoveries.
A/B/n tests: Apply correction.
Subgroup analysis: Treat as exploratory unless pre-registered.
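
Both corrections are short enough to sketch directly. The functions below are illustrative implementations, and the list of p-values is invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Chance of at least one false positive across 20 independent true-null tests.
print(f"P(any false positive in 20 tests) = {1 - 0.95 ** 20:.0%}")

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

On this invented list, five p-values fall below 0.05 uncorrected, Bonferroni keeps one, and Benjamini-Hochberg keeps two.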

SRM and validity checks

SRM (Sample Ratio Mismatch): Actual variant ratio differs significantly from expected (e.g., 48/52 on a large sample when 50/50 was designed; small samples can show that split by chance).
Signals: Randomization issue, tracking issue, audience filter problem.
Always check before trusting results. Modern platforms automate this.
Pre-experiment A/A tests: Verify no spurious effects in control vs control.
Distribution comparisons: Treatment and control should have similar pre-test metrics.
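
An SRM check is commonly run as a chi-square goodness-of-fit test on the assignment counts. The sketch below is one such check for a two-arm test. The function name is hypothetical, and the 0.001 alert threshold is an assumed convention rather than a fixed rule.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # Survival function of a chi-square with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level; choose your own

for control, treatment in [(480, 520), (48_000, 52_000)]:
    p = srm_p_value(control, treatment)
    print(f"{control:,} / {treatment:,}: p = {p:.3g}, SRM flagged: {p < SRM_THRESHOLD}")
```

The same 48/52 split is unremarkable at 1,000 users but overwhelming evidence of a mismatch at 100,000, which is why the check has to be a statistical test rather than an eyeball comparison of percentages.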

Interpretation and communication

Report effect size, not just significance. "5% lift, 95% CI 2–8%" better than "significant."
Distinguish statistical and business significance. A statistically significant 0.5% lift may not be worth shipping.
Confidence intervals over point estimates.
Acknowledge uncertainty. A test observes a sample of users over a limited window; the true long-run effect may differ from the estimate.
Communicate methodology. Stakeholders should know what was tested and how.
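
A report line with effect size and interval can be produced with a simple Wald confidence interval on the difference in conversion rates. The sketch below is one common approximation, not the only valid interval. The function name and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute difference in conversion rate with a Wald confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error: used for estimation rather than testing.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


diff, low, high = lift_with_ci(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"absolute lift: {diff:.2%} (95% CI {low:.2%} to {high:.2%})")
```

Here the interval excludes zero, but it runs from a lift that is barely above nothing to one nearly twice the point estimate. That range is what stakeholders need in order to judge business significance.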

Advanced playbook

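Modern sequential-valid platforms. Statsig, Eppo, Optimizely Stats Engine keep false positive rates controlled under continuous monitoring, so peeking no longer inflates them when the sequential methods are used as designed.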
CUPED for high-variance metrics. Revenue per visitor, time on site benefit most.
SRM automation. Block test results if SRM detected; investigate first.
Power calculations pre-launch. Don't launch underpowered.
Effect size and CI in every report. Not just "significant".
Pre-registration discipline. Primary metric, MDE, duration, analysis approach documented before launch.
Statistical literacy training. Marketing team understands p-values, CI, sample size.
Multiple comparison handling. Document approach (Bonferroni, FDR, or pre-registered single primary).
Heterogeneous treatment effects. Conditional average treatment effect (CATE) methodology for subgroups; not naive per-subgroup p-values.
External statistical review. High-stakes tests reviewed by data scientist or statistician.

Common mistakes

Peeking and stopping; false positives.
P-value treated as probability variant is real.
Significance reported without effect size.
Sample size not calculated pre-launch.
SRM ignored; randomization issues unnoticed.
Multiple comparisons without correction.
Post-hoc subgroup analysis as primary finding.
Confidence intervals omitted.
Statistical significance confused with business significance.
CUPED ignored on high-variance metrics; larger samples and longer tests required than necessary.
Sequential-valid methodology not used; frequentist peeking.
Pre-registration skipped; post-hoc rationalization.

Operating checklist

Sequential-valid testing methodology if peeking enabled
Sample size calculated pre-launch
MDE documented before test launch
SRM check automated
CUPED on high-variance metrics
Multiple comparison correction documented
Pre-registration: primary metric, MDE, duration, analysis
Effect size and CI in reports
Distinction between statistical and business significance
Subgroup analyses flagged as exploratory
Statistical literacy training for stakeholders
Annual methodology review

Sources and further reading

Ron Kohavi, "Trustworthy Online Controlled Experiments" — the textbook
Microsoft Experimentation Platform research
Booking.com data science publications
Statsig and Eppo blog — methodology articles
Optimizely Stats Engine documentation
Andrew Gelman blog
Evan Miller — sample size calculators
Frank Harrell — statistical methodology
CXL Institute statistics for CRO courses
Microsoft Experimentation Platform CUPED paper
RGM CRO Experimentation statistical-methods-and-sample-size module
Reforge analytics curriculum

Part of the Marketing Analytics series.