A/B Power Calculator

Sample size, MDE, and runtime with CUPED/CUPAC variance reduction, multi-arm corrections, sequential monitoring, and design effects.

Use this tool to plan online experiments.

These are normal-approximation calculations (Wald test). They work well for large-scale online experiments, but be cautious with:

  • Very low conversion rates (< 0.5%) — consider exact tests
  • Heavy-tailed metrics (revenue, session duration) — consider bootstrap or robust methods
  • Sequential monitoring — the boundaries below are approximate; use a proper spending function library for production
  • CUPED/CUPAC — the variance-reduction % is an estimate; actual reduction depends on covariate predictive power

Reference

CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by regressing out pre-experiment covariate information. If the covariate explains ρ² of the outcome variance, effective variance drops by a factor of (1 − ρ²), which is equivalent to multiplying your sample size by 1/(1 − ρ²).

Common variance-reduction magnitudes:

  Covariate                           Typical ρ²   VR (%)
  Pre-period of same metric (1 wk)    0.15–0.30    15–30%
  Pre-period of same metric (4 wk)    0.25–0.50    25–50%
  CUPAC (ML model of outcome)         0.30–0.60    30–60%
  Multiple covariates (MLRATE)        0.40–0.70    40–70%

Tip: run a pre-experiment analysis regressing Y on your covariates to estimate ρ² before committing to a VR assumption.
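The CUPED adjustment and the resulting variance reduction can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data (the covariate strength 0.6 is an arbitrary choice, not a recommendation); the achieved reduction equals the squared sample correlation between covariate and outcome, matching the (1 − ρ²) factor above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-period covariate X and outcome Y (hypothetical data):
# the covariate explains roughly 0.36 / 1.36 ~ 26% of Var(Y)
n = 10_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

# CUPED: theta = Cov(X, Y) / Var(X), then subtract theta * (X - mean(X))
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

# Achieved variance reduction equals the squared sample correlation
rho2 = np.corrcoef(x, y)[0, 1] ** 2
vr_estimated = 1 - np.var(y_cuped, ddof=1) / np.var(y, ddof=1)

# Equivalent sample-size factor: n_required scales by (1 - rho2)
sample_size_factor = 1 - vr_estimated
```

Note that `y_cuped` has the same mean as `y`, so treatment-effect estimates are unchanged while their standard errors shrink by √(1 − ρ²).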

For cluster-randomized experiments (e.g., randomizing at the market, store, or page level), the effective sample size is reduced by the design effect:

\[\text{DEFF} = 1 + (m - 1)\rho\]

where m = average cluster size and ρ = intraclass correlation (ICC). The SE multiplier is √DEFF.

  Scenario                      Typical ICC   m     SE mult
  Users within geo-markets      0.001–0.01    500   1.2–2.4
  Sessions within users         0.05–0.15     10    1.2–1.5
  Students within classrooms    0.10–0.25     25    1.8–2.6
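As a quick check of the formula above, a small helper (hypothetical function names) computes DEFF and the SE multiplier for the geo-market row of the table:

```python
import math

def design_effect(m: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * rho for average cluster size m and ICC rho."""
    return 1.0 + (m - 1.0) * icc

def se_multiplier(m: float, icc: float) -> float:
    """Standard-error inflation from cluster randomization: sqrt(DEFF)."""
    return math.sqrt(design_effect(m, icc))

# Users within geo-markets: m = 500 at the low end of the ICC range (0.001)
deff = design_effect(500, 0.001)   # 1 + 499 * 0.001 = 1.499
mult = se_multiplier(500, 0.001)   # sqrt(1.499) ~ 1.22
```

Even a tiny ICC matters at large cluster sizes: with m = 500, an ICC of 0.001 already inflates the SE by about 22%.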

When testing k treatment arms against a single control, the family-wise error rate (FWER) inflates. Common corrections:

  • Bonferroni: α* = α/k. Conservative but simple.
  • Šidák: α* = 1 − (1 − α)^(1/k). Slightly less conservative; assumes independence.
  • Dunnett: exact correction for many-to-one comparisons (approximated here). Accounts for correlation from shared control.

In practice, Dunnett is preferred for the many-to-one comparison structure typical in A/B/n tests.
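The Bonferroni and Šidák adjustments are simple enough to spell out directly (the Dunnett critical value requires numerical integration over the correlated test statistics, so it is omitted here). A minimal sketch with hypothetical helper names:

```python
def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-comparison level: alpha / k."""
    return alpha / k

def sidak_alpha(alpha: float, k: int) -> float:
    """Per-comparison level: 1 - (1 - alpha)^(1/k); assumes independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)

# Three treatment arms vs one control at family-wise alpha = 0.05
a_bonf = bonferroni_alpha(0.05, 3)   # ~0.0167
a_sidak = sidak_alpha(0.05, 3)       # ~0.0170, slightly less conservative
```

The gap between the two is small at typical α and k; Dunnett's correction is looser still because it exploits the positive correlation induced by the shared control arm.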

Cohen’s d (continuous metrics): 0.2 = small, 0.5 = medium, 0.8 = large.

Cohen’s h (proportions): uses the arcsine transformation, h = 2 arcsin(√p₁) − 2 arcsin(√p₀). Same thresholds as d.

In online experiments, effects are typically small (d < 0.1). A 10% relative lift on a 10% conversion rate gives h ≈ 0.03 — firmly in “small” territory, which is why large sample sizes are needed.
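The arcsine-based effect size is a one-liner; this sketch reproduces the worked example above (a 10% relative lift on a 10% baseline, i.e. p₀ = 0.10 → p₁ = 0.11):

```python
import math

def cohens_h(p1: float, p0: float) -> float:
    """Cohen's h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p0))."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p0))

# 10% relative lift on a 10% baseline
h = cohens_h(0.11, 0.10)  # ~0.033, well below the 0.2 "small" threshold
```

The arcsine transformation stabilizes variance across the [0, 1] range, which is why h uses the same 0.2 / 0.5 / 0.8 thresholds as d.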

O’Brien–Fleming (OBF): conservative early boundaries (very hard to reject early), aggressive later. Nominal α at the final look is close to the unadjusted level. Preferred when you don’t expect to stop early but want the option.

Pocock: constant boundaries across looks. Easier to reject early but requires a higher bar at the final look. Preferred when early stopping is a realistic goal.

Both are implemented here as approximations. For production sequential designs, use a proper alpha-spending function (Lan–DeMets) via packages like gsDesign (R), sequential (R), or statsmodels (Python).
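In the same approximate spirit, the classic O'Brien–Fleming boundary shape can be sketched as z_k = z_final · √(K/k) over K equally spaced looks. The scaling gives only the *shape*; the final-look constant below (~2.04 for K = 5, two-sided α = 0.05) is the commonly tabulated value and should be taken from a group-sequential package such as gsDesign for real designs.

```python
import math

def obf_boundaries(z_final: float, num_looks: int) -> list[float]:
    """O'Brien-Fleming boundary shape: z_k = z_final * sqrt(K / k).

    z_final should come from a group-sequential table or gsDesign;
    this scaling only reproduces the conservative-early pattern."""
    K = num_looks
    return [z_final * math.sqrt(K / k) for k in range(1, K + 1)]

def pocock_boundaries(c: float, num_looks: int) -> list[float]:
    """Pocock: one constant critical value at every look (~2.41 for K = 5)."""
    return [c] * num_looks

bounds = obf_boundaries(2.04, 5)
# The first look demands |z| > 4.5; the final look (~2.04) is near 1.96
```

The contrast is visible immediately: OBF starts near |z| > 4.5 and relaxes toward the unadjusted level, while Pocock holds one elevated bar throughout, which is why it costs more power at the final look.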