A/B Test Sample Size Calculator Formula

Estimate how many users you need in each variant before launching a conversion rate experiment. Adjust baseline conversion, minimum detectable effect, confidence level, and power to calculate the required sample size for a two-proportion A/B test.

  • Baseline conversion rate: current conversion rate for your control page, email, or checkout flow.
  • Minimum detectable effect: relative uplift you want to detect. Example: 10 means detecting a lift from 10.0% to 11.0%.
  • Confidence level: higher confidence reduces false positives but increases required traffic.
  • Statistical power: higher power reduces false negatives and requires a larger sample.
  • Test type: two-sided is standard when either variant could win or lose.
  • Traffic split: uneven splits are useful when reducing risk to a challenger variant.
  • Daily traffic: used to estimate the run time of the experiment.
  • Rounding: rounding up is the safer choice in experiment planning.
This calculator uses a standard two-sample proportion test approximation for A/B tests. Results are planning estimates, not a substitute for a full experimental design review.

How the A/B test sample size calculator formula works

An A/B test sample size calculator answers one practical question before you launch an experiment: how many users do you need in each variant to reliably detect a meaningful difference? If you start a test with too little traffic, even a genuinely better variation can look inconclusive. If you wait for far more users than necessary, you delay decisions, waste opportunity, and may expose customers to an underperforming experience longer than needed.

For most website, product, email, and checkout experiments, the sample size problem is based on comparing two conversion rates. The control group has an expected conversion rate, often called the baseline conversion rate. The variant group has a target rate that reflects your minimum detectable effect, usually shortened to MDE. The calculator then combines those assumptions with your chosen confidence level and statistical power.

In practical terms, these inputs mean the following:

  • Baseline conversion rate: the expected performance of your control experience.
  • Minimum detectable effect: the smallest relative lift you care enough to detect.
  • Confidence level: the strictness of your false-positive threshold.
  • Statistical power: the probability of detecting a real effect when it truly exists.
  • Traffic allocation: the share of users sent to control versus variation.

The core formula

For a classic two-sample test of proportions with equal group sizes, one widely used approximation for the required sample size per group is:

n = [ ( Z(alpha) * sqrt(2 * p-bar * (1 – p-bar)) + Z(beta) * sqrt(p1 * (1 – p1) + p2 * (1 – p2)) ) ^ 2 ] / (p2 – p1) ^ 2

Where:

  • p1 is the baseline conversion rate.
  • p2 is the expected conversion rate under the variant.
  • p-bar is the average of p1 and p2.
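  • Z(alpha) is the z-score implied by your significance threshold: the standard normal quantile at 1 – alpha/2 for a two-sided test, or at 1 – alpha for a one-sided test.
  • Z(beta) is the z-score implied by your chosen power: the standard normal quantile at the power level, 1 – beta.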

If you use a two-sided 95% confidence level, the z-score for alpha is about 1.96. If you use 80% power, the z-score for beta is about 0.84. Those two values heavily influence the final sample size. Increasing confidence or power pushes the required number upward.
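As a rough illustration, the approximation above can be written in a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function name `sample_size_per_variant` and its parameters are assumptions made for this example, and the z-scores come from the standard library's `statistics.NormalDist`.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p1, p2, confidence=0.95, power=0.80, two_sided=True):
    """Approximate users per variant for a two-proportion test with equal groups."""
    alpha = 1 - confidence
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)  # about 1.96 for two-sided 95%
    z_beta = NormalDist().inv_cdf(power)      # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2                     # average of the two rates
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    # Round up, since a fractional user is not possible and rounding up is safer.
    return ceil(numerator / (p2 - p1) ** 2)


n = sample_size_per_variant(0.10, 0.11)
print(f"{n:,} users per variant, {2 * n:,} in total")
```

With a 10% baseline and an 11% target, this sketch lands at roughly 14,750 users per variant. Other tools may differ slightly if they apply a continuity correction or a different approximation.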

Why small effects need much larger samples

The denominator of the formula is the squared difference between the two conversion rates. That means the required sample grows very quickly as the effect size gets smaller. If your baseline conversion rate is 10% and you want to detect an uplift to 11%, the absolute difference is only 1 percentage point. Because the formula squares that difference, the sample requirement becomes much larger than if you were trying to detect a jump from 10% to 15%.

This is one of the biggest reasons experimentation programs struggle. Teams often care about subtle lifts, such as a 3% to 5% relative improvement, but they do not have enough daily traffic to support those goals in a reasonable time frame. The calculator helps you see that tradeoff immediately.
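To make the tradeoff concrete, the sketch below evaluates the same approximation for several relative lifts at a 10% baseline, using a two-sided test at 95% confidence and 80% power. The helper name `users_per_variant` and the chosen lifts are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.975)  # two-sided 95% confidence
Z_BETA = NormalDist().inv_cdf(0.80)    # 80% power


def users_per_variant(p1, p2):
    """Equal-split two-proportion approximation from the core formula."""
    p_bar = (p1 + p2) / 2
    root = Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar)) + Z_BETA * sqrt(
        p1 * (1 - p1) + p2 * (1 - p2)
    )
    return ceil(root ** 2 / (p2 - p1) ** 2)


baseline = 0.10
sizes = {}
for relative_lift in (0.03, 0.05, 0.10, 0.50):
    target = baseline * (1 + relative_lift)
    sizes[relative_lift] = users_per_variant(baseline, target)
    print(f"{relative_lift:.0%} lift -> {target:.2%}: {sizes[relative_lift]:,} per variant")
```

Halving the lift from 10% to 5% roughly quadruples the requirement, which is the squared denominator at work.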

Key inputs explained in plain language

1. Baseline conversion rate

Your baseline conversion rate should come from clean historical data. If your conversion rate changes significantly by weekday, device type, acquisition source, or season, using a single blended figure can mislead your estimate. In that case, either plan your test around the highest-priority segment or use a stable recent average over a representative period.

For example, a homepage signup test might have a baseline rate of 6.5%, while a checkout completion test may have a baseline closer to 38%. The formula behaves differently at different baselines because the variance of a proportion depends on p × (1 – p).
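A quick way to see the baseline effect is to compute p × (1 – p) for the two example rates. This is only an illustrative snippet; the labels and variable names are assumptions.

```python
from math import sqrt

examples = {"homepage signup": 0.065, "checkout completion": 0.38}

variances = {}
for label, p in examples.items():
    variances[label] = p * (1 - p)  # variance of a single yes/no outcome
    print(f"{label}: p = {p:.1%}, variance = {variances[label]:.4f}, "
          f"std dev = {sqrt(variances[label]):.3f}")
```

The variance peaks at p = 0.5 and shrinks toward either extreme, so the same absolute difference is easier to detect at a 6.5% baseline than at 38%.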

2. Minimum detectable effect

The minimum detectable effect is not a prediction. It is a business decision. It represents the smallest lift worth acting on. If a change would need to improve conversion by at least 7% relative to justify engineering effort, rollout risk, or design complexity, then 7% is a reasonable MDE. If your MDE is unrealistically tiny, your sample size may become too large to run.

A common mistake is choosing an MDE because it feels ambitious rather than because it is economically meaningful. The best teams tie MDE to revenue impact, product strategy, or operational cost.
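Because the MDE is stated as a relative lift while the formula needs two absolute rates, a small conversion step is involved. The sketch below shows that step; `target_rate` is a hypothetical helper name, and the 10% baseline is an assumed example value.

```python
def target_rate(baseline, relative_mde):
    """Convert a baseline rate and a relative MDE into the variant's target rate."""
    p2 = baseline * (1 + relative_mde)
    if not 0 < p2 < 1:
        raise ValueError("target rate must fall strictly between 0 and 1")
    return p2


baseline = 0.10
p2 = target_rate(baseline, 0.07)  # 7% relative MDE
absolute_difference = p2 - baseline
print(f"target rate {p2:.2%}, absolute difference {absolute_difference * 100:.2f} points")
```

Keeping relative and absolute effects separate avoids a common planning error: entering 7 percentage points when 7% relative was intended.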

3. Confidence level and significance

A 95% confidence level corresponds to a 5% significance level in a standard two-sided test. That means that when there is truly no difference between the variants, you accept roughly a 5% chance of declaring one anyway, under the model assumptions. In many commercial A/B testing programs, 95% is the default because it balances caution with practicality. More stringent thresholds, such as 99%, are sometimes used for higher-risk launches, but they require larger samples.

4. Statistical power

Power is the probability that your test will detect the effect size you care about if that effect truly exists. An 80% powered test is common. A 90% powered test reduces the chance of a false negative, but again increases the required sample size. If your organization frequently makes expensive product decisions based on experimental results, increasing power may be justified.
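Confidence and power enter the formula only through their z-scores. A minimal sketch of that mapping, using Python's standard library, is shown below; the helper names `z_for_confidence` and `z_for_power` are assumptions for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_for_confidence(confidence, two_sided=True):
    """Critical value for the significance threshold implied by a confidence level."""
    alpha = 1 - confidence
    return std_normal.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)


def z_for_power(power):
    """Standard normal quantile at the chosen power level."""
    return std_normal.inv_cdf(power)


for confidence in (0.90, 0.95, 0.99):
    print(f"confidence {confidence:.0%}: z = {z_for_confidence(confidence):.3f}")
for power in (0.80, 0.90):
    print(f"power {power:.0%}: z = {z_for_power(power):.3f}")
```

Because the two z-scores are added and then squared, moving either setting upward raises the required sample size.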

Sample size comparison table for common A/B testing scenarios

The table below shows realistic sample size estimates per variant for a two-sided test at 95% confidence and 80% power, assuming a 50/50 split. These are rounded planning figures and can vary slightly by method, continuity correction, or software implementation.

Baseline rate | Relative uplift | Expected variant rate | Absolute difference | Approx. users per variant | Total users needed
5.0% | 10% | 5.5% | 0.5 percentage points | 31,200 | 62,400
10.0% | 10% | 11.0% | 1.0 percentage point | 14,800 | 29,600
20.0% | 10% | 22.0% | 2.0 percentage points | 6,500 | 13,000
10.0% | 20% | 12.0% | 2.0 percentage points | 3,800 | 7,600
30.0% | 5% | 31.5% | 1.5 percentage points | 14,900 | 29,800
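The scenarios in the table can be recomputed with the approximation from the core formula section. The sketch below does so under the stated settings (two-sided, 95% confidence, 80% power, 50/50 split). The function name is an assumption, and the unrounded outputs can differ somewhat from rounded planning figures.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(baseline, relative_uplift, confidence=0.95, power=0.80):
    """Two-sided, equal-split approximation for a two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    root = z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(
        p1 * (1 - p1) + p2 * (1 - p2)
    )
    return ceil(root ** 2 / (p2 - p1) ** 2)


# (baseline rate, relative uplift) pairs taken from the table rows
scenarios = [(0.05, 0.10), (0.10, 0.10), (0.20, 0.10), (0.10, 0.20), (0.30, 0.05)]
rows = [(b, u, users_per_variant(b, u)) for b, u in scenarios]
for baseline, uplift, n in rows:
    print(f"{baseline:.1%} baseline, {uplift:.0%} uplift: "
          f"{n:,} per variant, {2 * n:,} total")
```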

The pattern is clear: smaller lifts and lower baselines often demand far more traffic. This is why a tiny checkout funnel experiment on a low-volume page can take weeks, while a bold pricing-page test on a high-traffic path might conclude much faster.

How to estimate test duration

Once you know the required sample size, the next step is estimating test length. If your calculator says you need 20,000 total users and your eligible traffic is 5,000 users per day, your shortest theoretical duration is four days (20,000 / 5,000). In reality, most teams should run longer than the raw arithmetic suggests so that the test spans full weekly cycles and normal traffic variation. As a rule of thumb, many experimentation teams avoid ending tests before at least one to two business cycles unless traffic is extremely stable.

  1. Calculate required total users across both variants.
  2. Divide by estimated daily eligible traffic.
  3. Add buffer for traffic volatility, implementation delays, and quality checks.
  4. Ensure the test covers weekday and weekend behavior if relevant.
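The steps above can be sketched as a small helper. The function name, the `buffer` fraction, and the choice to round up to whole weeks are all assumptions for illustration; the right buffer depends on your own traffic volatility.

```python
from math import ceil


def estimate_duration_days(total_users, daily_eligible_users, buffer=0.0, whole_weeks=True):
    """Estimate run time in days from required users and daily eligible traffic."""
    raw_days = total_users / daily_eligible_users  # step 2: divide by daily traffic
    days = ceil(raw_days * (1 + buffer))           # step 3: add a safety buffer
    if whole_weeks:
        days = ceil(days / 7) * 7                  # step 4: cover full weekly cycles
    return days


shortest = estimate_duration_days(20_000, 5_000, whole_weeks=False)
planned = estimate_duration_days(20_000, 5_000, buffer=0.2)
print(f"shortest theoretical duration: {shortest} days")
print(f"planned duration with buffer and full weeks: {planned} days")
```

For 20,000 total users at 5,000 eligible users per day, the raw arithmetic gives four days, while the planned figure with a 20% buffer rounds up to a full week.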

Real-world tradeoffs that affect the formula

Uneven traffic allocation

The cleanest statistical design is usually a 50/50 split because it minimizes variance for a fixed total sample. However, product teams sometimes use uneven splits like 90/10 or 67/33 to reduce risk on a new experience. That is valid operationally, but it increases total sample requirements. If your traffic is limited, an uneven split can materially lengthen the experiment.
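The equal-split formula does not cover uneven allocation directly. One common generalization, used here as an assumption rather than as this calculator's documented method, weights the pooled rate and the variance terms by the allocation ratio k = variant users / control users. The function name and the `variant_share` parameter are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_sizes_unequal(p1, p2, variant_share=0.5, confidence=0.95, power=0.80):
    """Return (control users, variant users) for a two-sided test with an uneven split."""
    k = variant_share / (1 - variant_share)  # variant users per control user
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + k * p2) / (1 + k)          # allocation-weighted pooled rate
    root = z_alpha * sqrt(p_bar * (1 - p_bar) * (1 + 1 / k)) + z_beta * sqrt(
        p1 * (1 - p1) + p2 * (1 - p2) / k
    )
    n_control = root ** 2 / (p2 - p1) ** 2
    return ceil(n_control), ceil(k * n_control)


totals = {}
for share in (0.50, 0.33, 0.10):
    control, variant = sample_sizes_unequal(0.10, 0.11, variant_share=share)
    totals[share] = control + variant
    print(f"variant share {share:.0%}: control {control:,}, "
          f"variant {variant:,}, total {totals[share]:,}")
```

With k = 1 the expression reduces to the equal-split formula. In this sketch, sending only 10% of traffic to the variant more than doubles the total users required.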

Multiple metrics and multiple testing

If your organization watches many metrics and repeatedly peeks at results, your actual false-positive risk can exceed the nominal alpha level from the simple formula. In those cases, more advanced methods such as alpha spending, false discovery control, or sequential testing may be appropriate. The calculator on this page is excellent for initial planning, but governance still matters.
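A back-of-the-envelope calculation shows how quickly the risk grows. The sketch below assumes the comparisons are independent, which real metrics and repeated looks at the same data usually are not, so treat the numbers as a rough illustration rather than an exact error rate. The function name is an assumption.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


alpha = 0.05
for comparisons in (1, 3, 5, 10):
    rate = familywise_error_rate(alpha, comparisons)
    print(f"{comparisons:>2} comparisons at alpha {alpha}: "
          f"{rate:.1%} chance of at least one false positive")
```

Under this simplification, ten independent comparisons at a 5% threshold carry about a 40% chance of at least one false positive.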

Binary conversion versus continuous outcomes

This formula is designed for binary outcomes such as purchased or not purchased, clicked or not clicked, signed up or not signed up. If your primary metric is revenue per visitor, average order value, or session duration, the sample size problem changes because you are no longer modeling a simple proportion. The variance structure is different, and you may need a t-test based or bootstrap-based planning method.
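For a continuous metric, one standard starting point is the normal-approximation formula for comparing two means with equal group sizes: n per group = 2 × (Z(alpha) + Z(beta))² × sigma² / delta². The sketch below is an assumption-laden illustration, not this calculator's method. The standard deviation and minimum difference are made-up example values, and skewed metrics such as revenue may need the t-test or bootstrap approaches mentioned above.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant_for_means(std_dev, min_difference, confidence=0.95, power=0.80):
    """Normal-approximation sample size per group for a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / min_difference ** 2)


# Hypothetical example: order values with a standard deviation of 30,
# and a smallest difference worth detecting of 2 (same currency units).
n_means = users_per_variant_for_means(std_dev=30.0, min_difference=2.0)
print(f"{n_means:,} users per variant")
```

Here the variance has to be estimated from historical data, whereas for a conversion rate it follows directly from the rate itself.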

Comparison table: confidence and power settings

The next table shows how changing statistical settings can affect sample size, holding the baseline at 10% and the target uplift at 10% relative, with equal allocation.

Confidence level | Power | Approx. users per variant | Total users | Planning implication
90% | 80% | 11,600 | 23,200 | Faster decisions, higher false-positive tolerance
95% | 80% | 14,800 | 29,600 | Common default in business experimentation
95% | 90% | 19,700 | 39,400 | More protection against false negatives
99% | 80% | 21,900 | 43,800 | Useful for high-risk or highly visible changes
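The effect of each setting can be checked by holding the rates fixed at 10% and 11% and varying only confidence and power. This is a sketch under the same two-sided, equal-split approximation; the function name is an assumption, and unrounded outputs can differ somewhat from rounded planning figures.

```python
from math import ceil, sqrt
from statistics import NormalDist

P1, P2 = 0.10, 0.11  # 10% baseline with a 10% relative uplift


def users_for_settings(confidence, power):
    """Users per variant for fixed rates P1 and P2 under the given settings."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (P1 + P2) / 2
    root = z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(
        P1 * (1 - P1) + P2 * (1 - P2)
    )
    return ceil(root ** 2 / (P2 - P1) ** 2)


settings = [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.80)]
results = [users_for_settings(c, p) for c, p in settings]
for (confidence, power), n in zip(settings, results):
    print(f"confidence {confidence:.0%}, power {power:.0%}: "
          f"{n:,} per variant, {2 * n:,} total")
```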

Best practices for using an A/B test sample size calculator

  • Use recent, representative baseline data. Old conversion rates from a different season or channel mix can distort your planning.
  • Choose an economically meaningful MDE. Ask what lift would justify implementation effort, not what lift looks exciting on a slide.
  • Prefer balanced allocation when possible. A 50/50 split usually reaches conclusions more efficiently.
  • Do not stop the test the moment significance appears. Early stopping without a sequential testing framework can inflate error rates.
  • Validate instrumentation before launch. Sample size cannot rescue broken event tracking.
  • Segment carefully. If you plan to analyze mobile, desktop, and country-level subgroups separately, each segment may need its own effective sample planning.

Common mistakes to avoid

Picking an MDE that is too small

Teams often set a tiny MDE because they want to catch every possible win. In reality, this can produce tests so large they never finish. A better approach is to prioritize changes with the potential for meaningful impact or to focus on higher-traffic funnel steps.

Ignoring practical significance

Statistical significance is not the same as business significance. A lift may be statistically detectable but too small to matter after engineering cost, support overhead, or long-term brand effects are considered.

Using the wrong primary metric

If the experiment is meant to improve purchases, but the sample size is planned around clicks because clicks are easier to move, you may end up with a successful click result and no meaningful downstream gain. Align your formula with the decision metric.

Final takeaway

The A/B test sample size calculator formula gives you a disciplined way to connect business goals with statistical rigor. By specifying a baseline conversion rate, an MDE you truly care about, a confidence level, and a power target, you can estimate the users required in each group before a test starts. This protects your team from underpowered experiments, rushed conclusions, and expensive misreads.

Use the calculator above as a planning instrument, not just a mathematical curiosity. If the sample size is too large for your available traffic, that is valuable information. It may mean you should test a bolder change with a larger expected effect, simplify the hypothesis, target a higher-volume audience, or switch to a more sensitive metric. Good experimentation begins long before the first user sees a variant. It begins with a realistic sample size plan.
