AB Test Power Calculator
Estimate the sample size needed for a statistically reliable A/B test. Enter your baseline conversion rate, minimum detectable lift, significance level, target power, and daily traffic to understand how many visitors you need per variant and how long the experiment may take.
Chart: required sample size by detectable lift
Expert Guide to Using an AB Test Power Calculator
An A/B test power calculator helps you answer one of the most practical questions in experimentation: how much traffic do you need before a result is credible? Teams often launch tests quickly, but many of those tests are underpowered. That means they do not have enough observations to reliably detect the effect size that actually matters to the business. When a test is underpowered, you raise the chance of missing a real improvement, wasting time on inconclusive results, and making decisions from noise rather than evidence.
This calculator is designed around a classic two-sample proportion framework. In plain English, it assumes you are comparing two conversion rates, one for the control experience and one for the variation. You enter a baseline conversion rate, choose the minimum detectable lift you care about, define a significance level, and select a target power. The output estimates the visitors required per variant, total sample size, expected test duration, and a visual chart showing how sample needs rise or fall as the detectable lift changes.
What statistical power means in A/B testing
Statistical power is the probability that your experiment will detect a true effect if that effect actually exists at the size you care about. If your power is 80%, it means your test has an 80% chance of detecting the chosen minimum detectable effect, assuming that effect is real and the model assumptions are reasonably valid. The complement, 20%, is the risk of a Type II error, which is the chance of missing a real difference.
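To make the definition concrete, here is a minimal Python sketch that approximates the power of a two-sided two-proportion z-test for a given per-variant sample size. The function name `approx_power` and the unpooled normal approximation are assumptions chosen for illustration; this page does not state the calculator's exact internals.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumes the unpooled normal approximation and ignores the tiny
    probability of rejecting in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = sqrt(
        p_control * (1 - p_control) / n_per_variant
        + p_variant * (1 - p_variant) / n_per_variant
    )
    return NormalDist().cdf(abs(p_variant - p_control) / se - z_alpha)


# 10% baseline, 11% variant (a 10% relative lift), 5,000 visitors per variant.
print(f"power = {approx_power(0.10, 0.11, 5_000):.2f}")
```

With only 5,000 visitors per variant, this scenario lands well below the usual 80% target, which is what "underpowered" means in practice.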
Power matters because business testing usually has costs. Every additional day in a test can delay product rollouts, pricing changes, landing page improvements, and feature decisions. At the same time, stopping early or planning too little traffic means your result may never be strong enough to trust. A power calculator helps balance those tradeoffs up front.
Key idea: sample size grows very quickly as your minimum detectable effect shrinks. At a 10% baseline, detecting a 5% relative lift takes roughly 15 times as much traffic as detecting a 20% lift.
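The steep growth follows from the normal approximation, in which required sample size is inversely proportional to the square of the absolute difference being detected. As a rough sketch, with $\delta$ (notation introduced here, not taken from the calculator) denoting that difference:

```latex
n \;\propto\; \frac{1}{\delta^{2}}
\qquad\Longrightarrow\qquad
\frac{n(\delta/2)}{n(\delta)} \approx 4
```

Halving the detectable lift therefore roughly quadruples the required traffic. The ratio is approximate because the variant's variance also shifts slightly with the lift.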
The four inputs that drive your sample size
1. Baseline conversion rate
The baseline conversion rate is your expected control performance. For an ecommerce checkout test, that may be purchase rate. For a lead generation page, it may be form completion rate. For a product onboarding experiment, it may be signup activation rate. This number matters because the variance of a conversion outcome, p(1 − p), depends on the underlying probability p: it is largest at a 50% conversion rate and shrinks as the rate approaches 0% or 100%.
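The short Python sketch below tabulates that per-visitor variance for a few conversion rates. The helper name `bernoulli_variance` is purely illustrative.

```python
def bernoulli_variance(p):
    """Variance of a single convert / no-convert outcome with rate p."""
    return p * (1 - p)


for rate in (0.02, 0.10, 0.50, 0.90, 0.98):
    print(f"rate={rate:.2f}  variance={bernoulli_variance(rate):.4f}")
```

Lower variance does not automatically mean a smaller test. When the lift is defined in relative terms, a low baseline also shrinks the absolute difference you are trying to detect, so low-baseline metrics typically need more visitors, not fewer.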
2. Minimum detectable effect or lift
The minimum detectable effect, often called MDE, is the smallest improvement worth detecting. This is not simply a statistical setting. It is a strategic choice. If a 1% lift would not change your roadmap or revenue enough to matter, there is no need to power the test around that tiny effect. On the other hand, if you only power for very large improvements, you might miss realistic wins that compound over time.
3. Significance level or alpha
Alpha is the probability of a false positive under the null hypothesis. In most product and marketing experiments, 0.05 is the default. Lowering alpha to 0.01 makes the evidence threshold more strict, which can be useful in high-risk decisions, but it also increases the required sample size. If your team runs many tests and acts aggressively on winners, keeping alpha disciplined becomes especially important.
4. Power
Common targets are 80% and 90%. An 80% powered test is often considered the practical standard. A 90% powered test is more conservative and better at detecting real effects, but it usually takes longer. If traffic is scarce and test speed matters, 80% may be a reasonable compromise. If the decision has large financial or user experience consequences, 90% may be worth the extra time.
How the calculator works
This AB test power calculator uses an approximate closed-form sample size formula for two independent proportions. It converts your baseline conversion rate into a decimal probability, applies your selected relative lift to estimate the variant conversion rate, and then computes the per-group sample size based on:
- the z critical value implied by your significance level
- the z value associated with your chosen power
- the variance of the control and variant conversion rates
- the absolute difference between those rates
For planning, this method is widely used because it is fast, interpretable, and close to what many experimentation platforms use as an initial estimate. It is best thought of as a design-phase calculator. Once a real test is running, your final analysis should still follow a clearly defined decision framework and account for any deviations from the original plan.
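The steps described above can be sketched in a few lines of Python. This is one common textbook form of the two-proportion formula (unpooled variance, two-sided test, equal allocation); it is an assumption about the general approach rather than a copy of this calculator's code, and the function and parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion test.

    Assumed formula (unpooled normal approximation):
        n = (z_alpha + z_power)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # variant rate after the relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # z value for target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# 10% baseline, 10% relative lift, alpha 0.05, power 80%.
print(sample_size_per_variant(0.10, 0.10))
```

Other variants of the formula pool the variance under the null hypothesis or round the z values, so different tools can disagree by a fraction of a percent on the same inputs.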
| Common setting | Alpha | Power | Z critical value | Z for power | Typical use |
|---|---|---|---|---|---|
| Standard business experiment | 0.05 | 0.80 | 1.96 for two-sided | 0.84 | General product and conversion optimization tests |
| More conservative detection | 0.05 | 0.90 | 1.96 for two-sided | 1.28 | High-impact UX or revenue experiments |
| Strict false positive control | 0.01 | 0.80 | 2.576 for two-sided | 0.84 | High-risk decisions where false wins are costly |
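The z values in the table can be reproduced from the standard normal distribution. The Python sketch below does that and also estimates how much extra traffic each setting costs relative to the standard one, assuming sample size scales with the square of the summed z values. The helper names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_two_sided(alpha):
    """Critical value for a two-sided test at significance level alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)


def z_for_power(power):
    """Z value associated with the target power."""
    return std_normal.inv_cdf(power)


def relative_sample_cost(alpha, power, ref_alpha=0.05, ref_power=0.80):
    """Sample size multiplier versus a reference setting, all else equal."""
    numerator = (z_two_sided(alpha) + z_for_power(power)) ** 2
    denominator = (z_two_sided(ref_alpha) + z_for_power(ref_power)) ** 2
    return numerator / denominator


for alpha, power in [(0.05, 0.80), (0.05, 0.90), (0.01, 0.80)]:
    print(
        f"alpha={alpha}  power={power}  "
        f"z_alpha={z_two_sided(alpha):.3f}  z_power={z_for_power(power):.3f}  "
        f"cost={relative_sample_cost(alpha, power):.2f}x"
    )
```

Under these assumptions, moving from 80% to 90% power costs roughly a third more traffic, and tightening alpha from 0.05 to 0.01 costs roughly half again as much.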
Example: why MDE changes everything
Suppose your current conversion rate is 10%, your alpha is 0.05, your target power is 80%, and you are running a balanced A/B test. The table below shows approximate per-variant sample sizes required to detect different relative lifts. These values are generated using the same two-proportion planning logic used in this calculator.
| Baseline conversion rate | Relative lift | Variant conversion rate | Absolute difference | Approx. sample per variant |
|---|---|---|---|---|
| 10.0% | 5% | 10.5% | 0.5 percentage points | 57,696 |
| 10.0% | 10% | 11.0% | 1.0 percentage point | 14,734 |
| 10.0% | 15% | 11.5% | 1.5 percentage points | 6,686 |
| 10.0% | 20% | 12.0% | 2.0 percentage points | 3,836 |
This table explains why many teams struggle with experimentation on low-traffic properties. A small, realistic lift can be economically meaningful, but it may also require a much longer test than stakeholders expect. If you know your traffic ceiling, a power calculator helps you pick feasible tests instead of launching experiments that are almost guaranteed to end without a decision.
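As a rough cross-check of the table above, the Python sketch below recomputes the per-variant sample sizes. It assumes the unpooled two-proportion approximation with exact normal quantiles, which is not confirmed as the calculator's exact method, so expect results within about 1% of the table rather than identical values. The function name `sample_size_per_variant` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Unpooled two-proportion approximation (an assumed formula)."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum**2 * variance_sum / (p2 - p1) ** 2)


# Same scenario as the table: 10% baseline, alpha 0.05, power 80%.
for lift in (0.05, 0.10, 0.15, 0.20):
    n = sample_size_per_variant(0.10, lift)
    print(f"lift={lift:.0%}  variant={0.10 * (1 + lift):.1%}  n_per_variant={n:,}")
```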
How to choose a realistic MDE
A good MDE is tied to business value. Start by asking what lift would make the change worth shipping. For example, if a landing page gets 100,000 monthly visits and converts at 8%, then a 5% relative lift would move conversion to 8.4%. That is 400 extra conversions per 100,000 visits. Depending on your average order value or lead value, that may be very meaningful. If the impact is minor, a larger MDE may be more practical.
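The arithmetic in that example is simple enough to script when you want to compare several candidate lifts. A minimal Python sketch, with an illustrative function name:

```python
def extra_conversions(visits, baseline, relative_lift):
    """Additional conversions produced by a relative lift on a baseline rate."""
    return visits * baseline * relative_lift


# 100,000 monthly visits, 8% baseline, 5% relative lift.
print(round(extra_conversions(100_000, 0.08, 0.05)))  # 400
```

Multiplying the result by your own average order value or lead value turns the lift into a revenue figure you can weigh against the cost of running the test.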
- Estimate the current baseline as accurately as possible from recent, stable data.
- Translate possible uplifts into revenue, margin, retention, or lead value.
- Select the smallest effect that would change a real decision.
- Check whether the required sample size fits your traffic and timeline.
- If it does not fit, test a bolder change or a higher-impact metric.
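The feasibility check in the last two steps comes down to dividing the total required sample by daily traffic. The Python sketch below assumes that all daily visitors enter the test and are split evenly across variants; the function name and the 2,000 visitors per day figure are hypothetical.

```python
from math import ceil


def estimated_duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed to collect the planned sample, assuming all daily
    traffic enters the test and is split evenly across variants."""
    total_needed = n_per_variant * variants
    return ceil(total_needed / daily_visitors)


# Per-variant sizes from the 10% baseline example, at a hypothetical
# 2,000 visitors per day.
for lift, n in [("5%", 57_696), ("10%", 14_734), ("20%", 3_836)]:
    print(f"lift={lift}  days={estimated_duration_days(n, 2_000)}")
```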
Common mistakes teams make
Stopping too early
If you peek every day and stop as soon as one variant looks good, your false positive rate can rise beyond the nominal alpha. Planning the sample size before launch and respecting the decision rule reduces this risk.
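A small simulation makes the inflation visible. The Python sketch below runs A/A tests, where both arms share the same true rate, and compares how often a daily peek ever crosses the significance threshold with how often a single final look does. Every parameter (10% rate, 200 visitors per arm per day, 14 days, 400 runs) is a hypothetical choice, and the pooled z-test is a simplification.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test on cumulative counts (n visitors per arm)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate_aa_tests(rate=0.10, daily_n=200, days=14, runs=400,
                      alpha=0.05, seed=1):
    """Return (false positive rate with daily peeks, with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_significant = False
        for day in range(1, days + 1):
            # Both arms share the same true rate, so any "win" is noise.
            conv_a += sum(rng.random() < rate for _ in range(daily_n))
            conv_b += sum(rng.random() < rate for _ in range(daily_n))
            significant = is_significant(conv_a, conv_b, day * daily_n, z_crit)
            ever_significant = ever_significant or significant
        peek_hits += ever_significant  # would have stopped on some day
        final_hits += significant      # result of the last day's look only
    return peek_hits / runs, final_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"daily peeking: {peeking_rate:.1%}   single final look: {fixed_rate:.1%}")
```

Because the final look is one of the daily peeks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it is usually several times higher.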
Using unrealistic baseline rates
If you base your test on a seasonal spike, a paid campaign surge, or stale historical performance, your planning numbers can be badly off. Use recent traffic and conversion behavior that match the expected test conditions.
Choosing an MDE that is too small
Many teams say they want to detect tiny uplifts because every bit of conversion helps. That may be true economically, but if your site traffic cannot support the sample size, the test is not feasible. It is better to make a stronger product or creative change that has a chance of producing a larger effect.
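One way to check feasibility is to turn the question around and ask what lift your traffic can actually support. The Python sketch below finds that value by bisection on the unpooled two-proportion approximation, which is an assumed formula rather than the calculator's confirmed method. The function name and the traffic figure in the example are hypothetical.

```python
from statistics import NormalDist


def detectable_relative_lift(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given traffic, found by
    bisection on the unpooled two-proportion approximation."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)

    def required_n(lift):
        p1, p2 = baseline, baseline * (1 + lift)
        return z_sum**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

    low, high = 1e-6, (1 - baseline) / baseline  # keep the variant rate <= 100%
    for _ in range(100):
        mid = (low + high) / 2
        if required_n(mid) > n_per_variant:
            low = mid   # this lift needs more traffic than is available
        else:
            high = mid  # this lift is detectable; try a smaller one
    return high


# Hypothetical budget: 14,000 visitors per variant at a 10% baseline.
print(f"{detectable_relative_lift(0.10, 14_000):.1%}")
```

If the answer is larger than any lift you realistically expect, the test is a candidate for redesign rather than launch.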
Ignoring practical significance
A result can be statistically significant and still be unimportant to the business. Power calculators help you think in advance about effect size, not only p-values. That shift is one of the healthiest habits in experimentation.
Interpreting calculator results responsibly
The output of a power calculator is a planning estimate, not a guarantee. Real-world tests involve missing data, bot traffic, segmented audiences, triggered exposures, unequal splits, novelty effects, and metric definitions that can all affect final inference. Use the estimate as a strong starting point, then validate your instrumentation and experiment design before launch.
- If the estimated duration is too long, consider broadening the audience or allocating a larger share of traffic to the test.
- If required sample size is too high, revisit the MDE and test a larger expected change.
- If your business risk is high, consider stricter alpha or higher power and accept longer runtime.
- If your metric is rare, look for a leading indicator with higher event frequency, while keeping business relevance.
Why two-sided tests are often the better default
A two-sided test checks for any meaningful difference, whether positive or negative. It is more conservative than a one-sided test because alpha is split across both tails of the distribution, which raises the critical value a result must clear. In product work, two-sided tests are usually safer because changes can unexpectedly hurt behavior. One-sided tests can make sense when only one direction would matter and that choice is locked before the experiment starts, but teams should be careful not to choose one-sided tests after seeing the data.
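The cost of that extra caution can be estimated from the critical values alone. The Python sketch below compares one-sided and two-sided thresholds at alpha 0.05 and 80% power, assuming sample size scales with the square of the summed z values. The helper name `critical_z` is illustrative.

```python
from statistics import NormalDist


def critical_z(alpha, two_sided=True):
    """Critical z value; a two-sided test splits alpha across both tails."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


z_power = NormalDist().inv_cdf(0.80)
two_sided = critical_z(0.05, two_sided=True)
one_sided = critical_z(0.05, two_sided=False)
# Ratio of required samples, all other inputs held equal.
ratio = ((one_sided + z_power) / (two_sided + z_power)) ** 2

print(f"two-sided z={two_sided:.3f}  one-sided z={one_sided:.3f}  "
      f"one-sided needs {ratio:.0%} of the two-sided sample")
```

The saving from a one-sided test is real but modest, which is part of why the safer two-sided default is usually worth the extra traffic.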
When you might need more than a simple calculator
As your experimentation program matures, you may need more advanced methods than a basic two-proportion calculator. Examples include sequential testing, multi-armed experiments, Bayesian analysis, CUPED or variance reduction methods, guardrail metrics, and corrections for multiple comparisons. Those approaches can be powerful, but they do not eliminate the need for planning. You still need a clear idea of baseline performance, meaningful effect size, and decision risk.
Authoritative resources for deeper study
If you want to go beyond a planning calculator and study the underlying statistics, these sources are excellent starting points:
- NIST Engineering Statistics Handbook for hypothesis testing, confidence intervals, and experiment design foundations.
- Penn State STAT resources for practical explanations of inference, testing, and sample size concepts.
- NCBI Bookshelf for research references on study power, error rates, and statistical interpretation.
Final takeaway
An AB test power calculator is not just a statistics tool. It is a planning tool for better business decisions. By defining your baseline conversion rate, the smallest worthwhile improvement, your tolerance for false positives, and your desired detection power, you create a realistic frame for experimentation. That frame helps you avoid underpowered tests, cut down on ambiguous outcomes, and focus your testing roadmap on experiments that can actually change decisions.
Use the calculator above before launch, not after the fact. If the sample size is too large for your available traffic, that is valuable information. It means you should redesign the experiment, choose a larger intervention, adjust the audience, or reconsider which metric to optimize. Strong experimentation programs win not just because they analyze results well, but because they plan tests well from the beginning.
This page provides planning estimates for educational and operational use. Final experiment analysis should follow your organization’s statistical policy, instrumentation checks, and pre-registered decision rules where appropriate.