A/B Sample Size Calculator
Estimate how many visitors you need per variation before launching an A/B test. Enter your baseline conversion rate, minimum detectable effect, confidence level, power, and monthly traffic to get a statistically grounded sample size and projected test duration.
Expert Guide to Using an A/B Sample Size Calculator
An A/B sample size calculator helps you answer one of the most important questions in experimentation: how much traffic do you need before you can trust the result of a test? Many teams spend time designing new landing pages, rewriting calls to action, or changing pricing layouts, but still struggle with test planning. The biggest mistake is often not creative quality or engineering speed. It is running an experiment without enough data. A reliable sample size estimate keeps your test from ending too soon, overreacting to noise, or drawing false conclusions from random fluctuations.
At a practical level, an A/B sample size calculator estimates how many users each variant needs in order to detect a meaningful difference in conversion rate. That estimate depends on a small set of inputs: your current baseline conversion rate, the minimum detectable effect you care about, your confidence level, and your desired statistical power. When used correctly, these settings define the balance between speed and rigor. When used poorly, they can create a test that is either too weak to detect a real lift or so demanding that it becomes unrealistic to run.
Why sample size matters in experimentation
Every A/B test is influenced by random variation. If one version converts at 5.1% and another converts at 5.4% after a few hundred sessions, that gap might be real, or it might disappear after more traffic arrives. Sample size calculation gives you a disciplined way to decide how much evidence is enough before comparing versions.
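To make the 5.1% versus 5.4% example concrete, here is a minimal sketch of a pooled two-proportion z-test. It assumes equal group sizes and a two-sided test; the function name and the visitor counts (500 and 200,000 per variant) are illustrative assumptions, not values from the calculator.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(rate_a, rate_b, n_per_variant):
    """Two-sided p-value for a pooled two-proportion z-test (equal group sizes)."""
    pooled = (rate_a + rate_b) / 2
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed gap, very different amounts of evidence (hypothetical counts).
p_small = two_proportion_p_value(0.051, 0.054, 500)
p_large = two_proportion_p_value(0.051, 0.054, 200_000)
print(f"500 per variant:     p = {p_small:.3f}")
print(f"200,000 per variant: p = {p_large:.6f}")
```

With a few hundred sessions the gap is well within what chance alone produces; the same gap only becomes convincing once far more traffic has accumulated.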
Insufficient sample sizes lead to three common problems:
- False positives: You think a variant won, but the observed gain was random noise.
- False negatives: A real improvement exists, but your test was too underpowered to detect it.
- Volatile decisions: Results swing day by day, encouraging teams to stop tests early based on incomplete evidence.
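The false-negative risk can be quantified by asking how much power a test has at a given sample size. The sketch below uses a normal approximation with unpooled variances and ignores the negligible chance of a significant result in the wrong direction; the function name and the 5,000-visitor scenario are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline, relative_mde, n_per_variant, confidence=0.95):
    """Approximate power of a two-sided two-proportion test (normal approximation)."""
    treatment = baseline * (1 + relative_mde)
    variance_sum = baseline * (1 - baseline) + treatment * (1 - treatment)
    std_err = sqrt(variance_sum / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return NormalDist().cdf(abs(treatment - baseline) / std_err - z_crit)


# Hypothetical scenario: a true 10% relative lift on a 5% baseline,
# but only 5,000 visitors per variant.
underpowered = approximate_power(0.05, 0.10, 5_000)
print(f"Power with 5,000 per variant: {underpowered:.0%}")
```

Under these assumptions the test would miss a real lift of that size most of the time, which is exactly the false-negative problem described above.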
When stakeholders ask why an experiment needs several weeks of traffic, the answer is that statistical certainty is purchased with observations. More confidence and more power require more visitors. Smaller expected uplifts also require more visitors because subtle improvements are harder to detect than large ones.
The key inputs explained
The first input is your baseline conversion rate. This is the current probability that a user completes your target action under the control experience. If your signup form converts 5% of visitors, that becomes the starting point. Baseline rate matters because variability in binomial conversion data changes as conversion probability changes. Tests around very low rates can require surprisingly large samples if the effect size is small.
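To illustrate how the baseline rate drives the requirement, the following sketch holds the relative lift fixed at 10% and varies only the baseline. It assumes a common unpooled two-proportion formula with 95% confidence (two-sided) and 80% power; the calculator's exact formula may differ slightly, and the three baselines are arbitrary examples.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors per variant under a normal approximation with unpooled variances."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Same 10% relative lift, three hypothetical baselines.
by_baseline = {b: sample_size_per_variant(b, 0.10) for b in (0.01, 0.05, 0.20)}
for baseline, n in by_baseline.items():
    print(f"baseline {baseline:.0%}: about {n:,} visitors per variant")
```

The lower the baseline, the smaller the absolute gap a given relative lift represents, so the required sample grows sharply.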
The second input is the minimum detectable effect, often shortened to MDE. This is the smallest improvement that would justify shipping the new version. For example, if your baseline is 5% and your MDE is a 10% relative lift, the calculator sizes the test so that it can reliably detect a move from 5.00% to 5.50%. Teams often choose an MDE that is too optimistic, meaning larger than the lift they can realistically expect, which makes a test look faster on paper but less useful in practice. If a 2% uplift would matter commercially, then your test design should reflect that reality, even if it means a larger required sample.
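Because relative and absolute lifts are easy to confuse, a small sketch of the conversion may help. The function names are illustrative assumptions; the arithmetic mirrors the 5.00% to 5.50% example above.

```python
def treatment_rate(baseline, relative_mde):
    """Expected variant conversion rate for a relative lift over the baseline."""
    return baseline * (1 + relative_mde)


def absolute_lift(baseline, relative_mde):
    """The same lift expressed as an absolute difference in conversion rate."""
    return treatment_rate(baseline, relative_mde) - baseline


# A 10% relative lift on a 5% baseline is a 0.5 percentage point absolute lift.
print(f"variant rate:  {treatment_rate(0.05, 0.10):.2%}")
print(f"absolute lift: {absolute_lift(0.05, 0.10):.2%}")
```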
The third input is the confidence level, which controls your tolerance for Type I error, the risk of declaring a difference when none exists. A 95% confidence level is a common standard in business experimentation. The fourth input is power, which is the probability of detecting a real effect if the true improvement is at least as large as your MDE. A power setting of 80% is standard, while 90% is stricter and requires more data.
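Confidence and power enter the calculation as z-scores from the standard normal distribution. The sketch below assumes a two-sided test, which is a common convention but not something the calculator states explicitly; the function names are illustrative.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Critical z-value for a two-sided test at the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_for_power(power):
    """z-value corresponding to the desired power (1 minus the Type II error rate)."""
    return NormalDist().inv_cdf(power)


for confidence in (0.90, 0.95, 0.99):
    print(f"confidence {confidence:.0%}: z = {z_for_confidence(confidence):.3f}")
for power in (0.80, 0.90):
    print(f"power {power:.0%}: z = {z_for_power(power):.3f}")
```

Both z-scores grow as the settings tighten, which is why stricter confidence or power always costs more traffic.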
Finally, test planners should account for monthly eligible traffic and traffic allocation. These do not change the statistical requirement itself, but they convert sample size into an estimated runtime. This is essential for prioritization. If a test would require three months to detect a tiny improvement, you may be better off testing a bigger change or focusing on a higher traffic page.
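The conversion from sample size to runtime is simple arithmetic. This sketch assumes traffic arrives evenly across a 30-day month and that `allocation` is the share of eligible traffic entering the experiment; all names and the example numbers are hypothetical.

```python
from math import ceil


def estimated_runtime_days(n_per_variant, monthly_eligible_visitors,
                           allocation=1.0, variants=2, days_per_month=30):
    """Days needed to collect the full sample, assuming evenly spread traffic."""
    total_needed = n_per_variant * variants
    daily_test_traffic = monthly_eligible_visitors * allocation / days_per_month
    return ceil(total_needed / daily_test_traffic)


# Hypothetical plan: 31,000 per variant, 100,000 eligible visitors per month,
# half of that traffic allocated to the experiment.
days = estimated_runtime_days(31_000, 100_000, allocation=0.5)
print(f"Estimated runtime: {days} days")
```

In practice it is common to round the result up to whole weeks so that every weekday is represented equally.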
How the calculator estimates sample size
For a classic two-variant A/B test with equal allocation, the calculator uses a standard two-proportion sample size formula. It compares the baseline conversion rate for version A against the expected conversion rate for version B based on the minimum detectable effect. The calculation relies on normal approximations and z-scores associated with your chosen confidence level and power.
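One common form of the two-proportion formula is shown below. It is a sketch of the standard normal-approximation result with unpooled variances and a two-sided test; the calculator may use a slightly different variant (for example, a pooled variance under the null), which changes the result only marginally.

```latex
n \;=\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
              \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}
             {\left(p_2 - p_1\right)^{2}}
```

Here $n$ is the number of visitors per variant, $p_1$ is the baseline conversion rate, $p_2 = p_1(1 + \text{MDE})$ is the expected treatment rate for a relative MDE, $\alpha = 1 - \text{confidence}$, and $1 - \beta$ is the power.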
- Set the baseline conversion rate for the control group.
- Translate the MDE into an expected treatment conversion rate.
- Determine the z-score for your confidence level and the z-score for your desired power.
- Apply the two-sample proportion formula to estimate visitors needed per variant.
- Multiply by two for total sample size, then divide by monthly test traffic to estimate runtime.
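The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (two-sided test, unpooled variances, equal allocation), not the calculator's actual source; the function names and the 50,000 monthly visitors are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    # Step 1: baseline conversion rate for the control group.
    p1 = baseline
    # Step 2: translate the relative MDE into an expected treatment rate.
    p2 = baseline * (1 + relative_mde)
    # Step 3: z-scores for the confidence level (two-sided) and the power.
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Step 4: two-sample proportion formula, visitors per variant.
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def plan_test(baseline, relative_mde, monthly_test_traffic,
              confidence=0.95, power=0.80):
    per_variant = sample_size_per_variant(baseline, relative_mde, confidence, power)
    # Step 5: double for the total, then divide by monthly test traffic.
    total = 2 * per_variant
    return {"per_variant": per_variant, "total": total,
            "months": total / monthly_test_traffic}


plan = plan_test(0.05, 0.10, monthly_test_traffic=50_000)
print(f"{plan['per_variant']:,} per variant, {plan['total']:,} total, "
      f"about {plan['months']:.1f} months")
```

For a 5% baseline and a 10% relative lift this lands at roughly 31,000 visitors per variant.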
This process is widely used in CRO, product experimentation, and digital analytics because it gives a practical approximation for planning. It is most appropriate for binary conversion outcomes such as purchases, signups, lead submissions, or completed onboarding steps.
Interpreting the output correctly
When you run the calculator, the most important number is the required sample size per variant. If the output says 15,000 visitors per variation, then version A needs about 15,000 and version B needs about 15,000, for roughly 30,000 total. That estimate assumes the test is run cleanly, traffic is randomized properly, and the final evaluation follows the planned stopping rule.
The estimated runtime is also important, but it should not be treated as a guaranteed completion date. Seasonal effects, weekday behavior, marketing campaigns, or traffic spikes can all alter actual duration. In addition, teams should avoid stopping exactly when the minimum sample threshold is reached if there are obvious operational anomalies, instrumentation problems, or severe imbalance in visitor quality between groups.
Another important point is that the calculator does not guarantee business significance. A statistically significant result can still be too small to matter financially. That is why choosing a realistic MDE is so important. An experiment should be designed around decisions, not just around p-values.
Sample size changes dramatically with smaller effects
One of the most useful planning insights is that sample size grows rapidly as the minimum detectable effect gets smaller. Detecting a 20% relative lift is much easier than detecting a 5% relative lift. This is why broad redesigns, pricing tests, checkout friction removal, or audience-specific offers often make stronger early candidates than tiny cosmetic changes.
| Baseline Conversion Rate | MDE | Expected Variant Rate | Approx. Sample per Variant | Total Sample |
|---|---|---|---|---|
| 5.0% | 5% relative lift | 5.25% | 122,000 | 244,000 |
| 5.0% | 10% relative lift | 5.50% | 31,000 | 62,000 |
| 5.0% | 15% relative lift | 5.75% | 14,000 | 28,000 |
| 5.0% | 20% relative lift | 6.00% | 8,000 | 16,000 |
The figures above are rounded examples assuming 95% confidence and 80% power. They show a planning reality many teams underestimate: halving the effect size you want to detect roughly quadruples your traffic requirement. If your site has modest traffic, this should shape your experimentation roadmap.
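A short sweep over the MDE makes the growth pattern easy to verify. The sketch assumes the unpooled normal-approximation formula with 95% two-sided confidence and 80% power, so its output may differ slightly from rounded published figures that use another variant of the formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors per variant under a normal approximation with unpooled variances."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Sweep the relative lift at a fixed 5% baseline.
sizes = {lift: sample_size_per_variant(0.05, lift) for lift in (0.05, 0.10, 0.15, 0.20)}
for lift, n in sizes.items():
    print(f"{lift:.0%} relative lift: {n:,} per variant, {2 * n:,} total")

# Halving the MDE from 10% to 5% costs roughly four times the traffic.
print(f"ratio: {sizes[0.05] / sizes[0.10]:.1f}x")
```

The near-quadrupling follows from the squared difference in the denominator of the formula: half the gap means one quarter of the denominator.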
Confidence level and power tradeoffs
Confidence and power settings define how cautious you want to be. Tightening either setting will increase the required sample size. Most organizations use 95% confidence and 80% power because it offers a practical balance between risk and speed. Heavily regulated environments, mission-critical product changes, or expensive rollout decisions may justify stricter settings.
| Confidence | Power | Approx. Sample per Variant for 5.0% to 5.5% | Planning Interpretation |
|---|---|---|---|
| 90% | 80% | 25,000 | Faster, but more tolerant of false positives |
| 95% | 80% | 31,000 | Common default for commercial A/B testing |
| 95% | 90% | 42,000 | Better at detecting true effects, slower to finish |
| 99% | 90% | 59,000 | Very strict standard, often impractical for low traffic tests |
These examples illustrate why sample size should be discussed before implementation begins. Product managers, analysts, and executives often align quickly when they can see how testing standards translate into time-to-decision.
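The same kind of sweep can be run over confidence and power for a fixed 5.0% to 5.5% comparison. As before, this is a sketch assuming a two-sided test and unpooled variances, so small differences from rounded table values are expected.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence, power):
    """Visitors per variant under a normal approximation with unpooled variances."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


settings = [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.90)]
results = [sample_size_per_variant(0.05, 0.10, c, p) for c, p in settings]
for (confidence, power), n in zip(settings, results):
    print(f"confidence {confidence:.0%}, power {power:.0%}: {n:,} per variant")
```

Sharing output like this with stakeholders before launch turns an abstract debate about rigor into a concrete choice about weeks of runtime.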
Common mistakes when using an A/B sample size calculator
- Using all site traffic instead of eligible traffic: only include visitors who can truly enter the experiment and reach the measured conversion opportunity.
- Picking an unrealistic MDE: setting a very large MDE can make a test appear feasible while missing smaller but still valuable wins.
- Ignoring seasonality: weekly cycles, sales periods, and campaign traffic can distort experiment timing and behavior.
- Stopping early: peeking at results and ending a test the moment one variant looks ahead increases error rates.
- Confusing statistical significance with business impact: a significant result may still have weak revenue implications.
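The cost of stopping early can be demonstrated with a small simulation of A/A tests, where both variants share the same true conversion rate and every "win" is a false positive. This is an illustrative sketch: the number of simulations, looks, visitors per look, and the seed are arbitrary assumptions, and the significance check is a simple pooled z-test at 95% confidence.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% confidence


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / std_err > Z_CRIT


def simulate_aa_tests(simulations=400, looks=10, visitors_per_look=300,
                      rate=0.05, seed=7):
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        flagged_at_any_look = significant_now = False
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or significant_now
        peeking_hits += flagged_at_any_look  # stop the first time a look "wins"
        fixed_hits += significant_now        # evaluate only at the planned end
    return peeking_hits / simulations, fixed_hits / simulations


peek_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with peeking:      {peek_rate:.1%}")
print(f"false positives at planned sample: {fixed_rate:.1%}")
```

Evaluating only at the planned sample keeps the false-positive rate near the nominal 5%, while acting on any of the interim looks pushes it noticeably higher.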
A disciplined experimentation culture treats sample size as part of the test design, not as an afterthought. It is better to reject a weak test plan before launch than to spend engineering and design effort on an experiment that can never produce a trustworthy decision.
Best practices for running stronger experiments
- Base your baseline conversion rate on recent, representative data rather than old averages.
- Choose an MDE tied to economics, such as expected revenue lift, lead value, or retention impact.
- Use 95% confidence and 80% power as a sensible default unless your organization has a documented standard.
- Allocate enough traffic to finish in a reasonable time, but ramp carefully if there is operational risk.
- Document your stopping rule before the experiment starts.
- Check instrumentation, assignment logic, and analytics tagging before reading performance results.
- Review segment-level quality after the test, but avoid uncontrolled multiple comparison fishing.
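The warning about multiple comparisons in the last item can be put in numbers. The sketch below assumes the segment comparisons are independent, which is a simplification, and shows a Bonferroni adjustment as one conservative way to keep the overall error rate in check; the function names are illustrative.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison threshold that keeps the familywise rate at or below alpha."""
    return alpha / comparisons


for k in (1, 5, 10, 20):
    print(f"{k:>2} segments: familywise error {familywise_error_rate(0.05, k):.1%}, "
          f"Bonferroni threshold {bonferroni_alpha(0.05, k):.4f}")
```

Checking ten segments at the usual 5% threshold gives roughly a 40% chance of at least one spurious "winner" under these assumptions.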
These practices increase trust in outcomes and make your optimization program more scalable. Teams that consistently pre-register assumptions, calculate sample size, and evaluate tests against a clear decision framework tend to learn faster over time.
Authoritative resources for experimentation and statistics
If you want to deepen your understanding of evidence quality, significance testing, and research design, these public resources are excellent starting points:
- National Institute of Standards and Technology (NIST) for statistical engineering and measurement guidance.
- U.S. Census Bureau for survey methodology, sampling concepts, and practical discussions of sample design.
- Penn State University Online Statistics Education for accessible explanations of hypothesis testing, power, and sample size.
While not specific to every commercial A/B testing platform, these sources are useful because they explain the statistical foundations behind planning decisions. Understanding those foundations makes you less dependent on black-box tools.
Final takeaway
An A/B sample size calculator is not just a convenience tool. It is a planning framework that protects your team from weak conclusions. By defining your baseline, selecting a realistic minimum detectable effect, setting appropriate confidence and power, and converting the result into a practical runtime estimate, you can decide whether a proposed experiment is worth running before you spend valuable traffic on it.
For growth teams, product managers, CRO specialists, and analysts, this turns experimentation into a more disciplined operating system. Better sample size planning means fewer misleading wins, fewer inconclusive tests, and more confidence when a result genuinely deserves to be shipped.