Conversion Optimization Toolkit

A/B Sample Size Calculator

Estimate how many visitors you need per variation before launching an A/B test. Enter your baseline conversion rate, minimum detectable effect, confidence level, power, and monthly traffic to get a statistically grounded sample size and projected test duration.

Calculator Inputs

Baseline conversion rate (%)

Current control conversion rate. Example: 5 means 5%.

Minimum detectable effect

Smallest lift worth detecting.

MDE type

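Relative: a percentage of the baseline (10 means a 10% lift over the baseline rate). Absolute: a direct percentage-point change (0.5 means baseline plus 0.5 points).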

Confidence level

Higher confidence reduces false positives but increases sample size.

Statistical power

Higher power lowers false negatives.

Test type

Two-sided is standard for most experimentation programs.

Monthly eligible visitors

Traffic that can actually enter the experiment.

Traffic allocation to test (%)

Use less than 100 if you are ramping the experiment slowly.

Scenario notes

Optional label for your own planning.

Results

Enter your assumptions and click “Calculate Sample Size” to see the required visitors per variant, total test sample, uplift assumptions, and estimated runtime.

Expert Guide to Using an A/B Sample Size Calculator

An A/B sample size calculator helps you answer one of the most important questions in experimentation: how much traffic do you need before you can trust the result of a test? Many teams spend time designing new landing pages, rewriting calls to action, or changing pricing layouts, but still struggle with test planning. The biggest mistake is often not creative quality or engineering speed. It is running an experiment without enough data. A reliable sample size estimate keeps you from ending a test too soon, overreacting to noise, or drawing false conclusions from random fluctuations.

At a practical level, an A/B sample size calculator estimates how many users each variant needs in order to detect a meaningful difference in conversion rate. That estimate depends on a small set of inputs: your current baseline conversion rate, the minimum detectable effect you care about, your confidence level, and your desired statistical power. When used correctly, these settings define the balance between speed and rigor. When used poorly, they can create a test that is either too weak to detect a real lift or so demanding that it becomes unrealistic to run.

Why sample size matters in experimentation

Every A/B test is influenced by random variation. If one version converts at 5.1% and another converts at 5.4% after a few hundred sessions, that gap might be real, or it might disappear after more traffic arrives. Sample size calculation gives you a disciplined way to decide how much evidence is enough before comparing versions.
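To make the 5.1% versus 5.4% example concrete, here is a minimal sketch of a pooled two-proportion z-test applied to that same gap at different traffic levels. The function name and the traffic levels are illustrative assumptions, and the observed rates are treated as exact for simplicity.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(rate_a, rate_b, n_per_variant):
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (rate_a + rate_b) / 2
    std_err = sqrt(2 * pooled * (1 - pooled) / n_per_variant)
    z = abs(rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(z))

# Same observed gap, very different strength of evidence.
for n in (500, 5_000, 50_000):
    p = two_sided_p_value(0.051, 0.054, n)
    print(f"{n:>6} visitors per variant: p = {p:.3f}")
```

Under these assumptions the gap is indistinguishable from noise at a few hundred sessions and only drops below the conventional 0.05 threshold once each variant has tens of thousands of visitors.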

Insufficient sample sizes lead to three common problems:

False positives: You think a variant won, but the observed gain was random noise.
False negatives: A real improvement exists, but your test was too underpowered to detect it.
Volatile decisions: Results swing day by day, encouraging teams to stop tests early based on incomplete evidence.

When stakeholders ask why an experiment needs several weeks of traffic, the answer is that statistical certainty is purchased with observations. More confidence and more power require more visitors. Smaller expected uplifts also require more visitors because subtle improvements are harder to detect than large ones.

The key inputs explained

The first input is your baseline conversion rate. This is the current probability that a user completes your target action under the control experience. If your signup form converts 5% of visitors, that becomes the starting point. Baseline rate matters because variability in binomial conversion data changes as conversion probability changes. Tests around very low rates can require surprisingly large samples if the effect size is small.

The second input is the minimum detectable effect, often shortened to MDE. This is the smallest improvement that would justify shipping the new version. For example, if your baseline is 5% and your MDE is a 10% relative lift, the calculator sizes the test so that it can reliably detect a move from 5.00% to 5.50%. Teams often choose an MDE that is too optimistic, which makes a test look faster on paper but less useful in practice. If a 2% uplift would matter commercially, then your test design should reflect that reality, even if it means a larger required sample.
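The relative versus absolute distinction is easy to get wrong, so a small sketch may help. The function name and argument conventions below are assumptions for illustration, not the calculator's actual code; inputs are percentages, as in the form above.

```python
def treatment_rate(baseline_pct, mde, mde_type="relative"):
    """Return the expected variant conversion rate as a proportion (0 to 1)."""
    baseline = baseline_pct / 100
    if mde_type == "relative":
        # 10 means a 10% lift over the baseline rate.
        target = baseline * (1 + mde / 100)
    elif mde_type == "absolute":
        # 0.5 means baseline plus 0.5 percentage points.
        target = baseline + mde / 100
    else:
        raise ValueError("mde_type must be 'relative' or 'absolute'")
    if not 0 < target < 1:
        raise ValueError("expected variant rate must fall between 0 and 1")
    return target

print(f"{treatment_rate(5, 10, 'relative'):.4f}")   # 10% relative lift on 5%
print(f"{treatment_rate(5, 0.5, 'absolute'):.4f}")  # +0.5 points on 5%
```

Both calls describe the same move from 5.00% to 5.50%, which is why the MDE type has to be stated alongside the number.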

The third input is the confidence level, which controls your tolerance for Type I error, the risk of declaring a difference when none exists. A 95% confidence level, equivalent to a 5% significance threshold, is a common standard in business experimentation. The fourth input is power, which is the probability of detecting a real effect if the true improvement is at least as large as your MDE. A power setting of 80% is standard, while 90% is stricter and requires more data.
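Confidence and power enter the calculation as z-scores from the standard normal distribution. The sketch below shows one way to derive them with Python's standard library; the function name and its percentage-based arguments are assumptions for illustration.

```python
from statistics import NormalDist

def z_scores(confidence_pct, power_pct, two_sided=True):
    """Return (z_alpha, z_beta) for the given confidence level and power."""
    alpha = 1 - confidence_pct / 100
    # A two-sided test splits the Type I error across both tails.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power_pct / 100)
    return z_alpha, z_beta

z_alpha, z_beta = z_scores(95, 80)
print(f"z_alpha = {z_alpha:.3f}, z_beta = {z_beta:.3f}")
```

For 95% confidence (two-sided) and 80% power this gives the familiar values of roughly 1.96 and 0.84.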

Finally, test planners should account for monthly eligible traffic and traffic allocation. These do not change the statistical requirement itself, but they convert sample size into an estimated runtime. This is essential for prioritization. If a test would require three months to detect a tiny improvement, you may be better off testing a bigger change or focusing on a higher-traffic page.
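The runtime conversion is simple arithmetic, sketched below. The function name, the 30-day month, and the example traffic figures are assumptions chosen for illustration.

```python
from math import ceil

def estimated_runtime(total_sample, monthly_eligible_visitors, allocation_pct=100):
    """Return (months, days, weeks) needed to collect total_sample visitors."""
    monthly_test_traffic = monthly_eligible_visitors * allocation_pct / 100
    if monthly_test_traffic <= 0:
        raise ValueError("test traffic must be positive")
    months = total_sample / monthly_test_traffic
    days = ceil(total_sample * 30 / monthly_test_traffic)  # assumes a 30-day month
    weeks = ceil(days / 7)  # whole weeks help cover weekday and weekend cycles
    return months, days, weeks

# Hypothetical plan: 62,000 total visitors, 40,000 eligible per month, 50% allocated.
months, days, weeks = estimated_runtime(62_000, 40_000, 50)
print(f"{months:.1f} months, about {days} days, or {weeks} full weeks")
```

Rounding up to whole weeks is a planning convenience rather than a statistical requirement.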

How the calculator estimates sample size

For a classic two-variant A/B test with equal allocation, the calculator uses a standard two-proportion sample size formula. It compares the baseline conversion rate for version A against the expected conversion rate for version B based on the minimum detectable effect. The calculation relies on normal approximations and z-scores associated with your chosen confidence level and power.

Set the baseline conversion rate for the control group.
Translate the MDE into an expected treatment conversion rate.
Determine the z-score for your confidence level and the z-score for your desired power.
Apply the two-sample proportion formula to estimate visitors needed per variant.
Multiply by two for total sample size, then divide by monthly test traffic (monthly eligible visitors multiplied by the traffic allocation) to estimate runtime.

This process is widely used in CRO, product experimentation, and digital analytics because it gives a practical approximation for planning. It is most appropriate for binary conversion outcomes such as purchases, signups, lead submissions, or completed onboarding steps.
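The steps above can be sketched in a few lines of Python. This version assumes the common unpooled normal-approximation formula, n = (z_alpha + z_beta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2. The calculator's exact formula is not stated here, and other tools may use a pooled variance or a continuity correction, so treat the function below as an illustration rather than its implementation.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, target, confidence=0.95, power=0.80, two_sided=True):
    """Visitors needed per variant to detect a move from baseline to target.

    Rates, confidence, and power are proportions (0.05 means 5%).
    """
    if baseline == target:
        raise ValueError("baseline and target rates must differ")
    alpha = 1 - confidence
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two groups.
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)

per_variant = sample_size_per_variant(0.05, 0.055)
print(f"per variant: {per_variant:,}  total: {2 * per_variant:,}")
```

For a 5.0% baseline and a 5.5% target at the default settings, this lands a little above 31,000 visitors per variant.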

Interpreting the output correctly

When you run the calculator, the most important number is the required sample size per variant. If the output says 15,000 visitors per variation, then version A needs about 15,000 and version B needs about 15,000, for roughly 30,000 total. That estimate assumes the test is run cleanly, traffic is randomized properly, and the final evaluation follows the planned stopping rule.

The estimated runtime is also important, but it should not be treated as a guaranteed completion date. Seasonal effects, weekday behavior, marketing campaigns, or traffic spikes can all alter actual duration. In addition, teams should avoid stopping exactly when the minimum sample threshold is reached if there are obvious operational anomalies, instrumentation problems, or severe imbalance in visitor quality between groups.

Another important point is that the calculator does not guarantee business significance. A statistically significant result can still be too small to matter financially. That is why choosing a realistic MDE is so important. An experiment should be designed around decisions, not just around p-values.

Sample size changes dramatically with smaller effects

One of the most useful planning insights is that sample size grows rapidly as the minimum detectable effect gets smaller. Detecting a 20% relative lift is much easier than detecting a 5% relative lift. This is why broad redesigns, pricing tests, checkout friction removal, or audience-specific offers often make stronger early candidates than tiny cosmetic changes.

Baseline Conversion Rate	MDE	Expected Variant Rate	Approx. Sample per Variant	Total Sample
5.0%	5% relative lift	5.25%	122,000	244,000
5.0%	10% relative lift	5.50%	31,000	62,000
5.0%	15% relative lift	5.75%	14,000	28,000
5.0%	20% relative lift	6.00%	8,000	16,000

The figures above are rounded examples that assume 95% confidence, 80% power, and a two-sided test. They show a planning reality many teams underestimate: halving the effect size you want to detect roughly quadruples your traffic requirement. If you run a modest-traffic site, this should shape your experimentation roadmap.
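The scaling can be checked with a short sketch. It assumes the same unpooled two-proportion formula, with z-scores hard-coded for two-sided 95% confidence and 80% power; the function name is illustrative.

```python
Z_ALPHA = 1.959964  # two-sided 95% confidence
Z_BETA = 0.841621   # 80% power

def visitors_per_variant(baseline, relative_lift):
    """Unrounded per-variant sample size for a relative lift over baseline."""
    target = baseline * (1 + relative_lift)
    variance = baseline * (1 - baseline) + target * (1 - target)
    return (Z_ALPHA + Z_BETA) ** 2 * variance / (target - baseline) ** 2

previous = None
for lift in (0.20, 0.10, 0.05):
    n = visitors_per_variant(0.05, lift)
    note = "" if previous is None else f" ({n / previous:.1f}x the previous row)"
    print(f"{lift:.0%} relative lift: about {n:,.0f} per variant{note}")
    previous = n
```

Because the difference between the two rates is squared in the denominator, each halving of the lift multiplies the requirement by a factor close to four.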

Confidence level and power tradeoffs

Confidence and power settings define how cautious you want to be. Tightening either setting will increase the required sample size. Most organizations use 95% confidence and 80% power because that combination offers a practical balance between risk and speed. Heavily regulated environments, mission-critical product changes, or expensive rollout decisions may justify stricter settings.

Confidence	Power	Approx. Sample per Variant for 5.0% to 5.5%	Planning Interpretation
90%	80%	24,000	Faster, but more tolerant of false positives
95%	80%	31,000	Common default for commercial A/B testing
95%	90%	41,000	Better at detecting true effects, slower to finish
99%	90%	59,000	Very strict standard, often impractical for low traffic tests

These examples illustrate why sample size should be discussed before implementation begins. Product managers, analysts, and executives often align quickly when they can see how testing standards translate into time-to-decision.
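One way to see the tradeoff without rerunning the full calculation: under the normal-approximation formula, confidence and power affect sample size only through the factor (z_alpha + z_beta)^2. The sketch below, with an assumed helper name, compares that factor against the 95% and 80% default for a two-sided test.

```python
from statistics import NormalDist

def z_factor(confidence, power):
    """(z_alpha + z_beta)^2 for a two-sided test; inputs are proportions."""
    inv = NormalDist().inv_cdf
    return (inv(1 - (1 - confidence) / 2) + inv(power)) ** 2

default = z_factor(0.95, 0.80)
for confidence, power in [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.90)]:
    ratio = z_factor(confidence, power) / default
    print(f"{confidence:.0%} confidence, {power:.0%} power: {ratio:.2f}x the default sample")
```

Under these assumptions, moving from 80% to 90% power adds roughly a third to the requirement, and moving to 99% confidence with 90% power nearly doubles it.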

Common mistakes when using an A/B sample size calculator

Using all site traffic instead of eligible traffic: only include visitors who can truly enter the experiment and reach the measured conversion opportunity.
Picking an unrealistic MDE: setting a very large MDE can make a test appear feasible while missing smaller but still valuable wins.
Ignoring seasonality: weekly cycles, sales periods, and campaign traffic can distort experiment timing and behavior.
Stopping early: peeking at results and ending a test the moment one variant looks ahead increases error rates.
Confusing statistical significance with business impact: a significant result may still have weak revenue implications.

A disciplined experimentation culture treats sample size as part of the test design, not as an afterthought. It is better to reject a weak test plan before launch than to spend engineering and design effort on an experiment that can never produce a trustworthy decision.
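The cost of stopping early can be illustrated with a small simulation of A/A tests, where both variants share the same true conversion rate, so every "significant" result is a false positive. All parameters below (number of simulations, looks, batch size, seed) are arbitrary assumptions for illustration.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided 95% confidence

def significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0 or pooled >= 1:
        return False
    std_err = sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a - conv_b) / n / std_err > Z_CRIT

def false_positive_rates(simulations=400, looks=10, batch=100, rate=0.05, seed=7):
    """Return (rate when stopping at any significant look, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        flagged = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            flagged = significant(conv_a, conv_b, look * batch)
            flagged_at_any_look = flagged_at_any_look or flagged
        peeking_hits += flagged_at_any_look
        final_hits += flagged  # result of the last look only
    return peeking_hits / simulations, final_hits / simulations

peeking, planned = false_positive_rates()
print(f"false positives with peeking: {peeking:.1%}, at planned end only: {planned:.1%}")
```

The peeking rate can never be lower than the planned-end rate in this setup, and in practice it is usually several times higher, which is the reason to fix the stopping rule before launch.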

Best practices for running stronger experiments

Base your baseline conversion rate on recent, representative data rather than old averages.
Choose an MDE tied to economics, such as expected revenue lift, lead value, or retention impact.
Use 95% confidence and 80% power as a sensible default unless your organization has a documented standard.
Allocate enough traffic to finish in a reasonable time, but ramp carefully if there is operational risk.
Document your stopping rule before the experiment starts.
Check instrumentation, assignment logic, and analytics tagging before reading performance results.
Review segment-level quality after the test, but avoid uncontrolled fishing across multiple comparisons.

These practices increase trust in outcomes and make your optimization program more scalable. Teams that consistently pre-register assumptions, calculate sample size, and evaluate tests against a clear decision framework tend to learn faster over time.

Authoritative resources for experimentation and statistics

If you want to deepen your understanding of evidence quality, significance testing, and research design, these public resources are excellent starting points:

National Institute of Standards and Technology (NIST) for statistical engineering and measurement guidance.
U.S. Census Bureau for survey methodology, sampling concepts, and practical discussions of sample design.
Penn State University Online Statistics Education for accessible explanations of hypothesis testing, power, and sample size.

While not specific to every commercial A/B testing platform, these sources are useful because they explain the statistical foundations behind planning decisions. Understanding those foundations makes you less dependent on black-box tools.

Final takeaway

An A/B sample size calculator is not just a convenience tool. It is a planning framework that protects your team from weak conclusions. By defining your baseline, selecting a realistic minimum detectable effect, setting appropriate confidence and power, and converting the result into a practical runtime estimate, you can decide whether a proposed experiment is worth running before you spend valuable traffic on it.

For growth teams, product managers, CRO specialists, and analysts, this turns experimentation into a more disciplined operating system. Better sample size planning means fewer misleading wins, fewer inconclusive tests, and more confidence when a result genuinely deserves to be shipped.