A/B Testing Sample Size Calculator
Estimate how many visitors you need before launching an A/B test with confidence. Enter your baseline conversion rate, minimum detectable effect, significance level, power, and traffic to get a practical sample size and test duration estimate.
How to calculate sample size for A/B testing the right way
When marketers, product managers, and growth teams search for “ab testing calculate sample size,” they are usually trying to answer one high-stakes question: how much traffic is enough to trust an experiment result? If your sample is too small, random noise can look like a breakthrough. If your sample is too large, you spend time and traffic that could have gone to a better variation. Sample size planning is the bridge between a good experiment idea and a statistically defensible decision.
At its core, A/B test sample size depends on five practical assumptions: your baseline conversion rate, the minimum detectable effect you care about, your significance level, your desired power, and your traffic allocation. Change any one of those inputs and the answer can move dramatically. A tiny expected lift often requires a very large sample. For the same relative lift, a lower baseline also pushes required traffic higher, because conversions are rarer and harder to distinguish from chance variation.
The calculator above uses a standard two-proportion framework, which is the most common setup for website conversion experiments. In plain English, it estimates how many users each variant needs so you can confidently detect a difference between the current conversion rate and an improved version. Most teams use a two-sided test, a 95 percent confidence threshold, and 80 percent power, but there are valid reasons to adjust these settings depending on business risk and experimentation maturity.
The key inputs that control A/B testing sample size
- Baseline conversion rate: This is your current performance, such as a 5 percent signup rate. You should use recent, stable historical data.
- Minimum detectable effect: Often shortened to MDE, this is the smallest lift that would justify a product or marketing change. It might be a 10 percent relative uplift or an absolute increase of 0.5 percentage points.
- Significance level: Commonly set at 0.05. This controls the probability of a false positive, also called a Type I error.
- Power: Commonly set at 0.80. This is your probability of detecting a true effect if the effect really exists.
- Traffic split: Most tests use a 50/50 split because balanced allocation is statistically efficient.
Here is why this matters in practice. Suppose your baseline conversion rate is 5 percent and you care about detecting a 10 percent relative lift, which means detecting a rise to 5.5 percent. That difference sounds small, but small differences demand substantial traffic. Many teams underestimate this and end tests after only a few days, then act on unstable results. Sample size planning prevents that mistake before the test starts.
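The conversion between a relative or absolute MDE and the implied treatment rate can be sketched in a few lines. This is a minimal illustration, not the calculator's own code; the function name `treatment_rate` and its parameters are hypothetical.

```python
def treatment_rate(baseline: float, mde: float, relative: bool = True) -> float:
    """Return the conversion rate the variant must reach.

    baseline: control conversion rate as a fraction (0.05 means 5 percent).
    mde: minimum detectable effect, either a relative lift (0.10 means +10 percent)
         or an absolute increase in rate (0.005 means +0.5 percentage points).
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be strictly between 0 and 1")
    target = baseline * (1 + mde) if relative else baseline + mde
    if not 0 < target < 1:
        raise ValueError("implied treatment rate must be strictly between 0 and 1")
    return target


# The example from the text: 5 percent baseline, 10 percent relative lift.
print(f"{treatment_rate(0.05, 0.10):.4f}")                   # relative MDE
print(f"{treatment_rate(0.05, 0.005, relative=False):.4f}")  # same target as an absolute MDE
```

Both calls describe the same target of 5.5 percent, which is why it helps to state explicitly whether an MDE is relative or absolute.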
The formula behind the calculator
For conversion rate experiments, the standard approximation for required visitors per variant compares two proportions: the control rate and the expected treatment rate. The formula combines the critical value associated with your alpha level and the critical value associated with your chosen power. It also accounts for the baseline proportion and the target treatment proportion. While you do not need to memorize the full equation, you should understand the implications:
- A smaller effect size produces a larger required sample.
- Higher confidence produces a larger required sample.
- Higher power produces a larger required sample.
- Balanced traffic allocation usually minimizes required total traffic.
- Very low baseline rates can create long test durations because conversions are sparse.
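For reference, one common form of this approximation is shown below. It assumes equal allocation and unpooled variances; the exact variant a given calculator uses may differ slightly (for example, a pooled-variance form), so treat it as a sketch of the idea rather than a definitive specification.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{\left(p_2 - p_1\right)^{2}}
```

Here n is the number of visitors per variant, p_1 is the baseline conversion rate, p_2 is the expected treatment rate, alpha is the significance level, and 1 - beta is the power. For a one-sided test, z_{1-\alpha/2} is replaced by z_{1-\alpha}.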
| Standard setting | Typical value | Interpretation | Z value used |
|---|---|---|---|
| Confidence level | 95% | Accepts a 5% false positive rate | 1.96 for two-sided tests |
| Confidence level | 99% | More conservative decision threshold | 2.58 for two-sided tests |
| Power | 80% | 20% chance of missing a true effect | 0.84 |
| Power | 90% | Lower false negative risk | 1.28 |
These values are not arbitrary. A two-sided 95 percent confidence threshold corresponds to a z critical value of about 1.96, while 80 percent power corresponds to about 0.84. In combination, these thresholds shape how much evidence you need before calling a winner. If the business cost of shipping a false winner is high, such as pricing, legal copy, or checkout flow changes, teams may use stricter settings. If the test is exploratory and low risk, they may keep standard defaults.
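The z values in the table can be reproduced from the standard normal distribution, and combined into a per-variant estimate. The sketch below uses Python's standard library and an unpooled two-proportion approximation with equal allocation; the function names are illustrative, and the calculator on this page may use a slightly different variant of the formula.

```python
import math
from statistics import NormalDist


def z_critical(alpha: float = 0.05, power: float = 0.80, two_sided: bool = True):
    """Return (z_alpha, z_power) from the standard normal distribution."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_power = NormalDist().inv_cdf(power)
    return z_alpha, z_power


def visitors_per_variant(p1: float, p2: float, alpha: float = 0.05,
                         power: float = 0.80, two_sided: bool = True) -> int:
    """Unpooled two-proportion approximation, equal allocation assumed."""
    z_alpha, z_power = z_critical(alpha, power, two_sided)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Reproduce the z values for the standard and the stricter settings.
for conf, pw in [(0.95, 0.80), (0.99, 0.90)]:
    za, zp = z_critical(alpha=1 - conf, power=pw)
    print(f"confidence {conf:.0%}: z = {za:.2f}; power {pw:.0%}: z = {zp:.2f}")

# Visitors per variant for a 5 percent baseline and a 5.5 percent target.
print(visitors_per_variant(0.05, 0.055))
```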
Example calculation with realistic numbers
Imagine an ecommerce site with a baseline purchase rate of 4 percent. The team wants to detect at least a 12.5 percent relative uplift, which means an increase to 4.5 percent. They choose alpha at 0.05 and power at 0.80. Under these assumptions, the standard two-proportion approximation gives roughly 25,500 visitors per variant, or about 51,000 in total. If the site can send 10,000 eligible users per day into the test at a 50/50 split, the raw traffic target is reached in about five to six days. In practice the experiment should still run one to two full weeks so that it covers complete weekday and weekend cycles and is not thrown off by unstable traffic quality.
Now compare that to a baseline of 20 percent and the same relative uplift. A shift from 20 percent to 22.5 percent is much larger in absolute percentage points than a shift from 4 percent to 4.5 percent. Because the gap between control and treatment is easier to detect, the sample size requirement drops materially: from roughly 25,500 visitors per variant to roughly 4,200 under the same alpha and power. This is why teams should always think in both relative and absolute terms. Relative lifts sound intuitive, but the absolute gap is what ultimately drives detectability.
| Baseline rate | Target uplift | Treatment rate | Approx. visitors per variant | Approx. total visitors |
|---|---|---|---|---|
| 3.0% | 10% relative | 3.3% | about 53,000 | about 106,000 |
| 5.0% | 10% relative | 5.5% | about 31,000 | about 62,000 |
| 10.0% | 10% relative | 11.0% | about 14,700 | about 29,400 |
| 20.0% | 10% relative | 22.0% | about 6,500 | about 13,000 |
Those approximate figures assume a balanced two-variant test, 95 percent confidence, and 80 percent power. They are directionally useful because they reveal a common pattern: as the baseline rate rises, a fixed relative uplift often becomes easier to detect in absolute terms, reducing the required sample. This is one reason low conversion funnels, like demo requests or purchases on expensive items, usually need much longer tests than higher frequency micro-conversions like button clicks or email captures.
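A table like this can be regenerated with a short script. The sketch below assumes the unpooled two-proportion approximation, a two-sided alpha of 0.05, 80 percent power, and equal allocation; other variants of the formula and different rounding will shift the figures somewhat.

```python
import math
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.975)  # two-sided test, alpha = 0.05
Z_POWER = NormalDist().inv_cdf(0.80)   # 80 percent power


def visitors_per_variant(baseline: float, relative_lift: float) -> int:
    """Visitors needed in each variant to detect the given relative lift."""
    treatment = baseline * (1 + relative_lift)
    variance = baseline * (1 - baseline) + treatment * (1 - treatment)
    return math.ceil((Z_ALPHA + Z_POWER) ** 2 * variance / (treatment - baseline) ** 2)


# Same 10 percent relative lift at four different baselines.
rows = [(b, visitors_per_variant(b, 0.10)) for b in (0.03, 0.05, 0.10, 0.20)]
for baseline, n in rows:
    print(f"{baseline:>5.1%} -> {baseline * 1.10:.1%}: {n:>7,} per variant, {2 * n:>8,} total")
```

The required sample falls steadily as the baseline rises, because the same relative lift corresponds to a larger absolute gap.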
Why many A/B tests are underpowered
One of the biggest reasons teams get misleading results is that they stop tests too early. A few days of noisy data may show a temporary 20 percent lift, but that does not mean the real effect is 20 percent. Short tests are especially vulnerable to weekday and weekend shifts, campaign mix changes, seasonality, and returning user behavior. Underpowered tests tend to overestimate winners and produce disappointing post-launch outcomes.
Another common problem is choosing an unrealistic MDE. If your team says it only cares about a 1 percent relative improvement at a 2 percent baseline, the required sample runs to several million visitors per variant at standard settings, which is impractical for most sites. In that case, the right answer is not to ignore statistics. The right answer is to rethink the test. Perhaps you should target a bigger design change, narrow the audience to a higher intent segment, or measure a more frequent leading indicator before validating on the final business metric.
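One way to sanity-check an MDE is to turn the question around: given the traffic you can realistically collect, what is the smallest lift the test could detect? The sketch below does this by bisection over the same unpooled approximation. The function names and the 50,000-visitor budget are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Two-sided, equal-allocation, unpooled two-proportion approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    treatment = baseline * (1 + relative_lift)
    variance = baseline * (1 - baseline) + treatment * (1 - treatment)
    return math.ceil(z ** 2 * variance / (treatment - baseline) ** 2)


def smallest_detectable_lift(baseline, budget_per_variant, alpha=0.05, power=0.80):
    """Bisect for the smallest relative lift whose required sample fits the budget."""
    low, high = 1e-6, (1 / baseline) - 1 - 1e-9  # keep the treatment rate below 1
    if visitors_per_variant(baseline, high, alpha, power) > budget_per_variant:
        raise ValueError("budget too small to detect any lift at this baseline")
    for _ in range(100):
        mid = (low + high) / 2
        if visitors_per_variant(baseline, mid, alpha, power) > budget_per_variant:
            low = mid    # lift too small to detect with this budget
        else:
            high = mid   # detectable, so try a smaller lift
    return high


# The unrealistic case from the text: 1 percent relative lift at a 2 percent baseline.
print(f"{visitors_per_variant(0.02, 0.01):,} visitors per variant")

# Hypothetical budget of 50,000 visitors per variant at the same baseline.
print(f"{smallest_detectable_lift(0.02, 50_000):.1%} smallest detectable relative lift")
```

If the detectable lift that comes back is larger than anything the change could plausibly deliver, that is a sign to redesign the test rather than run it.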
How to choose a sensible minimum detectable effect
The best MDE is not the smallest lift you can imagine. It is the smallest lift worth shipping after considering implementation cost, downside risk, and expected business value. For example:
- If a new checkout design requires major engineering effort, you may want a larger minimum lift before rollout.
- If a homepage headline test is cheap and reversible, you may accept a smaller MDE.
- If the experiment affects pricing, trust signals, or compliance-sensitive content, you may want stricter thresholds.
One practical method is to convert the lift into revenue or lead value. If a 5 percent relative increase on your baseline conversion rate only adds a negligible amount of expected value, then designing a test to detect that tiny effect may not be worth the traffic. Good experiment design starts with economic materiality, not only statistical possibility.
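As a rough illustration of that conversion, the sketch below multiplies out the extra conversions a lift would produce and their value. All inputs are hypothetical, and the calculation assumes the lift is real and persists for the whole year, which is a strong assumption.

```python
def annual_value_of_lift(annual_visitors: int, baseline: float,
                         relative_lift: float, value_per_conversion: float) -> float:
    """Extra value per year if the lift is real and persists (a strong assumption)."""
    extra_conversions = annual_visitors * baseline * relative_lift
    return extra_conversions * value_per_conversion


# Hypothetical inputs: 1,000,000 visitors a year, 5 percent baseline,
# 5 percent relative lift, 40 currency units per conversion.
value = annual_value_of_lift(1_000_000, 0.05, 0.05, 40.0)
print(f"{value:,.0f}")
```

Comparing this figure with the cost of building and maintaining the change gives a quick read on whether the MDE is worth designing a test around.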
One-sided vs two-sided tests
Most experimentation programs should default to two-sided tests. A two-sided setup asks whether the variant is different from control, better or worse. This is more conservative and protects you when a change unexpectedly harms performance. A one-sided test asks only whether the variant improves the metric, and it reduces the required sample size by roughly 20 percent at an alpha of 0.05 and 80 percent power. However, one-sided tests should be used carefully, and the choice should be pre-registered before looking at data. Switching to one-sided after seeing results is poor statistical practice.
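The size of the difference is easy to check. The sketch below compares the two setups for a 5 percent baseline and a 5.5 percent target, again under the unpooled two-proportion approximation with equal allocation; the function name is illustrative.

```python
import math
from statistics import NormalDist


def visitors_per_variant(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Unpooled two-proportion approximation, equal allocation assumed."""
    tail = alpha / 2 if two_sided else alpha  # one-sided puts all of alpha in one tail
    z = NormalDist().inv_cdf(1 - tail) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


two = visitors_per_variant(0.05, 0.055, two_sided=True)
one = visitors_per_variant(0.05, 0.055, two_sided=False)
print(f"two-sided: {two:,}  one-sided: {one:,}  saving: {1 - one / two:.0%}")
```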
Interpreting duration estimates
The calculator converts required total visitors into an estimated number of days based on your eligible daily traffic and chosen split. Treat this as a planning estimate, not a hard promise. In real experiments, duration should also account for:
- Full business cycles, including weekday and weekend behavior.
- Campaign calendars and promotions that can distort traffic quality.
- Cookie loss, bot filtering, and analytics latency.
- Segment exclusions and experiment overlap rules.
- Practical monitoring windows for implementation bugs and guardrail metrics.
As a rule of thumb, many teams avoid ending a test before at least one or two full business cycles have passed, even if the raw traffic target is reached quickly. This protects against falsely declaring significance because of temporary traffic composition changes.
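A duration estimate along these lines can be sketched as follows. The function and its parameters are hypothetical: `share_in_test` stands for the fraction of eligible traffic actually enrolled in the experiment, and rounding up to whole weeks is one simple way to respect full business cycles.

```python
import math


def estimated_days(total_visitors_needed: int, eligible_daily_visitors: int,
                   share_in_test: float = 1.0, round_to_full_weeks: bool = True) -> int:
    """Days needed to collect the total sample across all variants."""
    daily_in_test = eligible_daily_visitors * share_in_test
    if daily_in_test <= 0:
        raise ValueError("daily traffic entering the test must be positive")
    days = math.ceil(total_visitors_needed / daily_in_test)
    if round_to_full_weeks:
        days = math.ceil(days / 7) * 7  # round up to complete weekly cycles
    return days


# Hypothetical plan: 80,000 visitors needed in total, 10,000 eligible per day.
print(estimated_days(80_000, 10_000, round_to_full_weeks=False))  # raw traffic target
print(estimated_days(80_000, 10_000))                             # rounded up to full weeks
print(estimated_days(80_000, 10_000, share_in_test=0.5))          # only half of traffic enrolled
```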
What authoritative statistical sources say
If you want a deeper statistical foundation, review the guidance from the NIST/SEMATECH e-Handbook of Statistical Methods, which covers hypothesis testing and design principles used across applied statistics. For educational explanations of power, significance, and inference, Penn State’s online statistics program is also useful at online.stat.psu.edu. If your experiments affect health, public services, or regulated communication, evidence standards are especially important, and the National Institutes of Health provides extensive research resources on study design and interpretation.
Common mistakes when calculating A/B test sample size
- Using stale baseline data: If your baseline comes from a different season, campaign mix, or audience, the estimate can be wrong from day one.
- Mixing relative uplift with absolute uplift: A 10 percent relative lift is not the same as a 10 percentage point increase.
- Ignoring power: Confidence gets more attention, but power is what protects you from false negatives.
- Changing metrics mid-test: Sample size was planned around a defined metric. Swapping the primary metric undermines validity.
- Peeking without a plan: Repeatedly checking significance and stopping early inflates false positive risk.
- Overlooking segmentation: If you care about mobile users only, calculate on mobile traffic, not total site traffic.
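The cost of peeking without a plan can be illustrated with a small simulation. The sketch below is a simplified model, not a full experiment simulator: it assumes there is no true effect, that each look adds an equally sized batch of data, and that the test statistic is approximately normal. It then counts how often a test would be stopped on a false winner.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(looks: int, simulations: int = 4000,
                        alpha: float = 0.05, seed: int = 7) -> float:
    """Share of A/A tests declared significant at any of `looks` equally spaced checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # evidence from one more batch, no true effect
            z = running_sum / math.sqrt(k)      # z statistic on all data collected so far
            if abs(z) > z_crit:
                hits += 1                       # stopped early on a false winner
                break
    return hits / simulations


print(f"one look:  {false_positive_rate(1):.1%}")
print(f"ten looks: {false_positive_rate(10):.1%}")
```

With a single planned look the false positive rate stays near the nominal 5 percent, while checking ten times and stopping at the first significant result pushes it several times higher.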
Best practices for teams running serious experiments
Professional experimentation programs document assumptions before launch. They define the primary metric, baseline period, MDE rationale, significance threshold, power, traffic allocation, expected duration, and guardrail metrics. They also predefine rules for excluding low quality traffic and for handling implementation issues. This level of discipline makes results more credible and easier to socialize across stakeholders.
It is also smart to pair statistical significance with business significance. A tiny uplift can become statistically significant if you have enough traffic, but that does not always mean it matters. Likewise, a meaningful economic lift may fail to reach significance if the test was underpowered. The best decision process looks at both the confidence of the result and the value of the change.
Final takeaway
To calculate sample size for A/B testing well, start with realistic assumptions, not optimistic guesses. Use recent baseline conversion data, choose an MDE that matters to the business, keep confidence and power aligned with risk, and estimate duration using truly eligible traffic. If the sample size is too large to be practical, that is a signal to redesign the test, narrow the audience, or choose a more impactful change. Good experimentation is not just about finding winners. It is about making decisions that are both statistically defensible and commercially useful.
This page provides planning estimates for educational and operational use. For highly regulated environments or complex multi-variant designs, consult a statistician or experimentation specialist.