“We ran the test for a week, got a 12% lift, and shipped it. Three months later, revenue was flat. What happened?”
This is the most common A/B testing failure story in product and marketing teams. The test looked like a clear winner. The dashboard showed green arrows. But the result was a mirage - a statistical artifact produced by checking results too early on too small a sample. The team did not calculate the sample size they actually needed before starting the test, so they had no way to know whether the data was signal or noise.
Sample size calculation is not an optional step in experimentation. It is the foundation that determines whether your test can detect a real effect or whether you are essentially flipping a coin and reading meaning into the outcome. This guide walks through exactly how to calculate the right sample size, what inputs matter, and the mistakes that cause most teams to get it wrong.
Why Sample Size Matters
Every A/B test is a statistical experiment. You are trying to determine whether the difference in behavior between two groups is caused by the change you made or is simply random variation. The sample size determines your test’s ability to distinguish between these two possibilities.
The False Positive Problem
With a small sample, random fluctuations look like real effects. A test with 200 visitors per variant will show “significant” differences constantly - not because the variants are different, but because 200 data points are not enough to average out natural variation. This is why teams that eyeball results after a few days frequently ship changes that have zero actual impact.
The statistical term for this is a Type I error - concluding there is an effect when there is not one. The standard acceptable rate is 5%, meaning one in twenty tests will produce a false positive even when everything is done correctly. But without proper sample size planning, the real false positive rate can be 20%, 30%, or higher.
The False Negative Problem
The less discussed but equally damaging mistake is the false negative - concluding there is no effect when there actually is one. If your sample is too small to detect a 5% improvement, you might discard a genuinely valuable change because the test “failed to reach significance.” This is a Type II error, and it quietly kills your optimization program by making you abandon ideas that actually work.
The relationship between these two errors is mediated by sample size. More data gives you the power to detect smaller effects while maintaining a low false positive rate. The question is: how much data is enough?
The Inputs You Need
A sample size calculation requires three inputs, plus a significance level that is conventionally fixed at 5%. Each of the three represents a decision you need to make about what your test is designed to detect.
1. Baseline Conversion Rate
This is the current conversion rate for the metric you are testing. If your checkout completion rate is 3.2%, that is your baseline. The baseline matters because it affects the variability in your data - metrics close to 0% or 100% have less variability than metrics near 50%, which changes how many observations you need.
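The claim about variability follows from the variance of a binary outcome, which is p(1 - p) and peaks at 50%. A quick sketch (baseline rates here are illustrative):

```python
# Per-visitor variance of a binary conversion metric is p * (1 - p):
# baselines near 0% or 100% vary less than baselines near 50%.
variance = {p: p * (1 - p) for p in (0.03, 0.10, 0.50, 0.90)}
for p, v in variance.items():
    print(f"baseline {p:.0%}: per-visitor variance = {v:.4f}")
```

This is why the baseline rate feeds directly into the sample size formula: the noisier each observation is, the more observations you need.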
Get this number from your analytics platform, not from memory or rough estimates. In KISSmetrics reports, you can pull the exact conversion rate for any event over your desired time period. Use at least 30 days of data to account for weekly cycles and seasonal variation.
2. Minimum Detectable Effect (MDE)
The MDE is the smallest improvement you want your test to be able to detect. This is a business decision, not a statistical one. If a 2% relative improvement in checkout conversion would generate $50,000 in additional annual revenue, and your test costs $5,000 in engineering time to implement, then detecting a 2% improvement is worth it. If the revenue impact of a 2% improvement is only $500, you should set a higher MDE threshold.
The smaller the MDE, the larger the sample size required - and the relationship is quadratic, so halving the MDE quadruples the sample. Detecting a 1% relative improvement might require 500,000 visitors per variant; detecting a 10% relative improvement, only 5,000 (exact figures depend on your baseline rate). Teams that set unrealistically small MDEs end up running tests for months without reaching significance, which demoralizes the team and blocks the testing roadmap.
3. Statistical Power
Power is the probability that your test will detect a real effect of the specified MDE or larger. The industry standard is 80%, meaning your test has an 80% chance of detecting a real effect and a 20% chance of missing it. Some teams use 90% for high-stakes tests, which requires a larger sample.
Combined with the standard 5% significance level, 80% power means: if the true effect is at least as large as your MDE, you will correctly detect it 80% of the time, and you will incorrectly declare a winner when there is no real effect only 5% of the time. These are reasonable trade-offs for most business decisions.
Putting It Together
With these inputs, the sample size formula (based on the normal approximation for proportions) produces the number of observations you need per variant. For a baseline conversion rate of 5%, an MDE of 10% relative (meaning you want to detect an improvement from 5.0% to 5.5%), and 80% power, you need approximately 31,000 visitors per variant, or about 62,000 in total. At 1,000 visitors per day split evenly between two variants, each variant accrues 500 visitors per day, so the test will take roughly 62 days - about two months.
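The calculation above can be sketched with Python's standard library. This is the standard normal-approximation formula for comparing two proportions; the function name and defaults are ours, not from any particular tool:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided test of two proportions
    (normal approximation). `relative_mde` is e.g. 0.10 for a 10% lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

print(sample_size_per_variant(0.05, 0.10))  # ~31,000 visitors per variant
```

Online calculators implement essentially this formula (some with minor corrections), so expect results in the same ballpark rather than identical figures.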
Knowing this number before you start is critical. It tells you whether the test is feasible given your traffic, how long it will take, and when you can make a decision. Without it, you are guessing.
Common Mistakes
Even teams that understand the theory make predictable errors in practice. These mistakes do not just reduce test quality - they systematically bias results toward false positives, which means teams ship changes that do not actually work.
Stopping Early When Results Look Good
This is the most damaging mistake in A/B testing. You launch a test on Monday. By Wednesday, variant B is up 15% with a p-value of 0.03. The PM is excited. The engineer wants to move on to the next project. Everyone agrees to “call it.”
The problem is that early results are wildly unstable. If you check a test with a true null effect (no real difference) every day and stop as soon as p < 0.05, you will “find” a significant result about 30% of the time - six times the expected false positive rate. The math behind this is well-established: it is called the problem of optional stopping, and it invalidates the statistical framework the test relies on.
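You can watch this inflation happen in a simulation. The sketch below runs many A/A tests (both arms identical, so every "winner" is a false positive) and compares daily peeking against a single look at the planned end date. All parameters - trial counts, traffic, conversion rate - are illustrative:

```python
import random
from statistics import NormalDist

def p_value(c_a, c_b, n):
    """Two-sided p-value for a two-proportion z-test with equal n per arm."""
    pooled = (c_a + c_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return 1.0
    z = (c_a / n - c_b / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
TRIALS, DAYS, DAILY, RATE = 500, 10, 200, 0.3  # true null: arms are identical
peeking_hits = single_look_hits = 0
for _ in range(TRIALS):
    c_a = c_b = n = 0
    stopped = False
    for _ in range(DAYS):
        c_a += sum(random.random() < RATE for _ in range(DAILY))
        c_b += sum(random.random() < RATE for _ in range(DAILY))
        n += DAILY
        if not stopped and p_value(c_a, c_b, n) < 0.05:
            peeking_hits += 1          # "winner" declared at a daily peek
            stopped = True
    if p_value(c_a, c_b, n) < 0.05:    # one look at the planned end date
        single_look_hits += 1

print(f"false positive rate, daily peeking: {peeking_hits / TRIALS:.0%}")
print(f"false positive rate, single look:  {single_look_hits / TRIALS:.0%}")
```

The single-look rate hovers around the nominal 5%, while the peeking rate is several times higher - and it climbs further the more often and the longer you peek.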
Peeking Without Correction
Peeking is a milder form of early stopping. Instead of stopping the test, you just “check in” on the results periodically. The problem is the same: each peek is an implicit decision point, and each decision point inflates your false positive rate. If you are going to peek, use sequential testing methods (like the sequential probability ratio test or always-valid p-values) that adjust for multiple looks. KISSmetrics tracking gives you the underlying event data to run these analyses, but the statistical framework must account for the peeking. Our A/B test reports guide covers this in more detail.
Ignoring Multiple Comparisons
If you test one metric and get p < 0.05, there is a 5% false positive chance. If you test twenty metrics and highlight the one with p < 0.05, there is a 64% chance at least one metric will show a false positive. This is the multiple comparisons problem, and it is pervasive in A/B testing. Teams often instrument dozens of metrics, and the narrative becomes “conversion did not change, but time on page improved significantly” - which is almost certainly noise.
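The 64% figure comes directly from the probability of at least one false positive across independent looks, and the simplest correction (Bonferroni) just divides the threshold by the number of metrics:

```python
alpha, metrics = 0.05, 20
# Chance of at least one false positive across 20 independent metrics
fwer = 1 - (1 - alpha) ** metrics
# Bonferroni correction: tighten the per-metric threshold
bonferroni_alpha = alpha / metrics
print(f"family-wise error rate across {metrics} metrics: {fwer:.0%}")  # ~64%
print(f"Bonferroni per-metric threshold: {bonferroni_alpha:.4f}")
```

This assumes the metrics are independent; correlated metrics inflate the error rate somewhat less, but the qualitative problem is the same.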
The fix: declare your primary metric before the test starts. Use secondary metrics for context and hypothesis generation, not for declaring winners. Our guide on multiple comparison testing explains correction methods for when you must evaluate multiple metrics.
Not Accounting for Variance in Revenue Metrics
Revenue-per-visitor has much higher variance than binary conversion metrics. A single large purchase can swing the average dramatically. This means revenue tests require significantly larger samples than conversion tests - often 5 to 10 times larger. If you calculate sample size based on conversion rate and then try to measure revenue impact, you will not have enough data for a reliable result.
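The gap is visible in the sample size formula for comparing means, where required n scales with the variance. The sketch below uses illustrative numbers (a $3.00 revenue-per-visitor mean with a heavy-tailed $25 standard deviation, versus a 5% conversion rate), both targeting a 10% relative lift:

```python
import math
from statistics import NormalDist

def n_per_variant_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-variant n to detect a difference `delta` in means, given per-visitor
    standard deviation `sigma` (normal approximation, equal variances)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * z / delta) ** 2)

# Revenue: sd $25, detect a $0.30 lift on a $3.00 mean (illustrative numbers)
print(n_per_variant_means(sigma=25.0, delta=0.30))
# Conversion: sd = sqrt(0.05 * 0.95) ~ 0.218, detect a 0.5-point lift on 5%
print(n_per_variant_means(sigma=0.218, delta=0.005))
```

With these assumptions the revenue test needs several times the sample of the conversion test; real revenue distributions are often more skewed than a single standard deviation captures, which pushes the gap wider still.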
Low-Traffic Strategies
Not every site has millions of monthly visitors. If you have 5,000 visitors per month, a test requiring 30,000 per variant would take a year. That is not practical. But low traffic does not mean you cannot test - it means you need to be strategic about what and how you test.
Increase Your MDE Threshold
Instead of trying to detect a 5% improvement, design tests aimed at detecting a 20% or 30% improvement. Understanding your conversion rate benchmarks helps you set realistic expectations for what constitutes a meaningful lift. This dramatically reduces the required sample size. The trade-off is that you will miss smaller effects, but if your traffic is low, the changes you test should be bold enough to produce large effects anyway. Small copy tweaks are not appropriate for low-traffic testing; fundamental redesigns are.
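A common back-of-envelope version of the sample size formula, n ≈ 16 · p(1 - p) / δ² (valid for 80% power and a 5% two-sided significance level), makes the payoff of a larger MDE concrete. The 5% baseline here is illustrative:

```python
baseline = 0.05  # illustrative 5% baseline conversion rate
required_n = {}
for relative_mde in (0.05, 0.10, 0.20, 0.30):
    delta = baseline * relative_mde  # absolute lift you want to detect
    # Rule-of-thumb approximation for 80% power, alpha = 0.05
    required_n[relative_mde] = round(16 * baseline * (1 - baseline) / delta ** 2)
for mde, n in required_n.items():
    print(f"{mde:.0%} relative MDE -> ~{n:,} visitors per variant")
```

Moving from a 5% MDE to a 30% MDE cuts the required sample by more than an order of magnitude, which is what makes bold changes testable on low traffic.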
Use Composite Metrics
Instead of measuring a single low-frequency event (like purchases), create a composite metric that combines multiple signals of engagement. A metric like “added to cart OR started checkout OR completed purchase” will have a higher baseline rate and lower variance, which means you need fewer observations to detect a change. KISSmetrics custom metrics let you define these composite measures directly.
Run Fewer, Bigger Tests
Instead of running five small tests simultaneously (splitting your already limited traffic five ways), run one test at a time with your full traffic allocated to it. This maximizes the speed at which each test reaches significance and prevents the confusion of overlapping experiments that can interact with each other.
Consider Bayesian Methods
Bayesian A/B testing frameworks do not require fixed sample sizes or stopping rules. Instead, they continuously update the probability that each variant is better and report the probability of each variant being the winner at any point. This can be more practical for low-traffic sites because you can make decisions whenever the probability reaches an acceptable threshold. The trade-off is that Bayesian methods require more statistical sophistication to implement and interpret correctly.
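A minimal sketch of the Bayesian approach for conversion rates uses Beta-Binomial conjugacy: put a Beta(1, 1) prior on each variant's rate, update with observed counts, and estimate the probability that B's rate exceeds A's by sampling from the two posteriors. The counts and function name below are illustrative:

```python
import random

random.seed(7)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions)
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Illustrative counts: A converted 120 of 2,000 visitors, B converted 145 of 2,000
print(f"P(B beats A) = {prob_b_beats_a(120, 2000, 145, 2000):.2f}")
```

A team might ship B once this probability crosses a pre-agreed threshold such as 95%. Note that naive continuous monitoring of this probability has its own (milder) inflation issues, so the decision threshold and cadence should still be set in advance.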
Leverage Qualitative Data
When quantitative testing is impractical, supplement with qualitative methods. User behavior analytics - session recordings, heatmaps, user interviews - can help you understand why users behave the way they do, even without the statistical power to measure the size of an effect. Use qualitative insights to inform high-conviction changes that you ship without A/B testing, and reserve your limited testing capacity for decisions where the direction of the effect is genuinely uncertain.
How Do You Determine the Right Sample Size for an A/B Test?
Start with your baseline conversion rate, decide on the minimum detectable effect that would justify implementation, and set your desired power (typically 80%) and significance level (typically 5%). Plug these into a sample size calculator. The result tells you the number of visitors needed per variant. Divide by your daily traffic to get the test duration. Never launch a test without this calculation - it is the single most important step in preventing unreliable results.
What Is the Difference Between Type I and Type II Errors?
A Type I error (false positive) occurs when you conclude a variation is better when it actually is not - you ship a change that has zero real impact. A Type II error (false negative) occurs when you miss a real improvement because your sample was too small to detect it - you discard a change that would have helped. The significance level (alpha) controls the Type I error rate, while statistical power controls the Type II error rate. Balancing these two error types is a business decision: the right trade-off depends on the cost of implementing a dud change versus the cost of missing a real improvement.
What Is Power Analysis, and When Should You Use It?
Power analysis determines the sample size needed to detect an effect of a given size with a specified probability. Use it before every A/B test to ensure you have enough traffic, after a non-significant result to understand whether the test was adequately powered, and when planning your testing roadmap to estimate how many tests you can run per quarter. A power of 80% means you have an 80% chance of detecting a real effect - accepting a 20% chance of missing it. For high-stakes tests on revenue metrics, consider increasing power to 90%.
Key Takeaways
Sample size calculation is not a bureaucratic hurdle - it is the mechanism that separates real learning from self-deception in experimentation. Every test you run without calculating the required sample size is a test whose results you cannot fully trust.
The teams that get the most value from experimentation are not the ones running the most tests - they are the ones that trust their results because they did the math before they started.
Continue Reading
A/B Testing Statistical Significance: When to Call a Winner
Calling a test winner too early is the most common A/B testing mistake. This guide explains statistical significance in plain language and shows you exactly when it is safe to make a decision.
Running Multiple A/B Tests? How to Avoid False Positives From Multiple Comparisons
Run 20 tests and at least one will show a false positive at 95% confidence. This guide explains the multiple comparisons trap and gives you practical methods to avoid it.
Introduction to A/B Testing: How to Run Experiments That Actually Work
A/B testing is the most reliable way to improve conversion rates. But most tests fail because of poor methodology, not poor ideas. This guide shows you how to run tests that produce trustworthy results.