Statistical Power
The probability that a test will correctly detect a real effect when one exists, typically set at 80% as a minimum standard. Higher power means a lower chance of missing genuine improvements.
Also known as: test power, 1 minus beta
Formula
n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
Why It Matters
Statistical power is the experimentation equivalent of sensitivity. A test with 80% power will detect a real improvement 80% of the time and miss it 20% of the time. A test with 50% power is a coin flip - you might as well not run it.
Power analysis should happen before every experiment, not after. It tells you how many users you need and how long the test needs to run to have a reasonable chance of detecting the effect you care about. Without this step, you are guessing whether your test will be informative.
Low-powered tests are worse than no test at all. They consume traffic (opportunity cost), consume time (team attention), and produce inconclusive results that do not improve decision-making. If you cannot achieve adequate power for a test, you are better off spending those resources elsewhere.
How to Calculate
Power is a function of four variables: sample size, effect size (the minimum improvement you want to detect), significance level (alpha), and baseline variance. Increasing sample size, effect size, or alpha raises power; higher baseline variance lowers it. Online power calculators and built-in experimentation tools handle the math. The standard minimum is 80% power, which corresponds to a 20% chance of a Type II error.
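Given a planned sample size, the achieved power can be sketched by inverting the sample size formula above. This is a minimal illustration for a two-sided, two-proportion z-test using only Python's standard library; the function name is ours, not part of any experimentation tool:

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(n, p1, p2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    with n users per group, baseline rate p1, and variant rate p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    # Standardized difference between the two proportions at sample size n
    z_effect = abs(p2 - p1) * sqrt(n / (p1 * (1 - p1) + p2 * (1 - p2)))
    # Power is the probability the test statistic clears the critical value
    return z.cdf(z_effect - z_alpha)
```

Plugging in a planned sample size before launch shows immediately whether the test is a coin flip (power near 50%) or adequately sensitive (80% or above).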
Required Sample Size (per group, for proportions) Calculator
n = (Z_alpha/2 + Z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
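The calculator formula above can be implemented in a few lines with Python's standard library. This is a sketch, not a KISSmetrics feature; the function name and defaults (two-sided alpha of 0.05, 80% power) are illustrative:

```python
from math import ceil
from statistics import NormalDist


def required_n(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect a change from baseline rate p1
    to variant rate p2 in a two-sided two-proportion test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # Z_alpha/2
    z_beta = z.inv_cdf(power)           # Z_beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For example, detecting a lift from a 10% baseline to 11% (a 10% relative improvement) requires roughly 14,700 users per variant, which illustrates why subtle changes on low-traffic pages are so hard to test.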
Industry Applications
A mid-size retailer with 50,000 monthly visitors calculates that detecting a 5% relative lift in their 3% checkout conversion (3.00% to 3.15%) requires roughly 208,000 visitors per variant at 80% power - more than eight months of traffic. They decide instead to test a bolder redesign expected to produce at least a 15% relative lift, cutting the requirement to about 24,000 visitors per variant and the duration to roughly four weeks.
A B2B platform with only 500 new trials per month realizes they need 4,200 per variant to detect a 10% improvement in activation at 80% power. Instead of running an underpowered test, they focus on qualitative user testing and make changes based on direct user feedback.
How to Track in KISSmetrics
Before launching experiments in KISSmetrics, use a sample size calculator to determine the traffic needed for 80% power at your desired minimum detectable effect. Compare this against your actual traffic volume to estimate test duration. If the required duration exceeds 4-6 weeks, consider testing a bolder change (larger expected effect) or focusing on higher-traffic pages.
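The duration check described above can be sketched as a small helper. The function name, the two-variant default, and the 100% traffic allocation are assumptions for illustration, not KISSmetrics settings:

```python
from math import ceil


def weeks_to_run(n_per_variant, weekly_traffic, variants=2, allocation=1.0):
    """Estimate test duration in whole weeks, given the required
    per-variant sample size and the share of traffic in the test."""
    total_needed = n_per_variant * variants
    return ceil(total_needed / (weekly_traffic * allocation))
```

If the result exceeds 4-6 weeks, that is the signal to test a bolder change or move the experiment to a higher-traffic page, as described above.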
Common Mistakes
- Not calculating power before the test, leading to underpowered experiments that waste time and traffic
- Confusing power with confidence level - they control different types of errors
- Setting power too high (99%), which requires enormous sample sizes and very long test durations
- Ignoring the relationship between effect size and power - smaller effects need much larger samples to detect
Pro Tips
- Use 80% power as your default and reserve 90% power for high-stakes tests where missing a real effect is very costly
- When traffic is limited, increase power by testing bolder changes with larger expected effects rather than subtle tweaks
- Track your experimentation program's historical detection rate - if fewer than 70% of "promising" tests reach significance, your tests may be systematically underpowered
- Share power analysis results with stakeholders so they understand test timelines and the tradeoff between speed and reliability
Related Terms
Type II Error
A false negative in hypothesis testing - failing to reject the null hypothesis and concluding that a change had no effect when it actually did produce a real improvement.
Effect Size
A quantitative measure of the magnitude of a difference between groups in an experiment, independent of sample size. It answers the question "how big is the improvement?" rather than "is there an improvement?"
Minimum Detectable Effect
The smallest difference between control and variant that a test is designed to reliably detect, given its sample size, significance level, and desired statistical power.
Confidence Level
The percentage probability that a confidence interval calculated from a given experiment will contain the true population parameter, commonly set at 90%, 95%, or 99% in A/B testing.
Hypothesis Testing
A statistical method used to determine whether observed differences in data - such as a higher conversion rate in a test variant - are likely real or could have occurred by random chance.
See Statistical Power in action
KISSmetrics tracks every user across sessions and devices so you can measure what matters. Start free - no credit card required.