Type II Error
A false negative in hypothesis testing - failing to reject the null hypothesis and concluding that a change had no effect when it actually did produce a real improvement.
Also known as: false negative, beta error
Formula
Beta = 1 - Statistical Power
Why It Matters
Type II errors are the silent killers of experimentation programs. A Type I error produces a visible failure (you ship something that does not work). A Type II error produces an invisible one (you discard an improvement that would have worked). You never know what you missed.
Type II errors are far more common than most teams realize. They typically happen because tests are underpowered - too few users or too short a runtime to detect real but modest improvements. If your minimum detectable effect is 10% but the true improvement is 5%, a test sized for the larger effect will miss the smaller one most of the time.
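To see why, here is a rough sketch using the standard normal approximation for a two-proportion test (the baseline rate, lift sizes, and sample size below are illustrative, not from any particular experiment):

```python
import math
from statistics import NormalDist

def power_two_proportion(p_control, p_variant, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    se = math.sqrt(variance / n_per_group)
    effect = abs(p_variant - p_control)
    # Probability the observed z-statistic clears the significance threshold
    return 1 - NormalDist().cdf(z_alpha - effect / se)

# A test sized for a 10% relative lift on a 10% baseline conversion rate
# needs roughly 14,750 users per variant for 80% power.
n = 14_750
print(power_two_proportion(0.10, 0.110, n))  # ~0.80 against the 10% lift
print(power_two_proportion(0.10, 0.105, n))  # ~0.29 against a true 5% lift
```

Against the smaller true effect, the same test has only about 29% power - a Type II error rate near 70%, so it misses the real improvement in most runs.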
The cumulative cost of Type II errors is enormous. If you run 50 tests per year and 20% of them would have shown a genuine 3-5% improvement that your test was too small to detect, those missed opportunities compound into significant lost revenue. Proper power analysis before every test is the remedy.
How to Calculate
The probability of a Type II error (beta) equals 1 minus the statistical power of the test. If your test has 80% power, the Type II error rate is 20%. Power depends on sample size, effect size, significance level, and the variance in your data. Use a power calculator to determine the required sample size to achieve acceptable Type II error rates.
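The same normal approximation gives the required sample size directly; the baseline and lift values below are illustrative, and a dedicated power calculator will handle variance estimates and continuity corrections more carefully:

```python
import math
from statistics import NormalDist

def required_sample_size(p_control, p_variant, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test
    (normal approximation). Beta = 1 - power is the Type II error rate."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided significance threshold
    z_beta = z(power)            # target power expressed as a z-score
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_variant) ** 2)

# Detecting a 5% relative lift on a 10% baseline at 80% power (beta = 0.20)
print(required_sample_size(0.10, 0.105))             # ~57,760 per variant
# Raising power to 90% (beta = 0.10) costs substantially more traffic
print(required_sample_size(0.10, 0.105, power=0.90))  # ~77,330 per variant
```

Note how cutting beta from 0.20 to 0.10 raises the sample size by roughly a third - lower Type II error rates are bought with traffic.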
Industry Applications
A specialty food retailer tests personalized product recommendations but calls the test "no impact" after two weeks with only 2,000 users per variant. Power analysis reveals the test could only detect effects above 15%. When rerun with 15,000 users per variant, the same change shows a significant 6% revenue lift.
A developer tools company discards a simplified signup flow after a "non-significant" test. A post-mortem reveals the test had only 40% power to detect the expected 5% improvement. When rerun with proper sample sizing, the simplified flow shows a significant 4.5% improvement in trial starts.
How to Track in KISSmetrics
Prevent Type II errors by running proper power analysis before each experiment. Calculate the sample size needed to detect your minimum meaningful effect at your desired power level (typically 80%). If KISSmetrics shows insufficient traffic to reach that sample size within a reasonable timeframe, consider testing a larger effect (bolder change) or consolidating traffic to the test.
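As a rough planning sketch (the traffic and sample-size numbers here are hypothetical), you can convert a required sample size into an expected runtime before launching:

```python
import math

def weeks_to_complete(n_per_variant, variants, weekly_visitors, allocation=1.0):
    """Estimate test runtime from the traffic actually entering the experiment."""
    weekly_per_variant = weekly_visitors * allocation / variants
    return math.ceil(n_per_variant / weekly_per_variant)

# Hypothetical: 57,760 users needed per variant, 25,000 eligible visitors per
# week, split across two variants with all traffic allocated to the test
print(weeks_to_complete(57_760, variants=2, weekly_visitors=25_000))  # 5 weeks
```

If the estimate comes back at several months, that is the signal to test a bolder change with a larger expected effect or to consolidate more traffic into the experiment.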
Common Mistakes
- Running underpowered tests that have little chance of detecting real improvements and then concluding "the change had no effect"
- Treating "not significant" as "no effect" instead of "insufficient evidence"
- Not performing pre-test power analysis, which is the primary prevention for Type II errors
- Ignoring the cost of missed improvements while focusing exclusively on preventing false positives
Pro Tips
- Always calculate required sample sizes before launching tests - if you cannot reach the needed size, reconsider the test design
- When a test shows a positive but non-significant trend, check whether the test was underpowered before discarding the idea
- Consider running tests at 80% power for standard experiments and 90% power for tests where missing a real improvement is especially costly
- Aggregate results from multiple small, related experiments using meta-analysis to detect effects that individual tests missed
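The last tip can be sketched as a fixed-effect, inverse-variance meta-analysis; the effect sizes and standard errors below are hypothetical, chosen so that each test is non-significant on its own:

```python
import math
from statistics import NormalDist

def pooled_effect(effects, std_errors):
    """Inverse-variance (fixed-effect) pooling of several test results."""
    weights = [1 / se ** 2 for se in std_errors]          # precision weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    z = pooled / pooled_se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided p-value
    return pooled, p_value

# Three small tests of related changes: none is significant individually
# (z-scores of 1.6, 1.1, and 1.7), but the pooled estimate is.
effects = [0.04, 0.03, 0.05]   # observed lifts
ses = [0.025, 0.028, 0.030]    # standard errors
est, p = pooled_effect(effects, ses)
print(round(est, 3), round(p, 4))  # pooled lift ~0.04, p ~0.012
```

Pooling borrows precision across related experiments, recovering a real effect that each underpowered test missed on its own.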
Related Terms
Type I Error
A false positive in hypothesis testing - incorrectly rejecting the null hypothesis and concluding that a change had a real effect when the observed difference was actually due to random chance.
Statistical Power
The probability that a test will correctly detect a real effect when one exists, typically set at 80% as a minimum standard. Higher power means a lower chance of missing genuine improvements.
Effect Size
A quantitative measure of the magnitude of a difference between groups in an experiment, independent of sample size. It answers the question "how big is the improvement?" rather than "is there an improvement?"
Minimum Detectable Effect
The smallest difference between control and variant that a test is designed to reliably detect, given its sample size, significance level, and desired statistical power.
Null Hypothesis
The default assumption in a statistical test that there is no meaningful difference between the control and test groups - any observed difference is due to random chance rather than a real effect.
See Type II Error in action
KISSmetrics tracks every user across sessions and devices so you can measure what matters. Start free - no credit card required.