Statistical Significance
Statistical significance is a measure of confidence that the difference observed between test variants is real and not due to random chance, typically expressed as a percentage (e.g., 95% confidence) or a p-value threshold.
Also known as: significance level, confidence level, p-value threshold
Why It Matters
Statistical significance is the guardrail that prevents you from making decisions based on noise rather than signal. In any A/B test, there will be some difference between variants simply due to random variation in who sees each version. Statistical significance testing tells you whether the observed difference is large enough, given your sample size, to be considered a real effect rather than a lucky streak.
The standard threshold in most experimentation programs is 95% confidence (a p-value threshold of 0.05), meaning that if there were actually no difference between variants, a result at least as large as the one observed would occur by chance no more than 5% of the time. Some organizations use 90% for low-risk decisions or 99% for high-stakes changes. The key is choosing your threshold before the test starts and sticking to it.
Reaching statistical significance requires sufficient sample size and a meaningful effect size. A 0.1% difference in conversion rate might be real but would require millions of visitors to detect with confidence. This is why pre-test power analysis is important - it tells you whether your expected traffic can detect the effect size you care about within a reasonable timeframe.
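The pre-test power analysis described above can be sketched with the standard normal-approximation formula for comparing two proportions. This is a simplified illustration, not a replacement for a dedicated sample size calculator:

```python
import math
from statistics import NormalDist

def required_n_per_variant(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = p_target - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a lift from a 3.8% to a 4.2% conversion rate
# at 95% confidence and 80% power:
n = required_n_per_variant(0.038, 0.042)
print(n)  # roughly 38,000 visitors per variant
```

Note how the required sample size grows with the square of the effect size's inverse: halving the detectable lift roughly quadruples the traffic you need.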
How to Calculate
Statistical significance is determined by calculating a p-value using a statistical test (commonly a chi-squared test or z-test for proportions). The p-value represents the probability of observing a difference as large as the one measured if there were actually no real difference between variants. If the p-value is below your threshold (typically 0.05), the result is statistically significant. Most A/B testing tools calculate this automatically.
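As a sketch, the two-proportion z-test mentioned above can be computed with nothing but the Python standard library (the conversion counts here are illustrative):

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control: 1,900 conversions of 50,000 visitors (3.8%)
# Variant: 2,100 conversions of 50,000 visitors (4.2%)
z, p = two_proportion_z_test(1900, 50000, 2100, 50000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p well below 0.05 -> significant
```

If the resulting p-value falls below your pre-registered threshold, you reject the hypothesis that the variants perform identically.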
Industry Applications
An ecommerce site runs a checkout redesign test and sees a 4.2% conversion rate for the new design vs 3.8% for the control after 50,000 visitors per variant. The result is statistically significant at 95% confidence (a two-proportion z-test on those numbers gives p ≈ 0.001), and the practical impact is meaningful - an estimated $240K in annual revenue.
A SaaS company tests a new pricing page but stops the test after only 2,000 visitors when the variant shows 15% higher signups. After restarting with proper sample size (12,000 per variant), the difference shrinks to 3% and is not statistically significant, preventing a premature rollout.
How to Track in KISSmetrics
Your A/B testing platform calculates statistical significance automatically. Use KISSmetrics for extended post-test analysis to verify that winning variants maintain their advantage over time and across user segments. Sometimes a test reaches significance on the primary metric but tells a different story when you examine user-level behavior over weeks.
Common Mistakes
- Peeking at test results repeatedly and stopping as soon as significance is reached, which inflates false positive rates.
- Treating statistical significance as proof that the difference matters practically - a statistically significant 0.02% improvement may not be worth the implementation cost.
- Not accounting for multiple comparisons when looking at many metrics or segments, which increases the chance of finding false positives.
- Confusing confidence level with the probability that the winning variant is better - a 95% confidence result does not mean there is a 95% chance the treatment is better.
- Ignoring statistical power - running tests that are too small to detect realistic effect sizes.
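The peeking problem can be made concrete with a quick simulation of A/A tests, where no real difference exists and every "significant" result is therefore a false positive. This is an illustrative sketch using a simple fixed-horizon z-test at each look:

```python
import random
from statistics import NormalDist

def z_test_p(success_a, success_b, n):
    """Two-sided two-proportion z-test p-value, equal n per arm."""
    pa, pb = success_a / n, success_b / n
    pooled = (success_a + success_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return 1.0
    z = abs(pa - pb) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(42)
SIMS, N, PEEKS = 1000, 1000, 5  # 1,000 A/A tests, 5 interim looks each
peek_fp = final_fp = 0
for _ in range(SIMS):
    a = b = 0
    rejected_early = False
    for i in range(1, N + 1):
        a += random.random() < 0.05  # both arms convert at 5%:
        b += random.random() < 0.05  # any difference is pure noise
        if i % (N // PEEKS) == 0 and z_test_p(a, b, i) < 0.05:
            rejected_early = True     # would have stopped here and "won"
    peek_fp += rejected_early
    final_fp += z_test_p(a, b, N) < 0.05
print(f"false positive rate, stopping at any of 5 peeks: {peek_fp / SIMS:.3f}")
print(f"false positive rate, single pre-planned look:   {final_fp / SIMS:.3f}")
```

The single final look holds the false positive rate near the nominal 5%, while stopping at the first significant peek pushes it well above that - which is exactly why sequential testing methods exist.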
Pro Tips
- Set your significance threshold and minimum sample size before starting any test. Never change the rules mid-experiment.
- Use sequential testing methods (like always-valid p-values) if you need to check results during the test without inflating false positive rates.
- Consider practical significance alongside statistical significance - is the observed difference large enough to be worth the implementation cost?
- Apply a Bonferroni correction or similar method when testing multiple metrics or segments to control the overall false positive rate.
- Document the power of each test (probability of detecting the minimum effect size you care about) to understand the risk of false negatives.
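The Bonferroni correction mentioned in the tips above is simple enough to sketch directly (a hypothetical helper, not a library function): divide your significance threshold by the number of comparisons, and require each p-value to clear that stricter bar.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant only if it beats alpha / num_tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Four metrics tested in one experiment: each must beat 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.010, 0.030, 0.049, 0.20]))
# [True, False, False, False]
```

Note that 0.049 would count as significant on its own but fails after correction - the price of looking at four metrics at once.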
Related Terms
Sample Size
Sample size is the number of users or observations included in each variant of an experiment, determining the statistical power of the test and how confidently you can detect real differences between variants.
A/B Testing
A/B testing is a controlled experiment that compares two versions of a web page, email, ad, or feature by randomly splitting traffic between them and measuring which version performs better on a defined success metric.
Control Group
A control group is the subset of users in an experiment who receive the existing or unchanged experience, serving as the baseline against which the performance of test variants is measured.
Variant
A variant (also called a treatment or challenger) is an alternative version of a page, feature, or experience being tested against the control in an experiment, incorporating the specific changes hypothesized to improve performance.
Multivariate Testing
Multivariate testing (MVT) is an experimentation method that simultaneously tests multiple combinations of page elements - such as headlines, images, and CTAs - to determine which combination of changes produces the best overall result.
See Statistical Significance in action
KISSmetrics tracks every user across sessions and devices so you can measure what matters. Start free - no credit card required.