Statistical Significance

Statistical significance is a measure of confidence that the difference observed between test variants is real and not due to random chance, typically expressed as a percentage (e.g., 95% confidence) or a p-value threshold.

Also known as: significance level, confidence level, p-value threshold

Why It Matters

Statistical significance is the guardrail that prevents you from making decisions based on noise rather than signal. In any A/B test, there will be some difference between variants simply due to random variation in who sees each version. Statistical significance testing tells you whether the observed difference is large enough, given your sample size, to be considered a real effect rather than a lucky streak.

The standard threshold in most experimentation programs is 95% confidence (a p-value threshold of 0.05), meaning that if there were truly no difference between variants, a result at least this large would occur by chance no more than 5% of the time. Some organizations use 90% for low-risk decisions or 99% for high-stakes changes. The key is choosing your threshold before the test starts and sticking to it.

Reaching statistical significance requires sufficient sample size and a meaningful effect size. A 0.1% difference in conversion rate might be real but would require millions of visitors to detect with confidence. This is why pre-test power analysis is important - it tells you whether your expected traffic can detect the effect size you care about within a reasonable timeframe.
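The pre-test power analysis described above can be sketched with the standard two-proportion sample-size formula. This is a minimal illustration, not any particular tool's implementation; the function name and the 3.8% vs. 4.2% conversion rates are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a lift from
    p_base to p_target with a two-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2)

# Hypothetical: 3.8% baseline, hoping to detect a lift to 4.2%
n = sample_size_per_variant(0.038, 0.042)  # tens of thousands per variant
```

Halving the detectable lift roughly quadruples the required sample size, which is why tiny effects are so expensive to confirm.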

How to Calculate

Statistical significance is determined by calculating a p-value using a statistical test (commonly a chi-squared test or z-test for proportions). The p-value represents the probability of observing a difference as large as the one measured if there were actually no real difference between variants. If the p-value is below your threshold (typically 0.05), the result is statistically significant. Most A/B testing tools calculate this automatically.
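As a rough illustration of that calculation, here is a minimal two-proportion z-test in Python. The conversion counts are hypothetical, and real testing tools handle edge cases (small samples, continuity corrections) that this sketch ignores:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: 4.0% control vs 4.6% variant, 10,000 visitors each
p = two_proportion_p_value(400, 10000, 460, 10000)
significant = p < 0.05  # True for these numbers
```

If the p-value lands below your pre-chosen threshold, the difference is declared statistically significant; otherwise you cannot rule out random variation.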

Industry Applications

E-commerce

An e-commerce site runs a checkout redesign test and sees a 4.2% conversion rate for the new design vs 3.8% for the control after 50,000 visitors per variant. The result is statistically significant at 95% confidence, and the practical impact is meaningful - an estimated $240K in annual revenue.

SaaS

A SaaS company tests a new pricing page but stops the test after only 2,000 visitors when the variant shows 15% higher signups. After restarting with proper sample size (12,000 per variant), the difference shrinks to 3% and is not statistically significant, preventing a premature rollout.

How to Track in KISSmetrics

Your A/B testing platform calculates statistical significance automatically. Use KISSmetrics for extended post-test analysis to verify that winning variants maintain their advantage over time and across user segments. Sometimes a test reaches significance on the primary metric but tells a different story when you examine user-level behavior over weeks.

Common Mistakes

  • Peeking at test results repeatedly and stopping as soon as significance is reached, which inflates false positive rates.
  • Treating statistical significance as proof that the difference matters practically - a statistically significant 0.02% improvement may not be worth implementing.
  • Not accounting for multiple comparisons when looking at many metrics or segments, which increases the chance of finding false positives.
  • Confusing confidence level with the probability that the winning variant is better - a 95% confidence result does not mean there is a 95% chance the treatment is better.
  • Ignoring statistical power - running tests that are too small to detect realistic effect sizes.

Pro Tips

  • Set your significance threshold and minimum sample size before starting any test. Never change the rules mid-experiment.
  • Use sequential testing methods (like always-valid p-values) if you need to check results during the test without inflating false positive rates.
  • Consider practical significance alongside statistical significance - is the observed difference large enough to be worth the implementation cost?
  • Apply a Bonferroni correction or similar method when testing multiple metrics or segments to control the overall false positive rate.
  • Document the power of each test (probability of detecting the minimum effect size you care about) to understand the risk of false negatives.
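The Bonferroni correction mentioned in the tips above is simple to apply: divide your overall significance threshold by the number of comparisons. A minimal sketch (the function name and the five-metric scenario are illustrative):

```python
def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison p-value threshold that keeps the overall
    false positive rate across all comparisons at or below alpha."""
    return alpha / num_comparisons

# Hypothetical: checking 5 metrics while keeping an overall 5% false positive rate
threshold = bonferroni_threshold(0.05, 5)  # each metric must reach p < 0.01
```

Bonferroni is conservative, which is the safe direction: it trades some power for protection against declaring a false winner just because you looked at many metrics.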

See Statistical Significance in action

KISSmetrics tracks every user across sessions and devices so you can measure what matters. Start free - no credit card required.