Hypothesis Testing
A statistical method for determining whether an observed difference in data - such as a higher conversion rate in a test variant - reflects a genuine effect or could plausibly have occurred by random chance.
Also known as: significance testing, statistical hypothesis testing
Why It Matters
Without hypothesis testing, you cannot distinguish between a genuine improvement and a lucky streak. If your new checkout page shows a 5% higher conversion rate this week, hypothesis testing tells you whether that difference is statistically meaningful or within the range of normal random variation.
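To make this concrete, here is a minimal sketch of the standard two-proportion z-test often used for conversion-rate comparisons. It uses only the Python standard library; the function name and the example numbers (a 5% relative lift on 2,000 visitors per variant) are illustrative, not from any particular tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: chance of a difference at least this extreme
    # if the variants are actually identical.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# A 5% relative lift: 10.0% vs 10.5% conversion, 2,000 visitors each.
z, p = two_proportion_z_test(200, 2000, 210, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

At this sample size the p-value comes out far above 0.05, which is exactly the point: a 5% relative lift on a few thousand visitors is well within normal random variation.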
Hypothesis testing protects organizations from costly false positives. Rolling out a change based on random noise can hurt performance, waste engineering time, and erode trust in the experimentation process. Rigorous testing ensures you only ship changes that are genuinely better.
The discipline of forming and testing hypotheses also improves how teams think about optimization. Instead of "let's try a bigger button," a proper hypothesis states: "We believe that making the CTA button more prominent will increase checkout completion by at least 3% because session recordings show users struggle to find the current button." This forces clarity about the expected mechanism and magnitude of improvement.
Industry Applications
A shoe retailer hypothesizes that adding customer review photos to product pages will increase add-to-cart rates. They run a controlled test with proper sample sizing and find a statistically significant 8% improvement, validating the hypothesis and justifying the feature investment.
A B2B platform hypothesizes that removing the credit card requirement from trial signup will increase trial starts by 30%+. The test shows a 45% increase in trial starts but a 20% decrease in trial-to-paid conversion. The hypothesis test on overall revenue shows no significant difference, preventing a premature rollout.
How to Track in KISSmetrics
When running experiments through KISSmetrics, define your hypothesis before the test begins. Use KISSmetrics A/B test reports to monitor both the observed difference and the statistical significance. Wait until the required sample size is reached before drawing conclusions. The Metrics dashboard lets you track experiment results alongside other key metrics to monitor for unintended side effects.
Common Mistakes
- Stopping a test as soon as the result looks significant - this inflates false positive rates dramatically
- Not defining success criteria and sample size requirements before the test starts
- Running too many simultaneous tests without adjusting for multiple comparisons
- Ignoring practical significance - a statistically significant 0.1% improvement may not be worth the engineering cost to maintain
- Testing without a clear hypothesis, which makes it impossible to learn from results whether they are positive, negative, or neutral
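The first mistake - peeking at results and stopping early - can be demonstrated with a short simulation: run A/A tests where there is no real difference, check significance after every batch of visitors, and stop at the first "significant" reading. The function name, batch size, and trial counts below are illustrative assumptions; only the Python standard library is used.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_trials=500, peeks=10, batch=200,
                                p_true=0.10, seed=42):
    """Simulate A/A tests (no real effect) with repeated significance checks,
    stopping at the first p < 0.05. Returns the observed false positive rate."""
    rng = random.Random(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(n_trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            # Both variants draw from the same true conversion rate.
            conv_a += sum(rng.random() < p_true for _ in range(batch))
            conv_b += sum(rng.random() < p_true for _ in range(batch))
            n += batch
            p_pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(max(p_pool * (1 - p_pool) * (2 / n), 1e-12))
            z = (conv_b / n - conv_a / n) / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:
                false_positives += 1  # declared "significant" with no real effect
                break
    return false_positives / n_trials

rate = peeking_false_positive_rate()
print(f"False positive rate with peeking: {rate:.1%}")
```

With ten peeks per test, the simulated false positive rate lands well above the nominal 5%, which is why the decision rule and stopping point must be fixed before the test starts.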
Pro Tips
- Always calculate the required sample size before launching a test using a power analysis calculator
- Pre-register your hypothesis, primary metric, and test duration to prevent post-hoc rationalization of results
- Run experiments for full business cycles (at least one full week) to account for day-of-week effects
- Track guardrail metrics (metrics that should not get worse) alongside your primary success metric
- Document every experiment with its hypothesis, results, and learnings in a shared experiment log
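The power analysis mentioned in the first tip can be sketched with the standard two-proportion sample size formula (normal approximation). The function name and the example scenario - detecting an absolute lift of 0.6 percentage points over a 3% baseline - are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect an absolute lift `mde` over
    `baseline` conversion, using the two-proportion normal approximation."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.inv_cdf(power)           # protection against missed effects
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# Detect a lift from 3.0% to 3.6% conversion with 80% power at alpha = 0.05.
print(sample_size_per_variant(0.03, 0.006))
```

Note how quickly the requirement grows: small baselines and small minimum detectable effects push the required sample into the tens of thousands per variant, which is why sample size must be budgeted before launch rather than checked after.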
Related Terms
Null Hypothesis
The default assumption in a statistical test that there is no meaningful difference between the control and test groups - any observed difference is due to random chance rather than a real effect.
P-Value
The probability of observing a result as extreme as the one measured, assuming the null hypothesis is true. A small p-value (typically below 0.05) suggests the observed difference is unlikely due to chance alone.
Confidence Level
The percentage probability that a confidence interval calculated from a given experiment will contain the true population parameter, commonly set at 90%, 95%, or 99% in A/B testing.
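A confidence interval at a given confidence level can be sketched with the normal approximation for a conversion rate. The function name and the example figures (200 conversions out of 2,000 visitors) are illustrative; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

def conversion_rate_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / visitors)  # standard error times z
    return p - margin, p + margin

low, high = conversion_rate_ci(200, 2000)  # observed 10.0% conversion
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

For 200 conversions in 2,000 visitors this yields roughly 8.7% to 11.3%: the observed 10% rate is the point estimate, and the interval conveys how much it could plausibly move with more data.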
Statistical Power
The probability that a test will correctly detect a real effect when one exists, typically set at 80% as a minimum standard. Higher power means a lower chance of missing genuine improvements.
Minimum Detectable Effect
The smallest difference between control and variant that a test is designed to reliably detect, given its sample size, significance level, and desired statistical power.
See Hypothesis Testing in action
KISSmetrics tracks every user across sessions and devices so you can measure what matters. Start free - no credit card required.