Hypothesis testing is the cornerstone of statistical inference. It provides a formal, systematic way to evaluate claims about populations using sample data. From clinical trials to A/B testing, from quality control to psychological research — hypothesis testing is everywhere. This guide walks you through every step clearly.
What is a Statistical Hypothesis?
A statistical hypothesis is a claim about a population parameter (like a mean, proportion, or variance). We use sample data to test whether there is enough evidence to support or refute this claim.
Every hypothesis test involves two competing hypotheses:
Null Hypothesis (H₀): The default assumption of no effect, no difference, or no relationship. Example: H₀: μ = 50. The test asks whether the data give enough evidence against it.
Alternative Hypothesis (H₁ or Hₐ): The claim you are looking for evidence of. Example: H₁: μ ≠ 50. It is supported when H₀ is rejected.
The 6 Steps of Hypothesis Testing
Step 1: State the Hypotheses
Write explicit H₀ and H₁ before collecting data. Decide whether the test is two-tailed (H₁: μ ≠ μ₀), left-tailed (H₁: μ < μ₀), or right-tailed (H₁: μ > μ₀).
Step 2: Choose the Significance Level (α)
α is the probability of rejecting H₀ when it is actually true (Type I error rate). Set α before collecting data. Common choices: α = 0.05, α = 0.01, α = 0.001 (for medical or safety research).
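To make the meaning of α concrete, the following minimal Python sketch simulates many studies in which H₀ is true by construction and counts how often a two-tailed z-test rejects it. The population values (μ₀ = 50, σ = 10), the sample size, and the function name `false_positive_rate` are illustrative assumptions, not part of the guide.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(alpha=0.05, n=20, n_sims=4000, seed=1):
    """Share of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    mu0, sigma = 50.0, 10.0  # hypothetical population; H0: mu = 50 holds
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        # z statistic with known sigma
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        # two-tailed p-value
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            rejections += 1
    return rejections / n_sims


rate = false_positive_rate()
print(f"Share of true nulls rejected at alpha = 0.05: {rate:.3f}")
```

The printed share should land close to 0.05, which is what "Type I error rate" means in practice.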
Step 3: Collect Data and Choose the Appropriate Test
| Situation | Test to Use |
| --- | --- |
| Test one mean (σ unknown) | One-sample t-test |
| Compare two independent means | Two-sample t-test |
| Compare before/after (paired) | Paired t-test |
| Compare 3+ group means | One-way ANOVA |
| Test a proportion | One-proportion z-test |
| Test categorical frequencies | Chi-square test |
| Non-normal 2-group comparison | Mann-Whitney U test |
| Non-normal 3+ groups | Kruskal-Wallis test |
Step 4: Calculate the Test Statistic
The test statistic measures how far your sample result is from what H₀ predicts, in standard error units. General form:
Test statistic = (Observed − Expected under H₀) / Standard Error
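As a small illustration of this general form, here is a sketch of the one-sample t statistic computed from raw data. The data values and the function name `one_sample_t` are hypothetical and chosen only to show the arithmetic.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """(observed mean - mean expected under H0) / standard error."""
    n = len(sample)
    observed = statistics.fmean(sample)
    # standard error of the mean, using the sample standard deviation
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (observed - mu0) / standard_error


data = [52, 48, 55, 51, 49]  # hypothetical measurements
t_stat = one_sample_t(data, mu0=50)
print(f"t = {t_stat:.3f} with df = {len(data) - 1}")
```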
Step 5: Find the P-Value
The p-value is the probability of getting a test statistic at least as extreme as the one observed, assuming H₀ is true. It is not the probability that H₀ is true. Use statistical tables or a calculator. Smaller p-values provide stronger evidence against H₀.
Step 6: Make a Decision
- If p < α → Reject H₀. Results are statistically significant. There is sufficient evidence for H₁.
- If p ≥ α → Fail to reject H₀. Insufficient evidence against H₀. This does NOT mean H₀ is true.
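The decision rule above can be written as a tiny helper. This is only a sketch; the function name `decide` and the wording of the returned strings are illustrative choices.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject H0 only when p is strictly below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.010))        # below alpha = 0.05
print(decide(0.050))        # equal to alpha, so not rejected
print(decide(0.030, 0.01))  # stricter alpha
```

Note that the second outcome is worded "fail to reject", never "accept", in line with the rule above.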
Complete Worked Example
A nutritionist claims that a new diet reduces mean daily calorie intake below 2,000 calories. A sample of 25 participants following the diet had mean intake x̄ = 1,850 calories with SD s = 300 calories. Test at α = 0.05.
Step 1: H₀: μ = 2000 | H₁: μ < 2000 (left-tailed test)
Step 2: α = 0.05, one-tailed
Step 3: One-sample t-test (σ unknown, sample data available)
Step 4: t = (1850 − 2000) / (300/√25) = −150/60 = −2.50
Step 5: df = 24. P(T < −2.50) = 0.010
Step 6: p = 0.010 < 0.05 → Reject H₀. There is statistically significant evidence that mean daily calorie intake on the diet is below 2,000 calories (t(24) = −2.50, p = 0.010).
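The arithmetic in this example can be checked with a short Python sketch that uses only the standard library. Because the standard library has no Student's t distribution, the left-tail probability is approximated here by numerically integrating the t density with Simpson's rule; the helper names `t_pdf` and `t_cdf` are illustrative, and a statistics package would normally do this step for you.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x) via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area


# Values from the worked example
n, xbar, s, mu0 = 25, 1850.0, 300.0, 2000.0
se = s / math.sqrt(n)            # 300 / 5 = 60
t_stat = (xbar - mu0) / se       # -150 / 60
p_left = t_cdf(t_stat, n - 1)    # left-tailed p-value, df = 24
print(f"t = {t_stat:.2f}, df = {n - 1}, p = {p_left:.3f}")
```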
Type I and Type II Errors
| Decision | H₀ is Actually True | H₀ is Actually False |
| --- | --- | --- |
| Reject H₀ | ❌ Type I Error (α) — False Positive | ✅ Correct — True Positive (Power) |
| Fail to Reject H₀ | ✅ Correct — True Negative | ❌ Type II Error (β) — False Negative |
For a fixed sample size, you cannot minimise both simultaneously: reducing α (a stricter test) increases β (more false negatives). Increasing the sample size reduces β at any given α, so a larger sample is what lets you keep both error rates low.
Statistical Power
Power = 1 − β = the probability of correctly rejecting H₀ when the alternative is true. A power of at least 0.80 is the conventional target. Power increases with: larger n, larger effect size, higher α, and lower variability.
Always conduct a power analysis before your study to ensure you collect enough data to detect your expected effect.
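As a rough illustration of how sample size, effect size, and α drive power, here is a minimal Python sketch. It assumes a one-sided, one-sample z-test with known σ, which is a normal approximation rather than the exact t-test calculation; variability enters through the standardised effect size d = (μ − μ₀)/σ. The function name `power_one_sided_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided z-test for a standardised effect size."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)  # rejection threshold under H0
    # probability that the z statistic exceeds the threshold when the effect is real
    return z.cdf(effect_size * sqrt(n) - z_crit)


for n in (10, 25, 50):
    print(f"d = 0.5, n = {n:>2}, alpha = 0.05 -> power = {power_one_sided_z(0.5, n):.2f}")
```

With d = 0.5 and n = 25 the approximation gives a power of about 0.80; smaller samples fall below the conventional target.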
Reporting Hypothesis Test Results
Always report: the test used, test statistic, degrees of freedom, p-value, and effect size. Example APA format: t(24) = −2.50, p = .010, d = −0.50
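A small formatting helper can keep such reports consistent. The sketch below computes Cohen's d for the one-sample case as (x̄ − μ₀)/s and builds the report string; the function names are illustrative, and the output uses an ASCII hyphen for the minus sign.

```python
def cohens_d_one_sample(mean: float, mu0: float, sd: float) -> float:
    """Standardised effect size for a one-sample comparison."""
    return (mean - mu0) / sd


def apa_p(p: float) -> str:
    """Format a p-value without a leading zero, as in the example above."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def apa_t_report(t: float, df: int, p: float, d: float) -> str:
    return f"t({df}) = {t:.2f}, {apa_p(p)}, d = {d:.2f}"


# Values from the diet example: mean 1850, H0 value 2000, SD 300
d = cohens_d_one_sample(1850, 2000, 300)
report = apa_t_report(-2.50, 24, 0.010, d)
print(report)  # t(24) = -2.50, p = .010, d = -0.50
```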
The Logic of Hypothesis Testing
Hypothesis testing follows a logic similar to proof by contradiction, but in probabilistic form. Rather than directly proving your theory (the alternative hypothesis), you assume the opposite (the null hypothesis) is true and examine whether your data are consistent with this assumption. If data at least as extreme as yours would be very unlikely under the null hypothesis, you reject it in favour of the alternative. Unlike a true proof by contradiction, the conclusion is never certain: a rare result can still occur when H₀ is true.
This approach was formalised by Jerzy Neyman and Egon Pearson in the 1930s as a decision framework for scientific research. It provides a systematic, reproducible way to make decisions from uncertain data.
The Five Steps of Hypothesis Testing
Step 1: State the hypotheses. The null hypothesis H₀ is your default position — usually "no effect," "no difference," or "no relationship." The alternative hypothesis H₁ is what you are trying to establish. Be specific and state hypotheses before collecting data to avoid bias.
Step 2: Set the significance level α. This is the probability of incorrectly rejecting a true null hypothesis (Type I error rate). Convention is α = 0.05, but choose based on the consequences of errors in your context. High-stakes settings such as medical trials sometimes use a stricter level, for example α = 0.01.
Step 3: Choose and calculate the test statistic. The test statistic summarises your data into a single number that measures how far your observation is from what H₀ predicts. Common choices: t-statistic for means, z-statistic for proportions, F-statistic for variances, χ²-statistic for frequencies.
Step 4: Calculate the p-value and compare to α. Find the probability of observing a test statistic at least as extreme as yours under H₀. If p < α, reject H₀. If p ≥ α, fail to reject H₀.
Step 5: State your conclusion in context. Never say "prove" or "accept H₀." Say "reject H₀ at the 5% significance level" or "there is insufficient evidence to reject H₀."
One-Tailed vs Two-Tailed Tests
A two-tailed test detects effects in either direction. A one-tailed test is only sensitive to effects in one specified direction. Use one-tailed tests only when you have a strong prior reason to expect the effect in that direction, stated before data collection. One-tailed tests have more power (ability to detect a real effect) but at the cost of being unable to detect effects in the opposite direction.
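The difference shows up directly in the p-value. The sketch below assumes a z statistic (standard normal under H₀) purely for simplicity; the value z = 1.8 and the function name `z_p_values` are hypothetical.

```python
from statistics import NormalDist


def z_p_values(z: float) -> dict:
    """One-tailed and two-tailed p-values for a standard normal test statistic."""
    cdf = NormalDist().cdf(z)
    right = 1 - cdf              # H1: parameter is larger
    left = cdf                   # H1: parameter is smaller
    two = 2 * min(left, right)   # H1: parameter differs in either direction
    return {"right": right, "left": left, "two": two}


p = z_p_values(1.8)
print(f"right-tailed p = {p['right']:.3f}")
print(f"two-tailed   p = {p['two']:.3f}")
print(f"left-tailed  p = {p['left']:.3f}")
```

For this statistic the right-tailed test is significant at α = 0.05 and the two-tailed test is not, while a left-tailed test would find essentially no evidence. This is why the direction must be fixed before seeing the data.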
Type I and Type II Errors
Every hypothesis test risks two types of errors. A Type I error (false positive, rate α) occurs when you reject a true null hypothesis. A Type II error (false negative, rate β) occurs when you fail to reject a false null hypothesis. Statistical power (1−β) is the probability of correctly detecting a real effect.
These errors trade off: decreasing α (making it harder to reject H₀) reduces Type I errors but increases Type II errors. The optimal balance depends on the relative costs of each error type in your application.
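The trade-off can be seen numerically. This sketch again assumes a one-sided z-test with known σ as an approximation; the effect size d = 0.5, the sample sizes, and the function name `type_ii_error` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha: float, effect_size: float = 0.5, n: int = 25) -> float:
    """Approximate beta: the chance of missing a real effect of the given size."""
    z = NormalDist()
    # the statistic falls short of the critical value despite the real effect
    return z.cdf(z.inv_cdf(1 - alpha) - effect_size * sqrt(n))


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> beta = {type_ii_error(alpha):.2f}  (n = 25)")

# A larger sample lowers beta at the same strict alpha
print(f"alpha = 0.01 -> beta = {type_ii_error(0.01, n=50):.2f}  (n = 50)")
```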
Power Analysis and Sample Size
Before conducting a study, power analysis helps determine the required sample size. You specify: the desired power (typically 0.80 or 0.90), significance level α, and the minimum effect size you want to detect. Underpowered studies frequently fail to detect real effects, leading to wasted resources and false negative conclusions.
A useful rule of thumb: to detect a medium effect size (Cohen's d = 0.5) with 80% power at α = 0.05 using a two-sample t-test requires approximately 64 participants per group.
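The rule of thumb can be roughly reproduced with the standard normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)²/d² for a two-sided, two-sample comparison. The sketch below is an approximation, not an exact t-based power analysis, and the function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


print(n_per_group(0.5))  # 63
```

The normal approximation gives 63 per group; the exact t-based calculation adds about one participant, which is where the figure of 64 comes from.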
Common Hypothesis Tests and When to Use Them
| Situation | Appropriate Test |
| --- | --- |
| Compare one mean to a known value, σ known | One-sample z-test |
| Compare one mean to a known value, σ unknown | One-sample t-test |
| Compare two independent group means | Two-sample t-test |
| Compare paired measurements (before/after) | Paired t-test |
| Compare 3+ group means | One-way ANOVA |
| Test independence in a contingency table | Chi-square test |
| Compare proportions | Proportion z-test |
| Non-parametric two-group comparison | Mann-Whitney U test |
Practical Example: Full Walk-Through
A pharmaceutical company claims their new painkiller reduces pain scores by more than 5 points on a 100-point scale. A trial recruits 40 patients (σ unknown, so use t-test).
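H₀: μ_reduction ≤ 5 (the mean reduction is at most 5 points) | H₁: μ_reduction > 5 (one-tailed)
Results: mean reduction = 7.3 points, s = 4.1. t = (7.3−5)/(4.1/√40) = 2.3/0.648 = 3.55
With df = 39 and α = 0.05, the one-tailed critical value is t = 1.685. Since 3.55 > 1.685, the result falls in the rejection region; the corresponding one-tailed p-value is about 0.0005, well below 0.05.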
Conclusion: Reject H₀. There is strong statistical evidence that the painkiller reduces pain by more than 5 points on average.
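The calculation in this walk-through can be reproduced in a few lines of Python. The critical value 1.685 is taken from the text above rather than computed, since the standard library has no t-distribution quantile function.

```python
import math

# Values from the painkiller example
n, mean_reduction, s, mu0 = 40, 7.3, 4.1, 5.0

se = s / math.sqrt(n)                # standard error of the mean reduction
t_stat = (mean_reduction - mu0) / se
t_crit = 1.685                       # one-tailed critical value, df = 39, alpha = 0.05 (from the text)
reject = t_stat > t_crit             # right-tailed test

print(f"SE = {se:.3f}, t = {t_stat:.2f}, reject H0: {reject}")
```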
Worked Example: Two-Sample Test in Practice
A supermarket chain wants to know whether a new store layout increases average transaction value. They measure 50 transactions before redesign: x̄₁ = $47.30, s₁ = $12.40. After redesign (50 transactions): x̄₂ = $52.80, s₂ = $14.20. Does the redesign work?
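H₀: μ₁ = μ₂ (no change) | H₁: μ₂ > μ₁ (one-tailed: the redesign increases spending)
Standard error (Welch): SE = √(12.4²/50 + 14.2²/50) = √(3.0752 + 4.0328) = √7.108 = 2.666
Test statistic: t = (52.80−47.30)/2.666 = 5.50/2.666 = 2.063, with df ≈ 96 (Welch)
Decision: the one-tailed critical value is t(96, 0.05) = 1.661. Since 2.063 > 1.661 (p ≈ 0.021 < 0.05), reject H₀. The redesign significantly increased average transaction value.
Confidence interval: the two-sided 95% CI for the increase is 5.50 ± 1.985×2.666 = [$0.21, $10.79]. The one-sided 95% lower bound that matches the one-tailed test uses 1.661 instead of 1.985 and gives 5.50 − 1.661×2.666 ≈ $1.07.
Practical impact: taking the conservative lower limit of $0.21 per transaction and assuming 50 transactions/day, the redesign adds roughly $3,800/year per store (0.21 × 50 × 365).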
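For reference, here is a minimal Python sketch of the Welch calculation from summary statistics, including the Welch-Satterthwaite degrees of freedom that the example rounds to 96. The function name `welch_t` is illustrative.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Welch's t statistic, approximate df, and standard error from summary data."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # per-group variance of the mean
    se = math.sqrt(v1 + v2)
    t = (mean2 - mean1) / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se


# Before: mean 47.30, s 12.40, n 50. After: mean 52.80, s 14.20, n 50.
t_stat, df, se = welch_t(47.30, 12.40, 50, 52.80, 14.20, 50)
print(f"SE = {se:.3f}, t = {t_stat:.3f}, df = {df:.1f}")
```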
Non-Parametric Alternatives: When Normality Fails
When data is clearly non-normal (heavy tails, strong skew) or ordinal, non-parametric tests provide valid alternatives. The Mann-Whitney U test (non-parametric two-sample test) ranks all observations together and tests whether one group tends to have higher ranks. The Wilcoxon signed-rank test is the paired non-parametric alternative. The Kruskal-Wallis test extends to 3+ groups. These tests sacrifice some power compared to parametric equivalents when normality holds, but maintain validity when it does not. Rule of thumb: use non-parametric tests when n < 30 and data is clearly non-normal, or always when data is on an ordinal scale (ratings, rankings, Likert scales).
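To show what the Mann-Whitney test actually measures, here is a minimal sketch that computes the U statistic by direct pairwise comparison, which is equivalent to the rank-based formula and counts ties as half. The ratings are hypothetical, the function name `mann_whitney_u` is illustrative, and the sketch stops at the statistic: converting U to a p-value needs tables or a statistics package.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b): U_a counts pairs where a value from a beats one from b."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5  # a tie counts as half a win for each group
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


group_a = [3, 4, 2, 5, 4]  # hypothetical 1-5 ratings
group_b = [2, 1, 3, 2, 1]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)  # 22.5 2.5
```

A U value near 0 or near n₁ × n₂ (here 25) indicates that one group tends to rank higher than the other; values near the middle indicate heavy overlap.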