Hypothesis Testing: Step-by-Step Guide with Examples (2026)

Hypothesis testing is the cornerstone of statistical inference. It provides a formal, systematic way to evaluate claims about populations using sample data. From clinical trials to A/B testing, from quality control to psychological research — hypothesis testing is everywhere. This guide walks you through every step clearly.

What is a Statistical Hypothesis?

A statistical hypothesis is a claim about a population parameter (like a mean, proportion, or variance). We use sample data to test whether there is enough evidence to support or refute this claim.

Every hypothesis test involves two competing hypotheses:

Null Hypothesis (H₀): The default assumption: no effect, no difference, no relationship. Example: H₀: μ = 50. The test looks for evidence against this claim.

Alternative Hypothesis (H₁ or Hₐ): What you are trying to show. Example: H₁: μ ≠ 50. This is supported if you reject H₀.

The 6 Steps of Hypothesis Testing

Step 1: State the Hypotheses

Write explicit H₀ and H₁ before collecting data. Decide whether the test is two-tailed (H₁: μ ≠ μ₀), left-tailed (H₁: μ < μ₀), or right-tailed (H₁: μ > μ₀).

Step 2: Choose the Significance Level (α)

α is the probability of rejecting H₀ when it is actually true (Type I error rate). Set α before collecting data. Common choices: α = 0.05, α = 0.01, α = 0.001 (for medical or safety research).

Step 3: Collect Data and Choose the Appropriate Test

Situation	Test to Use
Test one mean (σ unknown)	One-sample t-test
Compare two independent means	Two-sample t-test
Compare before/after (paired)	Paired t-test
Compare 3+ group means	One-way ANOVA
Test a proportion	One-proportion z-test
Test categorical frequencies	Chi-square test
Non-normal 2-group comparison	Mann-Whitney U test
Non-normal 3+ groups	Kruskal-Wallis test

Step 4: Calculate the Test Statistic

The test statistic measures how far your sample result is from what H₀ predicts, in standard error units. General form:

Test statistic = (Observed − Expected under H₀) / Standard Error

Step 5: Find the P-Value

The p-value is the probability of getting a test statistic this extreme or more extreme, assuming H₀ is true. Use statistical tables or a calculator. Smaller p-values provide stronger evidence against H₀.

Step 6: Make a Decision

If p < α → Reject H₀. Results are statistically significant. There is sufficient evidence for H₁.
If p ≥ α → Fail to reject H₀. Insufficient evidence against H₀. This does NOT mean H₀ is true.

Complete Worked Example

A nutritionist claims that a new diet reduces mean daily calorie intake below 2,000 calories. A sample of 25 participants following the diet had mean intake x̄ = 1,850 calories with SD s = 300 calories. Test at α = 0.05.

Step 1: H₀: μ = 2000 | H₁: μ < 2000 (left-tailed test)

Step 2: α = 0.05, one-tailed

Step 3: One-sample t-test (σ unknown, sample data available)

Step 4: t = (1850 − 2000) / (300/√25) = −150/60 = −2.50

Step 5: df = 24. P(T < −2.50) = 0.010

Step 6: p = 0.010 < 0.05 → Reject H₀. The sample provides statistically significant evidence that mean daily calorie intake on the diet is below 2,000 calories (t(24) = −2.50, p = 0.010).
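The arithmetic in this example can be checked with a short Python sketch. It uses only the standard library, so the t-distribution tail area is approximated by numerical integration (Simpson's rule); the helper names `t_pdf` and `t_cdf` are illustrative and not part of the original guide.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about zero."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # approximates P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

# Summary statistics from the diet example
x_bar, mu0, s, n = 1850, 2000, 300, 25
se = s / math.sqrt(n)               # standard error of the mean
t_stat = (x_bar - mu0) / se         # Step 4: test statistic
p_value = t_cdf(t_stat, df=n - 1)   # Step 5: left-tailed p-value, P(T < t)
print(f"t({n - 1}) = {t_stat:.2f}, p = {p_value:.3f}")
```

In practice a statistics package would normally supply the t-distribution directly; the hand-rolled integration is only there to keep the sketch dependency-free.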

Type I and Type II Errors

Decision	H₀ is Actually True	H₀ is Actually False
Reject H₀	❌ Type I Error (α) — False Positive	✅ Correct — True Positive (Power)
Fail to Reject H₀	✅ Correct — True Negative	❌ Type II Error (β) — False Negative

At a fixed sample size you cannot minimise both at once: reducing α (a stricter test) increases β (more false negatives). Increasing the sample size lowers β for any chosen α, so a larger sample is the main way to keep both error rates small.

Statistical Power

Power = 1 − β = the probability of correctly rejecting H₀ when the alternative is true. A power of at least 0.80 is the conventional target. Power increases with larger n, larger effect size, higher α, and lower variability.
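A rough sketch of how these factors move power, assuming a one-sided, one-sample test and a normal approximation (so the numbers are slightly optimistic compared with an exact t-based calculation). The function name `approx_power` is hypothetical; the effect size is standardised (difference divided by SD), so lower variability shows up as a larger effect size.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided, one-sample test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return NormalDist().cdf(abs(effect_size) * sqrt(n) - z_crit)

base = approx_power(0.5, 25)                        # d = 0.5, n = 25, alpha = 0.05
more_data = approx_power(0.5, 50)                   # larger n
bigger_effect = approx_power(0.8, 25)               # larger effect (or lower variability)
stricter_alpha = approx_power(0.5, 25, alpha=0.01)  # lower alpha

for label, value in [("base", base), ("more data", more_data),
                     ("bigger effect", bigger_effect), ("stricter alpha", stricter_alpha)]:
    print(f"{label:>14}: {value:.3f}")
```

Under this approximation the base case (the diet example's d = 0.5 with n = 25) lands close to the conventional 0.80 target.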

Always conduct a power analysis before your study to ensure you collect enough data to detect your expected effect. Use our Sample Size Calculator for this.

Reporting Hypothesis Test Results

Always report: the test used, test statistic, degrees of freedom, p-value, and effect size. Example APA format: t(24) = −2.50, p = .010, d = −0.50
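As an illustration, the effect size and the report string for the diet example could be produced as follows. Both function names are hypothetical helpers, and Cohen's d is taken here as the one-sample form (x̄ − μ₀)/s.

```python
def cohens_d_one_sample(x_bar: float, mu0: float, s: float) -> float:
    """One-sample Cohen's d: mean difference in standard-deviation units."""
    return (x_bar - mu0) / s

def apa_t_report(t: float, df: int, p: float, d: float) -> str:
    """Format a t-test result in APA style (no leading zero on p)."""
    p_text = "p < .001" if p < 0.001 else "p = " + f"{p:.3f}".lstrip("0")
    return f"t({df}) = {t:.2f}, {p_text}, d = {d:.2f}"

d = cohens_d_one_sample(1850, 2000, 300)
report = apa_t_report(-2.50, 24, 0.010, d)
print(report)
```

The sketch writes the minus sign as an ASCII hyphen; a published report would typically typeset it as a true minus sign.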

The Logic of Hypothesis Testing

Hypothesis testing is built on a probabilistic analogue of proof by contradiction. Rather than directly proving your theory (the alternative hypothesis), you assume the opposite (the null hypothesis) is true and examine whether your data are consistent with this assumption. If data as extreme as yours would be very unlikely under the null hypothesis, you reject it in favour of the alternative. Unlike a logical contradiction, this never rules the null hypothesis out with certainty; it only shows that the data would be surprising if it were true.

This approach was formalised by Jerzy Neyman and Egon Pearson in the 1930s as a decision framework for scientific research. It provides a systematic, reproducible way to make decisions from uncertain data.

The Five Steps of Hypothesis Testing

Step 1: State the hypotheses. The null hypothesis H₀ is your default position — usually "no effect," "no difference," or "no relationship." The alternative hypothesis H₁ is what you are trying to establish. Be specific and state hypotheses before collecting data to avoid bias.

Step 2: Set the significance level α. This is the probability of incorrectly rejecting a true null hypothesis (Type I error rate). Convention is α = 0.05, but choose based on the consequences of errors in your context. High-stakes settings such as medical or safety research often use a stricter level, for example α = 0.01.

Step 3: Choose and calculate the test statistic. The test statistic summarises your data into a single number that measures how far your observation is from what H₀ predicts. Common choices: t-statistic for means, z-statistic for proportions, F-statistic for variances, χ²-statistic for frequencies.

Step 4: Calculate the p-value and compare to α. Find the probability of observing a test statistic at least as extreme as yours, assuming H₀ is true. If p < α, reject H₀. If p ≥ α, fail to reject H₀.

Step 5: State your conclusion in context. Never say "prove" or "accept H₀." Say "reject H₀ at the 5% significance level" or "there is insufficient evidence to reject H₀."

One-Tailed vs Two-Tailed Tests

A two-tailed test detects effects in either direction. A one-tailed test is only sensitive to effects in one specified direction. Use one-tailed tests only when you have a strong prior reason to expect the effect in that direction, stated before data collection. One-tailed tests have more power (ability to detect a real effect) but at the cost of being unable to detect effects in the opposite direction.
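The difference is easy to see numerically. The sketch below assumes a z-statistic (standard normal reference distribution) and an illustrative value z = 1.8; the function name `p_values` is hypothetical.

```python
from statistics import NormalDist

def p_values(z: float) -> dict:
    """One- and two-tailed p-values for a z-statistic."""
    std_normal = NormalDist()
    upper = 1 - std_normal.cdf(z)   # P(Z >= z)
    lower = std_normal.cdf(z)       # P(Z <= z)
    return {
        "right_tailed": upper,
        "left_tailed": lower,
        "two_tailed": 2 * min(upper, lower),
    }

p = p_values(1.8)
for name, value in p.items():
    print(f"{name}: {value:.4f}")
```

With z = 1.8 the right-tailed p-value falls below 0.05 while the two-tailed p-value does not, which is the extra power of the one-tailed test. The left-tailed p-value is close to 1, which is the cost: an effect in the unexpected direction can never be declared significant.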

Type I and Type II Errors

Every hypothesis test risks two types of errors. A Type I error (false positive, rate α) occurs when you reject a true null hypothesis. A Type II error (false negative, rate β) occurs when you fail to reject a false null hypothesis. Statistical power (1−β) is the probability of correctly detecting a real effect.

These errors trade off: decreasing α (making it harder to reject H₀) reduces Type I errors but increases Type II errors. The optimal balance depends on the relative costs of each error type in your application.
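A small simulation makes the trade-off concrete. This is a sketch under simplifying assumptions that are not in the original text: a two-sided z-test of H₀: μ = 0 with σ = 1 known, n = 30, and a true mean of 0.5 when H₀ is false. The function name `rejection_rate` is hypothetical.

```python
import math
import random
from statistics import NormalDist

def rejection_rate(true_mean: float, alpha: float, n: int = 30,
                   trials: int = 5000, seed: int = 1) -> float:
    """Share of simulated two-sided z-tests of H0: mu = 0 (sigma = 1) that reject H0."""
    rng = random.Random(seed)  # fixed seed so every call sees the same samples
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, 1) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)  # (x_bar - 0) / (1 / sqrt(n))
        rejections += abs(z) > z_crit
    return rejections / trials

type_1_rate = rejection_rate(true_mean=0.0, alpha=0.05)      # H0 true: false positives
beta_at_05 = 1 - rejection_rate(true_mean=0.5, alpha=0.05)   # H0 false: misses at alpha = 0.05
beta_at_01 = 1 - rejection_rate(true_mean=0.5, alpha=0.01)   # H0 false: misses at alpha = 0.01

print(f"Type I rate at alpha = 0.05: {type_1_rate:.3f}")
print(f"beta at alpha = 0.05: {beta_at_05:.3f}")
print(f"beta at alpha = 0.01: {beta_at_01:.3f}")
```

The simulated Type I rate should sit near the nominal 0.05, and tightening α to 0.01 on the same simulated samples raises β.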

Power Analysis and Sample Size

Before conducting a study, power analysis helps determine the required sample size. You specify: the desired power (typically 0.80 or 0.90), significance level α, and the minimum effect size you want to detect. Underpowered studies frequently fail to detect real effects, leading to wasted resources and false negative conclusions.

A useful rule of thumb: to detect a medium effect size (Cohen's d = 0.5) with 80% power at α = 0.05 using a two-sided, two-sample t-test requires approximately 64 participants per group.
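The rule of thumb can be approximated with the standard normal-approximation formula n ≈ 2·((z₁₋α/₂ + z_power)/d)² per group. The sketch below assumes a two-sided test with equal group sizes; the function name `n_per_group` is hypothetical. Because it ignores the small-sample correction of the t-distribution, it lands one participant below the figure of 64 quoted above.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_medium = n_per_group(0.5)   # medium effect
n_small = n_per_group(0.2)    # small effect needs far more data
print(n_medium, n_small)
```

Required sample size scales with 1/d², so halving the detectable effect size roughly quadruples the number of participants needed.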

Common Hypothesis Tests and When to Use Them

Situation	Appropriate Test
Compare one mean to a known value, σ known	One-sample z-test
Compare one mean to a known value, σ unknown	One-sample t-test
Compare two independent group means	Two-sample t-test
Compare paired measurements (before/after)	Paired t-test
Compare 3+ group means	One-way ANOVA
Test independence in a contingency table	Chi-square test
Compare proportions	Proportion z-test
Non-parametric two-group comparison	Mann-Whitney U test

Practical Example: Full Walk-Through

A pharmaceutical company claims their new painkiller reduces pain scores by more than 5 points on a 100-point scale. A trial recruits 40 patients (σ unknown, so use t-test).

H₀: μ_reduction ≤ 5 (mean reduction is at most 5 points) | H₁: μ_reduction > 5 (one-tailed, mean reduction exceeds 5 points)

Results: mean reduction = 7.3 points, s = 4.1. t = (7.3−5)/(4.1/√40) = 2.3/0.648 = 3.55

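With df = 39 and α = 0.05, the one-tailed critical value is t = 1.685. Since 3.55 > 1.685, the statistic falls in the rejection region; the corresponding one-tailed p-value is about 0.0005, well below 0.05.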

Conclusion: Reject H₀. There is strong statistical evidence that the painkiller reduces pain by more than 5 points on average.
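The critical-value decision above can be reproduced in a few lines. The helper `one_sample_t` is illustrative, and the critical value 1.685 is copied from the text rather than computed.

```python
import math

def one_sample_t(x_bar: float, mu0: float, s: float, n: int):
    """Return the one-sample t statistic and its degrees of freedom."""
    se = s / math.sqrt(n)  # standard error of the mean
    return (x_bar - mu0) / se, n - 1

t_stat, df = one_sample_t(x_bar=7.3, mu0=5, s=4.1, n=40)
t_crit = 1.685                 # one-tailed critical value for df = 39, alpha = 0.05 (from the text)
reject_h0 = t_stat > t_crit    # right-tailed test: reject when t exceeds the critical value
print(f"t({df}) = {t_stat:.2f}, reject H0: {reject_h0}")
```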

Worked Example: Two-Sample Test in Practice

A supermarket chain wants to know whether a new store layout increases average transaction value. They measure 50 transactions before redesign: x̄₁ = $47.30, s₁ = $12.40. After redesign (50 transactions): x̄₂ = $52.80, s₂ = $14.20. Does the redesign work?

Hypotheses: H₀: μ₁ = μ₂ (no change) | H₁: μ₂ > μ₁ (one-tailed, redesign increases spending)

Test statistic (Welch's t): SE = √(12.4²/50 + 14.2²/50) = √(3.0752 + 4.0328) = √7.108 = 2.666. t = (52.80 − 47.30)/2.666 = 5.50/2.666 = 2.063. df ≈ 96 (Welch).

Decision: The one-tailed critical value is t(96, 0.05) = 1.661. Since 2.063 > 1.661 (p ≈ 0.021 < 0.05), reject H₀. The redesign significantly increased average transaction value.

Confidence interval: The two-sided 95% CI for the increase is 5.50 ± 1.985×2.666 = [$0.21, $10.79]. The one-sided 95% lower bound that matches the one-tailed test uses 1.661 instead of 1.985: 5.50 − 1.661×2.666 = $1.07. Taking the more conservative figure of $0.21 per transaction, a store with 50 transactions per day would gain roughly $3,800 per year.
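The Welch calculation, including the Welch–Satterthwaite degrees of freedom, can be sketched from the summary statistics alone. The function name `welch_t` is hypothetical.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic for mean2 - mean1, with its standard error and approximate df."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # per-group variance of the sample mean
    se = math.sqrt(v1 + v2)
    t = (mean2 - mean1) / se
    # Welch–Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, df

t_stat, se, df = welch_t(47.30, 12.40, 50, 52.80, 14.20, 50)
print(f"SE = {se:.3f}, t = {t_stat:.3f}, df = {df:.1f}")
```

Welch's version is used here because the two sample standard deviations differ; when the variances are equal it gives nearly the same answer as the pooled two-sample t-test.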

Non-Parametric Alternatives: When Normality Fails

When data is clearly non-normal (heavy tails, strong skew) or ordinal, non-parametric tests provide valid alternatives. The Mann-Whitney U test (non-parametric two-sample test) ranks all observations together and tests whether one group tends to have higher ranks. The Wilcoxon signed-rank test is the paired non-parametric alternative. The Kruskal-Wallis test extends to 3+ groups. These tests sacrifice some power compared to parametric equivalents when normality holds, but maintain validity when it does not. Rule of thumb: use non-parametric tests when n < 30 and data is clearly non-normal, or always when data is on an ordinal scale (ratings, rankings, Likert scales).
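To show what "tends to have higher ranks" means, here is a minimal sketch of the Mann-Whitney U statistic using its pairwise-comparison definition, which is equivalent to the rank-sum form. The data are made up for illustration, and the z-score uses the large-sample normal approximation without a tie correction, so it is only indicative for samples this small.

```python
import math

def mann_whitney_u(a, b):
    """U for sample a: number of (x, y) pairs with x > y, counting ties as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def u_z_score(u, n1, n2):
    """Normal approximation to U under H0 (no tie correction)."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u

# Hypothetical scores for two independent groups
group_a = [3, 5, 8, 9, 12]
group_b = [1, 2, 4, 6, 7]

u_a = mann_whitney_u(group_a, group_b)
z = u_z_score(u_a, len(group_a), len(group_b))
print(f"U = {u_a}, z = {z:.3f}")
```

U ranges from 0 to n₁·n₂; values far from the midpoint n₁·n₂/2 indicate that one group's observations tend to exceed the other's.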
