The chi-square (χ²) test is the go-to statistical test for categorical data. It answers questions like: "Is this die fair?", "Is there a relationship between gender and voting preference?", "Does the distribution of blood types match genetic predictions?" This guide covers both major types of chi-square tests.
What is the Chi-Square Statistic?
χ² = Σ (O − E)² / E
Where O = observed frequency and E = expected frequency. The larger χ², the greater the discrepancy between what you observed and what you expected under H₀.
Type 1: Goodness of Fit Test
Purpose: Test whether observed frequencies for one categorical variable match a hypothesised distribution.
H₀: The observed frequencies follow the specified distribution.
df = k − 1 (where k = number of categories)
Example: A genetics experiment crosses two plants. Mendel's law predicts offspring ratios of 9:3:3:1 for four phenotypes. You observe 315, 108, 101, 32 offspring. Does this match the 9:3:3:1 prediction?
Expected counts (n = 556): 556 × 9/16 = 312.75, 556 × 3/16 = 104.25 (for each of the two middle phenotypes), and 556 × 1/16 = 34.75. χ² = 0.47 with df = 4 − 1 = 3, p = 0.93. Fail to reject H₀: the data are consistent with Mendel's predictions.
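The arithmetic in this example can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the helper name `chi2_sf` is an illustrative choice, and it handles integer degrees of freedom only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # closed form for df = 1
    # Step up two degrees of freedom at a time with the standard recurrence.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

observed = [315, 108, 101, 32]
ratios = [9, 3, 3, 1]

n = sum(observed)
expected = [n * r / sum(ratios) for r in ratios]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(chi2, df)

print(expected)              # [312.75, 104.25, 104.25, 34.75]
print(round(chi2, 2), df)    # 0.47 3
print(round(p_value, 3))     # 0.925
```

The same three steps (expected counts, statistic, tail probability) apply to any goodness-of-fit test; only `observed` and `ratios` change.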
Type 2: Test of Independence
Purpose: Test whether two categorical variables are independent or associated, using a contingency table.
H₀: The two variables are independent (no association).
df = (rows − 1) × (columns − 1)
Expected frequencies: E = (row total × column total) / grand total
Example: Survey 200 people on preferred exercise type (gym/running/cycling) by gender. Is exercise preference independent of gender?
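The expected-frequency formula and the degrees of freedom can be sketched for a table like this one. The counts below are hypothetical, invented purely for illustration (the survey's actual data are not given); only the formulas come from the text above.

```python
# Hypothetical counts for illustration only (200 respondents); not survey data.
# Rows: two gender groups. Columns: gym, running, cycling.
observed = [
    [40, 30, 30],
    [50, 30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E = (row total x column total) / grand total, for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)   # [[45.0, 30.0, 25.0], [45.0, 30.0, 25.0]]
print(df)         # 2
```

Because the two hypothetical row totals are equal, both rows get the same expected counts; with unequal row totals the rows would differ.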
Assumptions and Conditions
- Random sample from the population
- Independent observations — each person counted only once
- Expected frequencies ≥ 5 in each cell — the most important condition. If any E < 5, consider combining categories or using Fisher's Exact Test.
- Minimum total n ≥ 20
Interpreting the Results
- p < 0.05: Reject H₀. Significant deviation from expected (goodness of fit) or significant association (independence).
- p ≥ 0.05: Fail to reject H₀. No significant evidence of deviation or association.
After a significant test of independence, always report an effect size: Cramér's V = √(χ²/(n × (min(r, c) − 1))), where r and c are the numbers of rows and columns. Common benchmarks: V = 0.10 small, 0.30 medium, 0.50 large.
Chi-Square vs Fisher's Exact Test
For 2×2 tables with small expected frequencies (any E < 5), use Fisher's Exact Test instead of chi-square. Fisher's test computes an exact p-value without the large-sample approximation that chi-square requires.
The Two Types of Chi-Square Tests
There are two fundamentally different chi-square tests that share the same distribution. The chi-square goodness-of-fit test compares an observed frequency distribution to an expected distribution (based on theory or a known population). The chi-square test of independence tests whether two categorical variables are associated in a contingency table. Both use the same test statistic formula but answer different questions.
The Chi-Square Test Statistic
For both tests: χ² = Σ[(Observed − Expected)² / Expected], summed over all cells. The statistic is always non-negative and, when expected counts are large enough, approximately follows a chi-square distribution under H₀. The degrees of freedom determine which chi-square distribution to use. Large χ² values indicate large discrepancies between observed and expected counts and provide evidence against H₀.
Chi-Square Test of Independence: Full Example
A researcher investigates whether coffee preference (coffee/tea) is related to work performance rating (excellent/satisfactory/poor). They survey 200 employees and arrange results in a 2×3 contingency table. Expected frequencies are calculated as: E = (Row Total × Column Total) / Grand Total. The χ² statistic is calculated from all 6 cells.
Degrees of freedom = (rows−1)×(columns−1) = (2−1)×(3−1) = 2. At α=0.05, critical χ² = 5.991. If χ² > 5.991, reject the null hypothesis of independence and conclude coffee preference is associated with performance rating.
Assumptions and the Expected Frequency Rule
For the chi-square approximation to be valid: all cells should have expected frequency ≥ 1, and no more than 20% of cells should have expected frequency < 5. When these conditions are violated, use Fisher's Exact Test (for 2×2 tables) or combine categories. Observed frequencies must be counts, not proportions or percentages.
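The rule above is easy to check mechanically before running a test. The sketch below is one possible implementation; the function names and the two example tables are hypothetical and chosen only for illustration.

```python
def expected_counts(observed):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def approximation_ok(expected, min_count=1.0, small=5.0, max_small_share=0.20):
    """Apply the rule above: every E >= 1 and at most 20% of cells with E < 5."""
    cells = [e for row in expected for e in row]
    share_small = sum(e < small for e in cells) / len(cells)
    return min(cells) >= min_count and share_small <= max_small_share

# Hypothetical tables, for illustration only.
large_table = [[30, 20, 10], [20, 25, 15]]
small_table = [[3, 1], [2, 4]]

print(approximation_ok(expected_counts(large_table)))  # True
print(approximation_ok(expected_counts(small_table)))  # False
```

Note that the check is applied to expected counts, not observed counts: a cell with an observed count of zero is acceptable as long as its expected count is large enough.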
Goodness-of-Fit Test Example
A genetics researcher expects offspring in a 9:3:3:1 ratio based on Mendel's laws. They observe 315, 108, 101, 32 offspring (total n=556). Expected: 312.75, 104.25, 104.25, 34.75. χ² = (315−312.75)²/312.75 + ... = 0.47. df = 4−1 = 3. Critical χ² = 7.815. Since 0.47 < 7.815, fail to reject H₀ — data is consistent with Mendelian ratios.
Measures of Association for Contingency Tables
A significant chi-square tells you association exists but not its strength. Effect size measures include:
- Phi (φ): For 2×2 tables, φ = √(χ²/n). Small: 0.1, Medium: 0.3, Large: 0.5
- Cramér's V: For larger tables, V = √(χ²/(n×min(r−1,c−1))). Same benchmarks as phi.
- Odds Ratio: For 2×2 tables, quantifies the odds of one outcome relative to another
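The phi and Cramér's V formulas in this list translate directly into code. This is a minimal sketch; the function names and the input values (χ² = 10.0, n = 200, a 3×4 table) are hypothetical and serve only to illustrate the calculation.

```python
import math

def phi_coefficient(chi2, n):
    """Phi for a 2x2 table: sqrt(chi2 / n)."""
    return math.sqrt(chi2 / n)

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical inputs: chi2 = 10.0 from a 3x4 table with n = 200.
v = cramers_v(10.0, 200, 3, 4)
print(round(v, 3))   # 0.158
```

For a 2×2 table, min(r − 1, c − 1) = 1, so Cramér's V reduces to phi.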
Limitations of Chi-Square Tests
Chi-square tests have important limitations. They only detect whether association exists, not its direction or strength (beyond effect size measures). They require adequate sample sizes. They cannot be used for continuous data without categorisation (which loses information). They assume independent observations — don't use for matched or paired data (use McNemar's test instead).
McNemar's Test for Paired Categorical Data
When you have paired or matched categorical data — for example, the same subjects rated before and after treatment — the standard chi-square test is inappropriate because observations are not independent. McNemar's test is designed specifically for this situation, examining only the discordant pairs (cases where classification changed). It is widely used in medical research to compare diagnostic test results or treatment responses in matched designs. The test statistic follows a chi-square distribution with 1 degree of freedom, but only the off-diagonal cells of the 2×2 table contribute to the test.
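As a rough sketch of how the discordant pairs enter the calculation: the text above does not give the formula, so the version here is the commonly used one without continuity correction, χ² = (b − c)²/(b + c), where b and c are the two off-diagonal counts. The counts are hypothetical.

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square (no continuity correction) from the two discordant counts."""
    stat = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical paired data: 15 subjects changed from negative to positive,
# 5 changed from positive to negative. Concordant pairs are ignored.
stat, p_value = mcnemar(15, 5)
print(stat)                 # 5.0
print(round(p_value, 3))    # 0.025
```

With few discordant pairs, an exact binomial version of the test is usually preferred over this large-sample approximation.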
Complete Step-by-Step Example: Independence Test
A hospital surveys 400 patients about their satisfaction (Satisfied/Neutral/Dissatisfied) across three wards (Surgery/Medicine/Paediatrics). Observed counts:
| Ward | Satisfied | Neutral | Dissatisfied | Total |
|---|---|---|---|---|
| Surgery | 80 | 40 | 30 | 150 |
| Medicine | 70 | 50 | 30 | 150 |
| Paediatrics | 60 | 20 | 20 | 100 |
| Total | 210 | 110 | 80 | 400 |
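Expected count for Surgery/Satisfied = (150 × 210)/400 = 78.75; the other eight expected counts are computed the same way (for example, Medicine/Neutral = 41.25 and Paediatrics/Satisfied = 52.5). Summing (O−E)²/E over all 9 cells gives χ² ≈ 6.00; the Dissatisfied column contributes nothing because its observed and expected counts coincide. df = (3−1)(3−1) = 4. Critical χ²(4, 0.05) = 9.488. Since 6.00 < 9.488 (p ≈ 0.20), fail to reject H₀: there is no significant association between ward and satisfaction level at α = 0.05.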
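The full calculation for the ward table can be checked with a few lines of Python. This sketch uses only the standard library; the closed-form tail probability in it is valid only for df = 4, which is the case here.

```python
import math

observed = [
    [80, 40, 30],   # Surgery
    [70, 50, 30],   # Medicine
    [60, 20, 20],   # Paediatrics
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total x column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form holds only for df = 4.
p_value = math.exp(-chi2 / 2.0) * (1.0 + chi2 / 2.0)

print(expected[0])   # [78.75, 41.25, 30.0]
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.3f}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` performs the same computation for tables of any size; the manual version is shown to make each step visible.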
Goodness-of-Fit: Testing a Genetic Hypothesis
Mendel predicted pea plant colours in a 3:1 ratio (dominant:recessive). Observed from 1000 plants: 740 dominant, 260 recessive. Expected: 750 and 250. χ² = (740−750)²/750 + (260−250)²/250 = 0.133 + 0.400 = 0.533. df = 1, critical χ²(0.05) = 3.841. Since 0.533 < 3.841 (p ≈ 0.47), fail to reject H₀. The observed ratio is consistent with Mendel's 3:1 prediction. A non-significant result does not prove the genetic model; it shows only that the data provide no evidence against it.
When Chi-Square Fails: Fisher's Exact as the Solution
A rare disease researcher has only 15 patients, comparing two treatments. Observed: Treatment A: 6 improved, 2 not; Treatment B: 2 improved, 5 not. The expected count for "B/Not improved" is only (7 × 7)/15 ≈ 3.27, and all four expected counts are below 5, violating the chi-square condition. Fisher's Exact Test works from the exact probability of each possible table with the same row and column totals. The observed table has probability (8!7!8!7!)/(15! × 6! × 2! × 2! × 5!) ≈ 0.091, and adding the probabilities of all tables at least as extreme gives a two-sided p ≈ 0.13. At α=0.05, fail to reject H₀: the difference is not statistically significant with these small samples. The lesson: always check expected cell counts before applying chi-square, and switch to Fisher's Exact when cells are too small.
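The exact calculation for this 2×2 table can be sketched with the standard library alone. The function name is illustrative, and the two-sided definition used here (sum the probabilities of every table no more probable than the observed one) is one common convention, stated as an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )
    return p_observed, p_value

# Treatment A: 6 improved, 2 not. Treatment B: 2 improved, 5 not.
p_observed, p_value = fisher_exact_two_sided(6, 2, 2, 5)
print(f"P(observed table) = {p_observed:.4f}, two-sided p = {p_value:.4f}")
```

The distinction between the two printed numbers matters: the factorial formula gives the probability of the single observed table, while the p-value also includes every other table that is at least as extreme.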
Deep Dive: Chi Square Test Explained — Theory, Assumptions, and Best Practices
This section takes a closer look at the chi-square test: the mathematical theory, worked examples, assumption checks, effect size reporting, common mistakes, and real-world applications that go beyond introductory coverage.
Mathematical Foundation
Every statistical procedure rests on a mathematical model of how the data are generated. The chi-square test assumes independent observations classified into a fixed set of categories, with expected counts large enough for the chi-square approximation to hold. When these conditions are satisfied, the stated Type I error rate and power hold to a good approximation. Understanding these foundations helps you know when results are trustworthy and when to seek alternatives.
Assumptions and Diagnostics
Before interpreting any result, verify all assumptions are satisfied. Common assumption violations and their remedies:
- Non-normality: For small samples, use non-parametric alternatives or bootstrap methods. For large samples, the Central Limit Theorem typically provides robustness.
- Outliers: Identify using IQR fence or modified z-scores. Investigate each outlier — correct data errors, but do not delete genuine extreme observations without disclosure.
- Independence violations: Clustered or longitudinal data requires mixed models or GEE rather than standard methods assuming independence.
Interpreting Your Results Completely
A complete interpretation always includes: (1) the test statistic value, (2) degrees of freedom, (3) exact p-value, (4) confidence interval for the parameter of interest, (5) effect size with interpretation, and (6) a plain-language conclusion. Never report just a p-value — it communicates only one dimension of a multi-dimensional result.
Effect Size and Practical Significance
Statistical significance tells you that an effect is detectable; effect size tells you whether it matters. For every test, compute and report the appropriate effect size measure alongside the p-value. Use field-specific benchmarks (not just Cohen's generic small/medium/large) to evaluate practical significance.
Common Errors and How to Avoid Them
- Multiple testing without correction: Apply Bonferroni, Holm, or FDR corrections whenever running more than one test on the same dataset.
- Confusing statistical and practical significance: Always ask "is this large enough to matter?" not just "is this detectable?"
- p-hacking: Pre-register hypotheses, analysis plans, and significance thresholds before seeing data.
- Overlooking assumptions: Verify independence, normality (or large n), and homogeneity of variance before applying parametric tests.
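The multiple-testing corrections named in the list above can be sketched in a few lines. This is an illustrative implementation of Bonferroni and Holm adjusted p-values; the function names and the four p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values from four chi-square tests on the same dataset.
p_values = [0.010, 0.040, 0.030, 0.005]
print([round(p, 3) for p in bonferroni(p_values)])   # [0.04, 0.16, 0.12, 0.02]
print([round(p, 3) for p in holm(p_values)])         # [0.03, 0.06, 0.06, 0.02]
```

Each adjusted p-value is compared with the original α. Holm is never less powerful than Bonferroni, which is why its adjusted values are never larger.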
When This Test Is Not Appropriate
Every test has boundaries of appropriate application. Understand when to use non-parametric alternatives, when to switch to more complex models, and when the research question requires a different analytic framework entirely. Using the wrong test produces incorrect Type I error rates and power — even if the computation is done correctly.
Reporting in Academic and Professional Contexts
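Follow APA 7th edition reporting format for academic publications: report the test statistic with its symbol (t, F, χ², z), degrees of freedom in parentheses, exact p-value to two or three decimal places, and confidence intervals. Example: "A one-sample t-test indicated that study time significantly exceeded the 10-hour benchmark, t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]." For a chi-square test, include the sample size alongside the degrees of freedom, for example: "A chi-square goodness-of-fit test showed that the observed phenotype counts did not differ significantly from the 9:3:3:1 ratio, χ²(3, N = 556) = 0.47, p = .925."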
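A small helper can keep chi-square results in a consistent format across a report. The sketch below assumes the usual APA pattern for chi-square, χ²(df, N = n), with the leading zero dropped from the p-value; the function name is hypothetical.

```python
def format_chi_square(chi2, df, n, p):
    """Format a chi-square result as 'χ²(df, N = n) = value, p = .xxx'."""
    if p < 0.001:
        p_text = "p < .001"
    else:
        p_text = "p = " + f"{p:.3f}".lstrip("0")   # drop the leading zero
    return f"χ²({df}, N = {n}) = {chi2:.2f}, {p_text}"

print(format_chi_square(0.47, 3, 556, 0.925))
# χ²(3, N = 556) = 0.47, p = .925
```

Very small p-values are reported as "p < .001" rather than as a long string of zeros.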
Statistical Reasoning: Building Intuition Through Examples
Statistical mastery comes from seeing the same concepts applied across many different contexts. The following worked examples and case studies reinforce the core principles while showing their breadth of application across medicine, social science, business, engineering, and natural science.
Case Study 1: Healthcare Research Application
A clinical researcher wants to evaluate whether a new physical therapy protocol reduces recovery time after knee surgery. The study design, data collection, statistical analysis, and interpretation each require careful thought. The researcher must choose appropriate sample sizes, select the right statistical test, verify all assumptions, compute the test statistic and p-value, report the effect size with confidence interval, and interpret the result in terms patients and clinicians can understand. Each step builds on a solid understanding of statistical theory.
Case Study 2: Business Analytics Application
An e-commerce company wants to know if customers who see a new product recommendation algorithm spend more money per session. They have access to data from 50,000 user sessions split evenly between the old and new algorithms. The statistical question is clear, but practical considerations — multiple testing across different metrics, confounding by device type and geography, and the distinction between statistical and business significance — require careful navigation. Understanding the underlying statistical framework guides every analytical decision.
Case Study 3: Educational Assessment
A school district implements a new math curriculum and wants to evaluate its effectiveness using standardized test scores. Before-after comparisons, control group selection, and the inevitable regression-to-the-mean effect must all be addressed. Measuring whether changes are genuine improvements or statistical artifacts requires the full toolkit: descriptive statistics, assumption checking, appropriate tests for the design, effect size calculation, and honest acknowledgment of limitations.
Understanding Output from Statistical Software
When you run this analysis in R, Python, SPSS, or Stata, the software produces detailed output with more numbers than you need for any single analysis. Knowing which numbers are essential (test statistic, df, p-value, CI, effect size) vs. diagnostic vs. supplementary is a critical skill. Our calculator extracts the key results and presents them in a clear, interpretable format — but understanding what each number means, where it comes from, and what would make it change is what separates a statistician from a button-pusher.
Integrating Multiple Analyses
Real research rarely involves a single statistical test in isolation. Typically, a full analysis includes: (1) data quality checks and outlier investigation, (2) descriptive statistics for all key variables, (3) visualization of distributions and relationships, (4) assumption verification for planned inferential tests, (5) primary inferential analysis with effect size and CI, (6) sensitivity analyses testing robustness to assumption violations, and (7) subgroup analyses if pre-specified. This holistic approach produces more trustworthy and complete results than any single test alone.
Statistical Software Commands Reference
For those implementing these analyses computationally: R provides comprehensive implementations through base R and packages like stats, car, lme4, and ggplot2 for visualization. Python users rely on scipy.stats, statsmodels, and pingouin for statistical testing. Both languages offer excellent power analysis tools (R: pwr package; Python: statsmodels.stats.power). SPSS and Stata provide menu-driven interfaces alongside powerful command syntax for reproducible analyses. Learning at least one of these tools is essential for any applied statistician or data scientist.
Frequently Asked Questions: Advanced Topics
These questions address subtle points that often confuse even experienced analysts:
Can I use this test with non-normal data?
The chi-square test itself makes no normality assumption about the raw data, because it works on category counts; its large-sample requirement concerns the expected cell counts. For tests based on sample means, large samples (generally n ≥ 30 per group) allow the Central Limit Theorem to ensure that the test statistic is approximately normally distributed regardless of the population distribution. For small samples with clearly non-normal data, use a non-parametric alternative or bootstrap methods. The key question is not "is my data normal?" but "is the sampling distribution of my test statistic approximately normal?" These are different questions with different answers.
How do I handle missing data?
Missing data is ubiquitous in real research. Complete case analysis (listwise deletion) is the default in most software but can introduce bias if data is not Missing Completely At Random (MCAR). Better approaches: multiple imputation (creates several complete datasets, analyzes each, and pools results using Rubin's rules) and maximum likelihood methods (FIML/EM algorithm). The choice depends on the missing data mechanism and the nature of the analysis. Never delete variables with many missing values without considering the implications.
What is the difference between a one-sided and two-sided test?
A two-sided test rejects H₀ if the test statistic is extreme in either direction. A one-sided test rejects only in the pre-specified direction. The one-sided p-value is half the two-sided p-value for symmetric test statistics. Use a one-sided test only if: (1) the research question is inherently directional, (2) the direction was specified before data collection, and (3) results in the opposite direction would have no practical meaning. Never switch from two-sided to one-sided after seeing which direction the data points — this doubles the effective false positive rate.
How should I report results in a research paper?
Follow APA 7th edition: report the test statistic with its symbol (t, F, χ², z, U), degrees of freedom in parentheses (except for z-tests), exact p-value to two-three decimal places (write "p = .032" not "p < .05"), effect size with confidence interval, and the direction of the effect. Example for a t-test: "The experimental group (M = 72.4, SD = 8.1) scored significantly higher than the control group (M = 68.1, SD = 9.3), t(48) = 1.88, p = .033, d = 0.50, 95% CI for difference [0.34, 8.26]." This one sentence communicates the complete statistical story.