The t-test is one of the most commonly used statistical tests in science, medicine, social research, and business. It tests whether a sample mean differs significantly from a hypothesised value, or whether two means differ significantly from each other. This guide covers all three types, with worked examples and decision rules.
When to Use a T-Test
- Outcome variable is continuous
- Population standard deviation (σ) is unknown
- Sample size is small or moderate (the t-test remains valid for large samples, where t approaches z, but it matters most when n < 30)
- Data is approximately normally distributed within groups (or n ≥ 30 by CLT)
Type 1: One-Sample T-Test
Question: Is the sample mean significantly different from a hypothesised population mean μ₀?
t = (x̄ − μ₀) / (s/√n), df = n−1
Example: A food company claims a snack bar contains 250 calories. You measure 20 bars: x̄ = 262, s = 18. Test if mean differs from 250 at α = 0.05.
t = (262 − 250) / (18/√20) = 12/4.025 = 2.98, df = 19, p = 0.008 → reject H₀. The bars contain significantly more than 250 calories.
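The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch working from the summary statistics only; the function name `one_sample_t` is an illustrative choice, not part of any library.

```python
import math

def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """Return (t, df) for a one-sample t-test computed from summary statistics."""
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - hypothesised_mean) / standard_error
    return t, n - 1

# Snack bar example: 20 bars, mean 262, SD 18, claimed mean 250
t, df = one_sample_t(sample_mean=262, hypothesised_mean=250, sample_sd=18, n=20)
print(f"t = {t:.2f}, df = {df}")  # t = 2.98, df = 19
```

The p-value still has to come from a t-table or statistical software, since the Python standard library has no t-distribution.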
Type 2: Two-Sample T-Test (Independent)
Question: Do two independent groups have different means?
Welch's t-test (preferred — does not assume equal variances):
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
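Degrees of freedom use the Welch-Satterthwaite approximation: df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. The result is usually not a whole number; calculators and statistical software handle this automatically.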
Example: Does a new teaching method improve test scores? Control group (n=25): x̄=68, s=10. Treatment group (n=25): x̄=74, s=12. t = 6/√(100/25 + 144/25) = 6/√9.76 = 6/3.124 = 1.92, Welch-Satterthwaite df ≈ 46.5, p = 0.061 → fail to reject H₀ at α = 0.05 (borderline).
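As a rough check on the teaching-method example, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom from summary statistics. The helper name `welch_t` is hypothetical, and the df expression is the standard Welch-Satterthwaite formula.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of the group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not a whole number
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Treatment group first, so a positive t means higher treatment scores
t, df = welch_t(74, 12, 25, 68, 10, 25)
print(f"t = {t:.2f}, df = {df:.1f}")  # t = 1.92, df = 46.5
```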
Type 3: Paired T-Test
Question: Is the mean difference between paired measurements significantly different from zero?
t = d̄ / (s_d/√n), df = n−1
Where d̄ = mean difference between pairs, s_d = SD of differences, n = number of pairs.
Example: Blood pressure measured before and after a drug for n=15 patients. Mean difference d̄ = −12 mmHg, s_d = 8. t = −12/(8/√15) = −12/2.066 = −5.81, df = 14, p < 0.001 → highly significant reduction.
Choosing Between the Three Types
| Situation | Test to Use |
| One group vs hypothesised value | One-sample t-test |
| Two unrelated groups | Two-sample t-test (Welch's) |
| Same subjects measured twice (before/after) | Paired t-test |
| More than 2 groups | One-way ANOVA |
T-Test vs Z-Test
Use a z-test when σ is known. Use a t-test when σ is unknown (virtually always). At n ≥ 30, the t-distribution approximates the normal, so the distinction matters less. For proportions, use the z-test for proportions.
Effect Size for T-Tests
Always report Cohen's d alongside the t-test result: d = (x̄₁ − x̄₂) / s_pooled, where s_pooled = √(((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)). Benchmarks: small 0.2, medium 0.5, large 0.8. Cohen's d tells readers how large the difference is, not just whether it is significant.
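To make the formula concrete, here is a small sketch that computes Cohen's d with the usual pooled standard deviation, applied to the teaching-method example from earlier in this guide (treatment: x̄=74, s=12, n=25; control: x̄=68, s=10, n=25). The function name is illustrative.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

d = cohens_d(74, 12, 25, 68, 10, 25)
print(f"d = {d:.2f}")  # d = 0.54
```

By the benchmarks above this is a medium effect, even though the test on the same data did not reach significance at α = 0.05.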
What is a T-Test?
A t-test is a hypothesis test that compares means using a test statistic that follows a t-distribution under the null hypothesis. The t-distribution was developed by William Gosset (writing under the pseudonym "Student") in 1908 while working at the Guinness brewery, hence "Student's t-test." It is used when the population standard deviation is unknown — which is virtually always in practice.
Types of T-Tests
There are three main types, each designed for a different situation:
- One-sample t-test: Compares a sample mean to a known or hypothesised population value
- Independent samples t-test: Compares means of two independent groups
- Paired t-test: Compares means from the same subjects under two conditions (before/after, matched pairs)
One-Sample T-Test: Formula and Example
Test statistic: t = (x̄ − μ₀) / (s/√n), with df = n−1. Example: A manufacturer claims their bolts have a mean diameter of 10 mm. You sample 25 bolts and find x̄ = 10.3 mm, s = 0.5 mm. t = (10.3−10)/(0.5/√25) = 0.3/0.1 = 3.0. With df=24 and α=0.05 (two-tailed), critical t = 2.064. Since 3.0 > 2.064, reject H₀. The mean diameter differs significantly from the claimed 10 mm, so the bolts do not meet specification.
Independent Samples T-Test: Welch's vs Pooled
The classical pooled t-test assumes equal variances between groups. Welch's t-test does not make this assumption and is more robust. Modern statistical practice generally recommends Welch's test as the default choice, as it performs nearly as well as the pooled test when variances are equal but much better when they are not. It is the default in R's t.test, but not everywhere: SciPy's ttest_ind, for example, defaults to the pooled test unless equal_var=False is passed, so check your software's setting.
Welch's test statistic: t = (x̄₁−x̄₂) / √(s₁²/n₁ + s₂²/n₂). The degrees of freedom are estimated using the Welch-Satterthwaite equation, which gives a non-integer value.
Paired T-Test: When and Why
The paired t-test is more powerful than the independent samples test when subjects can be matched. By computing the difference for each pair (d = x₁ − x₂), you remove between-subject variability, which can be very large. You then conduct a one-sample t-test on the differences: t = d̄ / (s_d/√n).
Use paired t-test for: before/after measurements on the same subject, matched case-control studies, cross-over clinical trials where subjects receive both treatments in random order.
Assumptions and Robustness
T-tests assume: data is approximately normally distributed within groups (or n is large enough for CLT), observations are independent, and (for pooled t-test) equal variances. The t-test is quite robust to non-normality for n > 30. For small samples from clearly non-normal distributions, consider the Mann-Whitney U test (non-parametric alternative to independent t-test) or Wilcoxon signed-rank test (non-parametric paired alternative).
Confidence Intervals from T-Tests
Every t-test produces a confidence interval for the true difference (or mean). The 95% CI for a one-sample test is: x̄ ± t*(s/√n). For the difference of two means: (x̄₁−x̄₂) ± t* × SE_diff. These intervals are more informative than p-values alone — they show both whether the effect is significant and how large it plausibly is.
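As an illustration, the one-sample interval can be computed for the bolt example above (x̄ = 10.3 mm, s = 0.5 mm, n = 25). This sketch takes the critical value t* = 2.064 (df = 24, two-tailed α = 0.05) from that example rather than computing it; the function name is an assumption for illustration.

```python
import math

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for a mean: sample_mean ± t* × s/√n."""
    margin = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = t_confidence_interval(10.3, 0.5, 25, t_critical=2.064)
print(f"95% CI: [{low:.2f}, {high:.2f}] mm")  # 95% CI: [10.09, 10.51] mm
```

The interval excludes 10 mm, which agrees with rejecting H₀ in the test, and it also shows the plausible size of the deviation.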
Effect Size for T-Tests: Cohen's d
Report Cohen's d = (x̄₁−x̄₂)/s_pooled whether or not the result is significant. Benchmarks: d = 0.2 (small), 0.5 (medium), 0.8 (large). A t-test with p < 0.001 but d = 0.1 suggests a statistically detectable but practically negligible difference, which is especially common with very large samples.
Full Worked Example: Independent Samples T-Test
A researcher compares exam scores between two teaching methods. Method A (n=12): 72, 68, 75, 80, 65, 78, 71, 77, 69, 74, 76, 73. Method B (n=12): 81, 85, 79, 88, 84, 82, 87, 80, 86, 83, 85, 78.
Method A: x̄₁=73.17, s₁=4.41 (s₁²=19.424). Method B: x̄₂=83.17, s₂=3.21 (s₂²=10.333). These are sample standard deviations, using the n−1 denominator.
Welch's t-test: SE = √(19.424/12 + 10.333/12) = √(1.619 + 0.861) = √2.480 = 1.575. t = (73.17−83.17)/1.575 = −10.00/1.575 = −6.35. Welch-Satterthwaite df ≈ 20.1. Critical t(20, 0.05, two-tailed) = 2.086. Since |−6.35| >> 2.086, p < 0.001. Method B significantly outperforms Method A.
Cohen's d = 10.00 / √((19.424+10.333)/2) = 10.00/√14.88 = 10.00/3.86 = 2.59, an enormous effect by any benchmark. 95% CI for the difference: −10.00 ± 2.086×1.575 ≈ [−13.3, −6.7]. The difference is both statistically significant and practically very large.
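The same analysis can be run from the raw scores using only Python's standard library. This is a sketch rather than a full implementation: it stops at the test statistic, degrees of freedom, and effect size. Note that `statistics.variance` uses the n−1 (sample) denominator and no intermediate rounding, so the figures can differ slightly from a hand calculation.

```python
import math
import statistics

method_a = [72, 68, 75, 80, 65, 78, 71, 77, 69, 74, 76, 73]
method_b = [81, 85, 79, 88, 84, 82, 87, 80, 86, 83, 85, 78]

mean_a, mean_b = statistics.mean(method_a), statistics.mean(method_b)
var_a, var_b = statistics.variance(method_a), statistics.variance(method_b)
n_a, n_b = len(method_a), len(method_b)

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
se = math.sqrt(var_a / n_a + var_b / n_b)
t = (mean_a - mean_b) / se
df = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# Cohen's d with the equal-n pooled SD, sqrt((s1^2 + s2^2) / 2)
d = abs(mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)

print(f"s_a = {math.sqrt(var_a):.2f}, s_b = {math.sqrt(var_b):.2f}")
print(f"SE = {se:.3f}, t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```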
Paired T-Test: Before-After Weight Loss Study
Ten participants follow a diet for 12 weeks. Weights before and after (kg): Before: 85, 92, 78, 96, 88, 102, 75, 91, 84, 95. After: 81, 88, 74, 90, 83, 96, 72, 86, 80, 90.
Differences (before−after): 4, 4, 4, 6, 5, 6, 3, 5, 4, 5. d̄ = 4.60 kg. s_d = 0.97 kg.
t = d̄/(s_d/√n) = 4.60/(0.966/√10) = 4.60/0.3055 = 15.06 (using s_d = 0.966 before rounding). df = 9. t_critical(0.05, two-tailed) = 2.262. Since 15.06 >> 2.262, p < 0.001. The diet significantly reduces weight. 95% CI for mean weight loss: 4.60 ± 2.262×0.3055 = [3.91, 5.29] kg. The paired design is powerful here: by removing between-subject variability (people differ in initial weight), we isolate the diet's effect with just 10 participants.
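To see how much the pairing helps, the sketch below computes the paired t statistic from the raw weights and, for contrast, the Welch statistic you would get by wrongly treating the two columns as independent groups. It uses only the standard library and does not round intermediate values, so the paired statistic may differ slightly from a hand calculation.

```python
import math
import statistics

before = [85, 92, 78, 96, 88, 102, 75, 91, 84, 95]
after = [81, 88, 74, 90, 83, 96, 72, 86, 80, 90]
n = len(before)

# Paired analysis: a one-sample t-test on the within-subject differences
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
t_paired = mean_diff / (sd_diff / math.sqrt(n))

# Contrast: Welch statistic if the pairing is (incorrectly) ignored
se_unpaired = math.sqrt(
    statistics.variance(before) / n + statistics.variance(after) / n
)
t_unpaired = (statistics.mean(before) - statistics.mean(after)) / se_unpaired

print(f"paired:   mean diff = {mean_diff:.2f}, s_d = {sd_diff:.3f}, t = {t_paired:.2f}")
print(f"unpaired: t = {t_unpaired:.2f}")
```

The mean difference is the same in both analyses; only the standard error changes. Ignoring the pairing leaves the large between-subject spread in the denominator, and the same data would no longer look significant.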
Deep Dive: T-Test Theory, Assumptions, and Best Practices
This section takes a closer look at the t-test, covering the mathematical theory, assumption checking, effect size reporting, common mistakes, and real-world applications that go beyond introductory coverage.
Mathematical Foundation
Every statistical procedure rests on a mathematical model of how data is generated. The t-test assumes specific data-generating conditions (independent observations, and approximately normal data or a large enough sample) that, when satisfied, deliver the stated Type I error rate and power. Understanding these foundations helps you know when results are trustworthy and when to seek alternatives.
Assumptions and Diagnostics
Before interpreting any result, verify all assumptions are satisfied. Common assumption violations and their remedies:
- Non-normality: For small samples, use non-parametric alternatives or bootstrap methods. For large samples, the Central Limit Theorem typically provides robustness.
- Outliers: Identify using IQR fence or modified z-scores. Investigate each outlier — correct data errors, but do not delete genuine extreme observations without disclosure.
- Independence violations: Clustered or longitudinal data requires mixed models or GEE rather than standard methods assuming independence.
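The IQR fence mentioned in the list above can be sketched as follows. The 1.5 × IQR multiplier is the conventional choice, the scores are hypothetical with one planted extreme value, and quartiles are computed with `statistics.quantiles` using its default method (other quartile conventions give slightly different fences).

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical exam scores; 120 is a planted extreme value
scores = [72, 68, 75, 80, 65, 78, 71, 77, 69, 74, 76, 120]
print(iqr_outliers(scores))  # [120]
```

Flagging a value is only the first step: it still has to be investigated before deciding whether it is a data error or a genuine observation.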
Interpreting Your Results Completely
A complete interpretation always includes: (1) the test statistic value, (2) degrees of freedom, (3) exact p-value, (4) confidence interval for the parameter of interest, (5) effect size with interpretation, and (6) a plain-language conclusion. Never report just a p-value — it communicates only one dimension of a multi-dimensional result.
Effect Size and Practical Significance
Statistical significance tells you that an effect is detectable; effect size tells you whether it matters. For every test, compute and report the appropriate effect size measure alongside the p-value. Use field-specific benchmarks (not just Cohen's generic small/medium/large) to evaluate practical significance.
Common Errors and How to Avoid Them
- Multiple testing without correction: Apply Bonferroni, Holm, or FDR corrections whenever running more than one test on the same dataset.
- Confusing statistical and practical significance: Always ask "is this large enough to matter?" not just "is this detectable?"
- p-hacking: Pre-register hypotheses, analysis plans, and significance thresholds before seeing data.
- Overlooking assumptions: Verify independence, normality (or large n), and homogeneity of variance before applying parametric tests.
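For the multiple-testing point in the list above, here is a minimal sketch of Bonferroni and Holm adjusted p-values in plain Python. The p-values are hypothetical, and the function names are illustrative rather than taken from any library.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

p_values = [0.008, 0.061, 0.0004, 0.03]  # hypothetical results from four tests
print([round(p, 4) for p in bonferroni(p_values)])  # [0.032, 0.244, 0.0016, 0.12]
print([round(p, 4) for p in holm(p_values)])        # [0.024, 0.061, 0.0016, 0.06]
```

Holm's adjusted values are never larger than Bonferroni's, which is why it is often preferred when controlling the family-wise error rate.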
When This Test Is Not Appropriate
Every test has boundaries of appropriate application. Understand when to use non-parametric alternatives, when to switch to more complex models, and when the research question requires a different analytic framework entirely. Using the wrong test produces incorrect Type I error rates and power — even if the computation is done correctly.
Reporting in Academic and Professional Contexts
Follow APA 7th edition reporting format for academic publications: report the test statistic with its symbol (t, F, χ², z), degrees of freedom in parentheses, exact p-value to two or three decimal places, and confidence intervals. Example: "A one-sample t-test indicated that study time significantly exceeded the 10-hour benchmark, t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]."
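If you report many tests, a small formatting helper keeps the output consistent. The sketch below is one possible way to build the string shown in the example; the function names are hypothetical, and it follows the common APA convention of dropping the leading zero from p and writing very small values as "p < .001".

```python
def format_p(p):
    """Format a p-value APA-style: no leading zero, floor at .001."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def apa_t_report(t, df, p, d, ci_low, ci_high):
    """Build an APA-style summary string for a t-test result."""
    return (
        f"t({df}) = {t:.2f}, {format_p(p)}, d = {d:.2f}, "
        f"95% CI [{ci_low:.1f}, {ci_high:.1f}]"
    )

print(apa_t_report(2.84, 23, 0.009, 0.58, 10.7, 13.2))
# t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]
```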
Statistical Reasoning: Building Intuition Through Examples
Statistical mastery comes from seeing the same concepts applied across many different contexts. The following worked examples and case studies reinforce the core principles while showing their breadth of application across medicine, social science, business, engineering, and natural science.
Case Study 1: Healthcare Research Application
A clinical researcher wants to evaluate whether a new physical therapy protocol reduces recovery time after knee surgery. The study design, data collection, statistical analysis, and interpretation each require careful thought. The researcher must choose appropriate sample sizes, select the right statistical test, verify all assumptions, compute the test statistic and p-value, report the effect size with confidence interval, and interpret the result in terms patients and clinicians can understand. Each step builds on a solid understanding of statistical theory.
Case Study 2: Business Analytics Application
An e-commerce company wants to know if customers who see a new product recommendation algorithm spend more money per session. They have access to data from 50,000 user sessions split evenly between the old and new algorithms. The statistical question is clear, but practical considerations — multiple testing across different metrics, confounding by device type and geography, and the distinction between statistical and business significance — require careful navigation. Understanding the underlying statistical framework guides every analytical decision.
Case Study 3: Educational Assessment
A school district implements a new math curriculum and wants to evaluate its effectiveness using standardized test scores. Before-after comparisons, control group selection, and the inevitable regression-to-the-mean effect must all be addressed. Measuring whether changes are genuine improvements or statistical artifacts requires the full toolkit: descriptive statistics, assumption checking, appropriate tests for the design, effect size calculation, and honest acknowledgment of limitations.
Understanding Output from Statistical Software
When you run this analysis in R, Python, SPSS, or Stata, the software produces detailed output with more numbers than you need for any single analysis. Knowing which numbers are essential (test statistic, df, p-value, CI, effect size) vs. diagnostic vs. supplementary is a critical skill. Our calculator extracts the key results and presents them in a clear, interpretable format — but understanding what each number means, where it comes from, and what would make it change is what separates a statistician from a button-pusher.
Integrating Multiple Analyses
Real research rarely involves a single statistical test in isolation. Typically, a full analysis includes: (1) data quality checks and outlier investigation, (2) descriptive statistics for all key variables, (3) visualization of distributions and relationships, (4) assumption verification for planned inferential tests, (5) primary inferential analysis with effect size and CI, (6) sensitivity analyses testing robustness to assumption violations, and (7) subgroup analyses if pre-specified. This holistic approach produces more trustworthy and complete results than any single test alone.
Statistical Software Commands Reference
For those implementing these analyses computationally: R provides comprehensive implementations through base R and packages like stats, car, lme4, and ggplot2 for visualization. Python users rely on scipy.stats, statsmodels, and pingouin for statistical testing. Both languages offer excellent power analysis tools (R: pwr package; Python: statsmodels.stats.power). SPSS and Stata provide menu-driven interfaces alongside powerful command syntax for reproducible analyses. Learning at least one of these tools is essential for any applied statistician or data scientist.
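As a starting point in Python, the three t-tests in this guide map onto three functions in scipy.stats. The sketch below assumes SciPy is installed and reuses the raw data from the worked examples above; the benchmark of 70 in the one-sample test is a hypothetical value chosen for illustration.

```python
from scipy import stats

method_a = [72, 68, 75, 80, 65, 78, 71, 77, 69, 74, 76, 73]
method_b = [81, 85, 79, 88, 84, 82, 87, 80, 86, 83, 85, 78]
before = [85, 92, 78, 96, 88, 102, 75, 91, 84, 95]
after = [81, 88, 74, 90, 83, 96, 72, 86, 80, 90]

# One-sample: does the Method A mean differ from 70? (70 is hypothetical)
one_sample = stats.ttest_1samp(method_a, popmean=70)

# Independent samples: equal_var=False requests Welch's test
welch = stats.ttest_ind(method_a, method_b, equal_var=False)

# Paired: the same subjects measured before and after
paired = stats.ttest_rel(before, after)

for name, result in [("one-sample", one_sample), ("Welch", welch), ("paired", paired)]:
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.4g}")
```

All three functions return two-sided p-values by default. Effect sizes are not part of these results and need to be computed separately.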