The p-value is one of the most used, and most misunderstood, concepts in statistics. It appears in research papers, clinical trials, and data science projects across nearly every field. Yet many people who use p-values daily cannot accurately define what they mean. This guide explains p-values clearly, with no confusing jargon.
The Simple Definition
A p-value is the probability of getting results at least as extreme as your observed data, assuming the null hypothesis is true.
In plain English: "If there really were no effect, how likely is it that I would see data this unusual just by chance?"
A small p-value (close to 0) means your data would be very unlikely if the null hypothesis were true, which counts as evidence against the null hypothesis.
A Simple Example
Imagine you flip a coin 20 times and get 16 heads. You want to test whether the coin is fair (probability of heads p = 0.5).
Null hypothesis H₀: The coin is fair (p = 0.5)
Question: If the coin really is fair, how likely is it to get 16 or more heads out of 20?
Answer: P(X ≥ 16) ≈ 0.006 = 0.6%
This is your (one-sided) p-value. Since 0.006 < 0.05, you reject H₀: the data are hard to reconcile with a fair coin. A fair coin would produce 16 or more heads in only about 6 of every 1,000 such experiments.
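The tail probability in this example can be checked directly. The snippet below is a minimal sketch using only the Python standard library; the function name `binomial_upper_tail` is an illustrative choice, not something from a statistics package.

```python
from math import comb

def binomial_upper_tail(n: int, k: int, prob: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, prob)."""
    return sum(
        comb(n, i) * prob**i * (1 - prob) ** (n - i)
        for i in range(k, n + 1)
    )

# 16 or more heads in 20 flips of a fair coin
p_value = binomial_upper_tail(20, 16)
print(f"P(X >= 16) = {p_value:.4f}")  # prints 0.0059
```

The sum covers only the upper tail (16, 17, 18, 19, 20 heads), which is why this is a one-sided p-value.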
What Does p < 0.05 Mean?
The threshold α = 0.05 is a convention proposed by Ronald Fisher in the 1920s. It means:
- You accept a 5% chance of falsely concluding there is an effect when there is none (Type I error)
- If you ran 100 experiments where H₀ is true, about 5 would give p < 0.05 by chance alone
- p < 0.05 does NOT mean there is a 95% probability your conclusion is correct
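The "about 5 in 100" claim can be illustrated with a small simulation. This is a sketch under stated assumptions: every experiment draws 30 values from a standard normal distribution (so H₀: mean = 0 is true by construction), and a z-test with known standard deviation is used. The sample size, seed, and function names are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, assuming sigma is known."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=10_000, n=30, alpha=0.05, seed=1):
    """Share of experiments with p < alpha when H0 is true in all of them."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no real effect
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"Share of experiments with p < 0.05: {rate:.3f}")
```

The printed share should land close to 0.05: those "significant" results are all false positives, because no effect exists in the simulated data.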
| P-value range | Common interpretation | Decision (α = 0.05) |
| p < 0.001 | Extremely significant | Reject H₀ strongly |
| 0.001 ≤ p < 0.01 | Highly significant | Reject H₀ |
| 0.01 ≤ p < 0.05 | Statistically significant | Reject H₀ |
| 0.05 ≤ p < 0.10 | Borderline / marginal | Fail to reject H₀ (with caution) |
| p ≥ 0.10 | Not significant | Fail to reject H₀ |
5 Common P-Value Misconceptions
Misconception 1: p = 0.03 means there is a 97% chance the result is real
Wrong. The p-value says nothing about the probability that your conclusion is correct. It only says how unusual your data would be if H₀ were true. The probability that H₀ is true given your data (the posterior probability) requires Bayesian analysis.
Misconception 2: p > 0.05 means the null hypothesis is true
Wrong. Failing to reject H₀ just means you do not have enough evidence against it. Your study might be underpowered (too small a sample) to detect a real but small effect. "Absence of evidence is not evidence of absence."
Misconception 3: A smaller p-value means a bigger effect
Wrong. The p-value depends on both effect size and sample size. A very large study can produce p < 0.0001 for a completely trivial effect. Always report effect sizes (Cohen's d, R², odds ratio) alongside p-values.
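To see how sample size alone drives the p-value, the sketch below holds the effect fixed and varies n. It assumes a one-sample z-test in which the observed mean sits exactly d standard deviations from the null value, so z = d × √n. The effect size d = 0.05 and the sample sizes are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p(effect_size_d: float, n: int) -> float:
    """Two-sided z-test p-value when the observed standardized effect is d."""
    z = effect_size_d * sqrt(n)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate in the far tail
    return erfc(abs(z) / sqrt(2))

for n in (20, 200, 2_000, 20_000):
    print(f"n = {n:>6}: p = {two_sided_p(0.05, n):.3g}")
```

The same tiny effect gives p of roughly 0.8 at n = 20 and far below 0.001 at n = 20,000. Nothing about the effect changed, only the amount of data.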
Misconception 4: p < 0.05 means the finding is practically important
Wrong. Statistical significance ≠ practical significance. A drug that reduces blood pressure by 0.5 mmHg might be statistically significant in a massive trial but completely clinically irrelevant.
Misconception 5: You can "accept" H₀ if p > 0.05
Wrong. You either reject H₀ or fail to reject it. You never accept H₀. The hypothesis test is a one-way test: significant results count as evidence against H₀, while non-significant results are inconclusive.
P-Values in Different Statistical Tests
Every statistical test produces a p-value. The test statistic varies but the interpretation is the same:
- T-test: t-statistic → p-value from t-distribution
- ANOVA: F-statistic → p-value from F-distribution
- Chi-square test: χ²-statistic → p-value from chi-square distribution
- Regression: F-statistic (overall) or t-statistic (each coefficient)
- Correlation: t-statistic → p-value testing whether the true correlation is 0
The Replication Crisis and P-Values
Many published findings with p < 0.05 have failed to replicate. Reasons include: p-hacking (testing many hypotheses until one is significant), publication bias (only publishing significant results), and inadequate sample sizes.
Modern best practices go beyond the p-value: report confidence intervals and effect sizes, and consider pre-registering hypotheses to prevent data dredging.
The Formal Definition of P-Value
The p-value is the probability of observing a test statistic at least as extreme as the one calculated from your sample data, assuming the null hypothesis is true. This definition is dense, so let us unpack it carefully. The null hypothesis (H₀) is your default assumption: for instance, that a drug has no effect, or that two groups have equal means.
When you calculate a p-value of 0.03, it means: if the null hypothesis were actually true, there would be only a 3% probability of getting results at least as extreme as yours by random chance alone. This is considered unlikely enough to cast doubt on the null hypothesis.
How P-Values Are Calculated
P-values come from comparing your test statistic to a theoretical probability distribution. Different tests use different distributions. A t-test uses the t-distribution, ANOVA uses the F-distribution, and chi-square tests use the chi-square distribution. Each distribution assigns probabilities to ranges of test statistic values.
For a two-tailed test, the p-value covers both extreme tails of the distribution: values more extreme in either direction. For a one-tailed test, only one tail is considered. Two-tailed tests are more conservative and are the default choice unless you have a strong directional hypothesis.
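The difference between one-tailed and two-tailed p-values is easiest to see with a z statistic, where the reference distribution is the standard normal. The following is a minimal sketch; the value z = 1.8 is a hypothetical test statistic chosen so that the two conventions disagree at α = 0.05.

```python
from statistics import NormalDist

def p_values_from_z(z: float) -> dict:
    """One-tailed and two-tailed p-values for a standard-normal test statistic."""
    std_normal = NormalDist()
    upper = 1 - std_normal.cdf(z)      # H1: effect is greater than the null value
    lower = std_normal.cdf(z)          # H1: effect is smaller than the null value
    two_sided = 2 * min(upper, lower)  # H1: effect differs in either direction
    return {"upper": upper, "lower": lower, "two_sided": two_sided}

result = p_values_from_z(1.8)
print(f"one-tailed (upper): {result['upper']:.3f}")      # 0.036
print(f"two-tailed:         {result['two_sided']:.3f}")  # 0.072
```

With the same data, the one-tailed test falls below 0.05 and the two-tailed test does not, which is why the direction of the hypothesis should be fixed before looking at the data.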
The Significance Threshold: Why 0.05?
The conventional threshold of α = 0.05 (5%) was introduced by Ronald Fisher in the 1920s as a rough guideline, not a sacred law. Fisher himself noted that researchers should use their own judgment. Yet the 0.05 threshold became entrenched in scientific publishing, creating an artificial boundary between "significant" and "not significant" results.
Many fields now use stricter thresholds: particle physics conventionally requires 5 sigma, a one-sided p-value of about 3 × 10⁻⁷, before claiming a discovery; medical research sometimes adopts thresholds stricter than 0.05, such as p < 0.01. The American Statistical Association has emphasised that p-values alone should not determine scientific conclusions.
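A "sigma" threshold can be translated into a p-value with the normal tail function. The sketch below assumes the usual one-sided convention (the probability of a standard normal value exceeding the given number of standard deviations); the function name is illustrative.

```python
from math import erfc, sqrt

def one_sided_p_from_sigma(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * erfc(n_sigma / sqrt(2))

for n_sigma in (2, 3, 5):
    print(f"{n_sigma} sigma -> one-sided p = {one_sided_p_from_sigma(n_sigma):.3g}")
# 5 sigma comes out at about 2.87e-07
```

Using `erfc` rather than `1 - cdf` avoids losing precision when the tail probability is this small.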
What P-Values Cannot Tell You
P-values are widely misunderstood. Here is what a p-value does NOT tell you:
- It does not tell you the probability that the null hypothesis is true
- It does not tell you the probability that your results occurred by chance
- It does not measure the size or practical importance of an effect
- It does not tell you whether your study will replicate
- It does not account for whether your study was well designed
A common misinterpretation is "p = 0.04 means there is a 4% chance the null hypothesis is true." This is incorrect. The p-value is calculated assuming H₀ is true; it says nothing about the probability of H₀ itself.
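Bayes' rule shows why the p-value and the probability of H₀ are different quantities. The sketch below uses hypothetical inputs: a field where 90% of tested hypotheses are true nulls, tests run at α = 0.05, and 80% power against real effects. None of these figures come from the text; they are assumptions for illustration.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(H0 is true | test was significant), by Bayes' rule."""
    false_positives = prior_null * alpha        # true nulls that reach significance
    true_positives = (1 - prior_null) * power   # real effects that reach significance
    return false_positives / (false_positives + true_positives)

share = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(H0 | significant) = {share:.2f}")  # 0.36
```

Under these assumptions, more than a third of significant results come from true nulls, even though every one of them has p < 0.05. The answer depends on the prior and the power, which the p-value alone does not contain.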
P-Values and Sample Size: A Critical Relationship
One of the most important but underappreciated facts about p-values is their relationship with sample size. With a very large sample, even tiny, practically meaningless differences become statistically significant. With a small sample, even large, important differences may not reach significance.
For example, an online retailer testing two button colours on tens of millions of users might find that colour A's click rate is a few hundredths of a percentage point higher than colour B's, with p < 0.001. The difference is statistically significant but may be economically trivial. Conversely, a clinical trial with 20 patients might fail to detect a genuine treatment effect simply because the study was underpowered.
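Both halves of this relationship can be reproduced with a two-proportion z-test. The figures below are assumptions made up for illustration: 10 million users per button colour with click rates of 10.05% and 10.00%, and a trial with 10 patients per arm in which 7 versus 4 respond. The normal approximation is rough for the small trial and is used here only to keep the sketch short.

```python
from math import erfc, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    return rate_a - rate_b, z, erfc(abs(z) / sqrt(2))

# Huge sample, tiny difference (0.05 percentage points)
diff_big, z_big, p_big = two_proportion_z_test(1_005_000, 10_000_000, 1_000_000, 10_000_000)
print(f"A/B test: diff = {diff_big:.4f}, z = {z_big:.2f}, p = {p_big:.4f}")

# Tiny sample, large difference (30 percentage points)
diff_small, z_small, p_small = two_proportion_z_test(7, 10, 4, 10)
print(f"Trial:    diff = {diff_small:.2f}, z = {z_small:.2f}, p = {p_small:.3f}")
```

The A/B test reaches p < 0.001 for a difference of 0.0005, while the small trial does not reach p < 0.05 despite a difference of 0.30.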
Multiple Testing Problem
When you conduct multiple hypothesis tests simultaneously, the chance of getting at least one false positive increases dramatically. If you test 20 independent hypotheses each at α = 0.05, you would expect 1 false positive on average even if all null hypotheses are true, and the probability of at least one false positive is 1 − 0.95²⁰ ≈ 64%. This is why large-scale studies (genetic association studies, brain imaging research) apply corrections such as the Bonferroni correction (divide α by the number of tests) or false discovery rate (FDR) control.
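The sketch below puts numbers on the multiple testing problem and shows the two corrections named above. The FDR procedure shown is Benjamini-Hochberg, one common way to control the false discovery rate; the text does not say which FDR method is meant, so treat that choice, the function names, and the example p-values as assumptions.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject only where p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= rank/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= max_rank
    return reject

print(f"{familywise_error_rate(20):.2f}")  # 0.64

p_values = [0.001, 0.008, 0.012, 0.041, 0.20]  # hypothetical results from 5 tests
print(bonferroni_reject(p_values))          # [True, True, False, False, False]
print(benjamini_hochberg_reject(p_values))  # [True, True, True, False, False]
```

Bonferroni is the stricter of the two here: it guards against any false positive at all, while Benjamini-Hochberg limits the expected share of false positives among the rejections.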
Alternatives and Complements to P-Values
The scientific community increasingly recommends reporting confidence intervals alongside p-values. A 95% confidence interval shows the range of plausible values for the true parameter, conveying both statistical significance and practical meaningfulness. An effect size measure (Cohen's d, eta-squared, odds ratio) shows the magnitude of the effect independently of sample size.
Bayesian methods offer an alternative framework using Bayes factors, which directly compare the evidence for and against hypotheses. Unlike p-values, Bayes factors can provide evidence for the null hypothesis, which frequentist p-values cannot.
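As a small illustration of a Bayes factor, consider a coin flipped 20 times. The sketch below compares H₀: p = 0.5 with an alternative that places a uniform prior on the probability of heads. The uniform prior is an assumption chosen because it gives a closed form (the marginal likelihood of any head count is 1/(n + 1)); other priors would give different Bayes factors.

```python
from math import comb

def bayes_factor_10(heads: int, flips: int) -> float:
    """Bayes factor for H1 (uniform prior on p) against H0 (p = 0.5)."""
    likelihood_h0 = comb(flips, heads) * 0.5 ** flips
    marginal_h1 = 1 / (flips + 1)  # binomial likelihood averaged over Uniform(0, 1)
    return marginal_h1 / likelihood_h0

bf_15 = bayes_factor_10(15, 20)
bf_10 = bayes_factor_10(10, 20)
print(f"15 heads: BF10 = {bf_15:.2f}")                            # 3.22
print(f"10 heads: BF10 = {bf_10:.2f}, BF01 = {1 / bf_10:.2f}")    # 0.27 and 3.70
```

With 15 heads the data favour the alternative by a factor of about 3. With 10 heads the factor flips and the data favour the fair-coin hypothesis by about 3.7 to 1, a statement in support of H₀ that a p-value cannot make.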
Complete Worked Example: Calculating a P-Value by Hand
A coin is suspected to be biased. You flip it 20 times and get 15 heads. What is the p-value for testing H₀: p = 0.5 (fair coin) vs H₁: p ≠ 0.5 (two-tailed)?
Under H₀, the number of heads X ~ Binomial(20, 0.5). The p-value is the probability of a result at least as extreme as 15 heads in either direction. P(X ≥ 15) = P(X=15) + P(X=16) + ... + P(X=20). By symmetry, P(X ≤ 5) = P(X ≥ 15). p-value = 2 × P(X ≥ 15) = 2 × [C(20,15)×0.5²⁰ + ... + C(20,20)×0.5²⁰] = 2 × (15504 + 4845 + 1140 + 190 + 20 + 1)/1048576 = 2 × 21700/1048576 ≈ 0.0414.
Since p = 0.041 < 0.05, reject H₀ at the 5% significance level. There is statistically significant evidence the coin is biased. However, note the 95% CI for the probability of heads: approximately [0.51, 0.91] (exact binomial interval). The bias could be modest or large, and the sample is small. The p-value says "significant"; the CI clarifies "we are uncertain how biased."
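The hand calculation can be reproduced in a few lines. This is a sketch using only the standard library; it doubles the smaller tail, which matches the symmetry argument above and is valid because Binomial(n, 0.5) is symmetric. Function names are illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, prob: float = 0.5) -> float:
    """P(X = k) for X ~ Binomial(n, prob)."""
    return comb(n, k) * prob**k * (1 - prob) ** (n - k)

def two_sided_binomial_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value for H0: fair coin, by doubling the smaller tail."""
    upper = sum(binomial_pmf(k, flips) for k in range(heads, flips + 1))
    lower = sum(binomial_pmf(k, flips) for k in range(0, heads + 1))
    return min(1.0, 2 * min(upper, lower))

p_value = two_sided_binomial_p(15, 20)
print(f"p-value = {p_value:.4f}")  # 0.0414
```

The `min(1.0, ...)` guard matters only for results near the centre of the distribution, where doubling a tail could otherwise exceed 1.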
P-Value Interpretation: Five Common Misconceptions Corrected
Statistical education research consistently finds that most people, including researchers, misinterpret p-values. Here are the five most dangerous misconceptions with corrections:
Misconception 1: "p = 0.04 means there is a 4% probability the null hypothesis is true." Correction: The p-value is calculated assuming H₀ is true; it cannot give the probability that H₀ is true (that requires Bayesian analysis with a prior).
Misconception 2: "p = 0.06 means the result is almost significant." Correction: Within a pre-specified decision rule there is no "almost significant": the threshold is either crossed or it is not. "Trending toward significance" is not a statistical concept.
Misconception 3: "A small p-value means the effect is large." Correction: A p-value reflects both effect size and sample size. A tiny effect can produce p < 0.001 with a large sample. Always report effect sizes.
Misconception 4: "If p > 0.05, the null hypothesis is true." Correction: Failing to reject H₀ ≠ evidence for H₀. The study may simply be underpowered.
Misconception 5: "p = 0.049 and p = 0.051 are meaningfully different." Correction: The threshold is arbitrary. Both values provide similar (weak) evidence against H₀; the decision flip is an artefact of the threshold convention.
Deep Dive: P-Value Theory, Assumptions, and Best Practices
This section looks more closely at the theory behind p-values, the assumptions of the tests that produce them, effect size reporting, and common mistakes.
Mathematical Foundation
Every statistical procedure rests on a mathematical model of how data is generated. A p-value is only as trustworthy as the assumptions of the test that produced it: when those data-generating conditions hold, the test delivers its stated Type I error rate and power. Understanding these foundations helps you know when results are trustworthy and when to seek alternatives.
Assumptions and Diagnostics
Before interpreting any result, verify all assumptions are satisfied. Common assumption violations and their remedies:
- Non-normality: For small samples, use non-parametric alternatives or bootstrap methods. For large samples, the Central Limit Theorem typically provides robustness.
- Outliers: Identify using IQR fences or modified z-scores. Investigate each outlier: correct data errors, but do not delete genuine extreme observations without disclosure.
- Independence violations: Clustered or longitudinal data requires mixed models or GEE rather than standard methods assuming independence.
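As one concrete diagnostic from the list above, the IQR fence rule flags values that lie more than 1.5 × IQR beyond the quartiles. The sketch below uses the standard library's default quartile method; other quartile definitions can shift the fences slightly. The data are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k: float = 1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]  # hypothetical measurements
print(iqr_outliers(data))  # [95]
```

Flagging a value is only the first step: as noted above, each flagged observation still needs to be investigated rather than silently dropped.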
Interpreting Your Results Completely
A complete interpretation always includes: (1) the test statistic value, (2) degrees of freedom, (3) exact p-value, (4) confidence interval for the parameter of interest, (5) effect size with interpretation, and (6) a plain-language conclusion. Never report just a p-value: it communicates only one dimension of a multi-dimensional result.
Effect Size and Practical Significance
Statistical significance tells you that an effect is detectable; effect size tells you whether it matters. For every test, compute and report the appropriate effect size measure alongside the p-value. Use field-specific benchmarks (not just Cohen's generic small/medium/large) to evaluate practical significance.
Common Errors and How to Avoid Them
- Multiple testing without correction: Apply Bonferroni, Holm, or FDR corrections whenever running more than one test on the same dataset.
- Confusing statistical and practical significance: Always ask "is this large enough to matter?" not just "is this detectable?"
- p-hacking: Pre-register hypotheses, analysis plans, and significance thresholds before seeing data.
- Overlooking assumptions: Verify independence, normality (or large n), and homogeneity of variance before applying parametric tests.
When a Test Is Not Appropriate
Every test has boundaries of appropriate application. Understand when to use non-parametric alternatives, when to switch to more complex models, and when the research question requires a different analytic framework entirely. Using the wrong test produces incorrect Type I error rates and power, even if the computation is done correctly.
Reporting in Academic and Professional Contexts
Follow APA 7th edition reporting format for academic publications: report the test statistic with its symbol (t, F, χ², z), degrees of freedom in parentheses, exact p-value to two or three decimal places, and confidence intervals. Example: "A one-sample t-test indicated that study time significantly exceeded the 10-hour benchmark, t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]."
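A small helper can keep this formatting consistent across a report. The sketch below is an illustrative assumption about how such a helper might look, not an official APA tool: it drops the leading zero from p, reports very small values as "p < .001", and leaves italics to the word processor.

```python
def format_p_apa(p: float) -> str:
    """Format a p-value without a leading zero, flooring at p < .001."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def format_t_test_apa(t, df, p, d, ci_low, ci_high) -> str:
    """Assemble the statistics portion of a t-test report."""
    return (
        f"t({df}) = {t:.2f}, {format_p_apa(p)}, d = {d:.2f}, "
        f"95% CI [{ci_low:.1f}, {ci_high:.1f}]"
    )

report = format_t_test_apa(2.84, 23, 0.009, 0.58, 10.7, 13.2)
print(report)  # t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]
```

The input values are the ones from the example sentence above.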