Linear regression is one of the most powerful and widely used tools in data analysis. It models the relationship between variables and enables predictions. This guide explains everything from the basic equation to assumptions, interpretation, and practical use.
What is Linear Regression?
Linear regression fits the best straight line through your data, describing how the dependent variable (Y) changes as the independent variable (X) changes. The line is called the regression line or line of best fit.
ŷ = a + bx
Where: ŷ = predicted Y value, a = intercept (Y when X=0), b = slope (change in Y per unit change in X).
Interpreting the Slope and Intercept
Slope (b): For every 1-unit increase in X, the predicted Y changes by b units. For example, b = 2.5 means the predicted Y increases by 2.5 for each unit increase in X. A negative b means Y tends to decrease as X increases.
Intercept (a): The predicted value of Y when X = 0. Sometimes not meaningful (e.g. predicted height when age = 0).
How the Line is Found — Least Squares
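The regression line minimises the sum of squared residuals: SSresid = Σ(yᵢ − ŷᵢ)². Each residual (eᵢ = yᵢ − ŷᵢ) is the vertical distance from a data point to the line. Squaring stops positive and negative residuals from cancelling and penalises large misses more heavily. The least-squares line is the one whose sum of squared residuals is smaller than that of any other straight line.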
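To make the least-squares criterion concrete, the sketch below computes the sum of squared residuals for a few candidate lines in plain Python. The data points and the function name `sum_squared_residuals` are invented for illustration and do not come from a real dataset.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

ss_candidate_1 = sum_squared_residuals(xs, ys, a=0.0, b=2.0)    # line y = 2x
ss_candidate_2 = sum_squared_residuals(xs, ys, a=1.0, b=1.0)    # line y = 1 + x
ss_least_squares = sum_squared_residuals(xs, ys, a=0.0, b=1.9)  # least-squares line for this data

print(ss_candidate_1, ss_candidate_2, ss_least_squares)
```

For these four points the least-squares line (a = 0, b = 1.9) has a smaller sum of squared residuals than either of the other two candidates.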
Understanding R² (Coefficient of Determination)
R² tells you how well your regression line fits the data: what percentage of the variation in Y is explained by X.
| R² | Interpretation | Context |
|---|---|---|
| R² = 0.95 | 95% of variance in Y explained by X | Strong fit |
| R² = 0.60 | 60% explained | Moderate fit |
| R² = 0.20 | 20% explained | Weak fit |
| R² = 0.00 | X explains nothing about Y | No linear relationship |
What counts as "good" R² depends on the field. In physics, R² > 0.99 is expected. In social sciences, R² = 0.40 can be a strong result because human behaviour is inherently variable.
Assumptions of Linear Regression
- Linearity: The relationship between X and Y is linear. Check with scatter plot.
- Independence: Observations are independent. Violated in time series (autocorrelation).
- Homoscedasticity: Residuals have constant variance across all X values. Check residuals vs fitted plot.
- Normality of residuals: Residuals are approximately normally distributed. Check Q-Q plot.
- No multicollinearity: For multiple regression — predictors should not be highly correlated with each other.
Making Predictions with Regression
After fitting the line ŷ = a + bx, plug in any X value to get a predicted Y:
Interpolation: Predicting within the range of your data — generally reliable.
Extrapolation: Predicting outside the range of your data — risky. The linear relationship may not hold beyond your data range.
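The distinction between interpolation and extrapolation can be sketched in a few lines of Python. The coefficients and the observed X range below are hypothetical values chosen only for illustration, and `predict` is an assumed helper name.

```python
def predict(x_new, a, b, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the fitted line y_hat = a + b*x.

    x_min and x_max are the smallest and largest X values in the data
    that were used to fit the line.
    """
    y_hat = a + b * x_new
    is_extrapolation = x_new < x_min or x_new > x_max
    return y_hat, is_extrapolation


# Hypothetical fitted line y_hat = 10 + 2.5x, fitted on X values from 0 to 20.
inside = predict(8, a=10.0, b=2.5, x_min=0, x_max=20)
outside = predict(25, a=10.0, b=2.5, x_min=0, x_max=20)

print(inside)   # prediction within the data range
print(outside)  # prediction flagged as extrapolation
```

Flagging the prediction rather than refusing it leaves the judgement to the analyst, who may know whether the linear pattern plausibly continues.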
When to Use Linear Regression
- You have a continuous outcome variable
- You want to quantify the relationship between X and Y
- You want to make predictions from new X values
- The relationship appears linear on a scatter plot
When the outcome is binary (yes/no), use logistic regression. When the relationship is curved, use polynomial regression or transform the variables.
The Intuition Behind Linear Regression
Linear regression finds the line that best describes the relationship between two variables by minimising prediction errors. Imagine plotting study hours (x) against exam scores (y) for 50 students. A line through the cloud of points lets you predict a new student's score from their study hours. The "best" line minimises the total squared vertical distance from each point to the line; this is the Ordinary Least Squares (OLS) criterion.
The Simple Linear Regression Model
The model is: y = β₀ + β₁x + ε, where β₀ is the y-intercept (predicted y when x=0), β₁ is the slope (change in y per unit increase in x), and ε is the error term (random noise). The OLS estimates are:
b₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = Cov(X,Y) / Var(X)
b₀ = ȳ − b₁x̄
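The two formulas above translate directly into code. This is a minimal sketch in plain Python; the function name `fit_simple_ols` and the data are assumptions made for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = s_xy / s_xx          # slope: Sxy / Sxx
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)
```

Because the example points fall exactly on a line, the fit recovers an intercept of 1 and a slope of 2.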
Interpreting R² (Coefficient of Determination)
R² measures the proportion of variance in y explained by x. R² = 1 − (SS_residual / SS_total). An R² of 0.72 means 72% of the variation in y is explained by the linear relationship with x. The remaining 28% is unexplained by the model.
For an OLS fit that includes an intercept, R² ranges from 0 to 1. Higher is better, but context matters enormously. In social sciences, R² = 0.30 might be excellent. In engineering physics, R² = 0.99 might be expected. Never evaluate a regression solely on R²; also examine the residuals and consider the practical significance of the relationship.
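The formula R² = 1 − (SS_residual / SS_total) can be computed in a few lines. The sketch below uses invented data and an assumed helper name `r_squared`; the fitted values come from the least-squares line ŷ = 1.9x for these four points.

```python
def r_squared(ys, y_hats):
    """R^2 = 1 - SS_residual / SS_total for observed ys and fitted y_hats."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical data and the fitted values from the line y_hat = 1.9x.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
y_hats = [1.9 * x for x in xs]

r2 = r_squared(ys, y_hats)
print(round(r2, 3))
```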
Assumptions of OLS Regression
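OLS regression rests on five commonly cited assumptions. They overlap with, but are not identical to, the Gauss-Markov conditions: normality, in particular, is not required by the Gauss-Markov theorem and matters mainly for inference.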
- Linearity: The true relationship between X and Y is linear
- Independence: Observations are independent of each other
- Homoscedasticity: Variance of residuals is constant across all values of X
- Normality: Residuals are normally distributed (needed for inference)
- No multicollinearity: (For multiple regression) predictors are not highly correlated with each other
Diagnosing Regression with Residual Plots
Residuals (observed − predicted) contain diagnostic information. Plot residuals vs fitted values to check linearity and homoscedasticity: a random scatter with no pattern is consistent with both assumptions. A funnel shape indicates heteroscedasticity; a curved pattern indicates non-linearity. A normal Q-Q plot of residuals checks normality. Points with high leverage or a large Cook's distance may disproportionately influence the regression line.
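For simple regression, leverage and Cook's distance can be computed by hand. The sketch below uses the standard formulas hᵢ = 1/n + (xᵢ − x̄)²/Sxx and Dᵢ = (eᵢ² / (p·MSE)) · hᵢ/(1 − hᵢ)² with p = 2 coefficients. The data are invented, with one point placed far from the others in X, and `influence_measures` is an assumed helper name.

```python
def influence_measures(xs, ys):
    """Return (residuals, leverages, cooks_distances) for a simple OLS fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverages = [1 / n + (x - x_bar) ** 2 / s_xx for x in xs]
    cooks = [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]
    return residuals, leverages, cooks


# Hypothetical data: the last point sits far from the others in X.
xs = [1, 2, 3, 4, 5, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

residuals, leverages, cooks = influence_measures(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: cooks[i])
print(most_influential, [round(d, 2) for d in cooks])
```

In this made-up example the isolated point at x = 10 has both the highest leverage and the largest Cook's distance, which is the kind of observation worth investigating before trusting the fitted line.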
Multiple Linear Regression
Multiple regression extends simple regression to include multiple predictors: y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε. Each coefficient βᵢ represents the change in y per unit increase in xᵢ, holding all other predictors constant. This "all else equal" interpretation is powerful but requires careful consideration of multicollinearity.
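Model selection involves choosing which predictors to include. Forward selection starts with no predictors and adds them one at a time; backward elimination starts with all predictors and removes them one at a time. Stepwise methods combine both. Modern approaches use regularisation (Ridge, Lasso regression), which shrinks coefficients toward zero and helps reduce overfitting, especially in high-dimensional data. Lasso can shrink some coefficients exactly to zero, so it also acts as a form of variable selection.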
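The shrinkage effect of Ridge regression is easiest to see with a single predictor. Assuming the penalty λ·b₁² is applied to the slope only (the intercept is left unpenalised), the ridge slope has the closed form b₁ = Sxy / (Sxx + λ). The sketch below is an illustration of that one-predictor case under these assumptions, not a general Ridge implementation; the data and the name `ridge_simple` are invented.

```python
def ridge_simple(xs, ys, lam):
    """One-predictor ridge fit with an unpenalised intercept.

    Minimises sum((y - b0 - b1*x)^2) + lam * b1^2, giving
    b1 = Sxy / (Sxx + lam) and b0 = y_bar - b1 * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / (s_xx + lam)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

_, slope_ols = ridge_simple(xs, ys, lam=0.0)     # lam = 0 reproduces OLS
_, slope_ridge = ridge_simple(xs, ys, lam=10.0)  # larger lam shrinks the slope
print(slope_ols, slope_ridge)
```

With λ = 0 the function returns the ordinary least-squares slope; increasing λ pulls the slope toward zero.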
Practical Example: Housing Price Prediction
A real estate analyst fits a multiple regression predicting house price (in thousands) from size (sq ft), bedrooms, and distance to city centre (km):
Price = 50 + 0.12×Size + 15×Bedrooms − 8×Distance
Interpretation: Each additional square foot adds $120 (controlling for bedrooms and distance). Each additional bedroom adds $15,000. Each additional kilometre from the city reduces price by $8,000. A 2,000 sq ft, 3-bedroom house 5 km from the city is predicted at: 50 + 0.12×2000 + 15×3 − 8×5 = 50 + 240 + 45 − 40 = $295,000.
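The arithmetic above can be wrapped in a small function. The coefficients are the ones given in the example equation; the function name `predict_price_thousands` is an assumption made for illustration.

```python
def predict_price_thousands(size_sqft, bedrooms, distance_km):
    """Predicted house price in thousands of dollars from the example equation."""
    return 50 + 0.12 * size_sqft + 15 * bedrooms - 8 * distance_km


# The house from the worked example: 2,000 sq ft, 3 bedrooms, 5 km from the city.
price = predict_price_thousands(size_sqft=2000, bedrooms=3, distance_km=5)
print(f"${price * 1000:,.0f}")
```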
Complete Worked Example: Predicting Student Exam Scores
A professor collects data on study hours (x) and final exam scores (y) for 8 students: (2,50), (3,55), (4,65), (5,70), (6,75), (7,80), (8,85), (9,90).
x̄ = 5.5, ȳ = 71.25.
Sxy = Σ(xᵢ−x̄)(yᵢ−ȳ) = (−3.5)(−21.25) + (−2.5)(−16.25) + (−1.5)(−6.25) + (−0.5)(−1.25) + (0.5)(3.75) + (1.5)(8.75) + (2.5)(13.75) + (3.5)(18.75) = 74.375 + 40.625 + 9.375 + 0.625 + 1.875 + 13.125 + 34.375 + 65.625 = 240.
Sxx = Σ(xᵢ−x̄)² = 12.25 + 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 + 12.25 = 42.
b₁ = 240/42 = 5.714. b₀ = 71.25 − 5.714×5.5 = 71.25 − 31.43 = 39.82. Equation: ŷ = 39.82 + 5.71x.
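Prediction: a student studying 10 hours → ŷ = 39.82 + 5.71×10 = 96.9. Note that 10 hours lies just outside the observed range of 2 to 9 hours, so this is a mild extrapolation. With Syy = Σ(yᵢ−ȳ)² = 1387.5, R² = Sxy²/(Sxx×Syy) = 240²/(42×1387.5) ≈ 0.988, so study hours explain about 98.8% of the variance in exam scores. This very high R² reflects the clean, near-linear pattern in this small illustrative dataset.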
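The hand calculation can be checked with a short script that recomputes every quantity from the eight data points given in the example. Only the variable names are assumptions.

```python
# The eight (study hours, exam score) pairs from the worked example.
data = [(2, 50), (3, 55), (4, 65), (5, 70), (6, 75), (7, 80), (8, 85), (9, 90)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

n = len(data)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b1 = s_xy / s_xx                   # slope
b0 = y_bar - b1 * x_bar            # intercept
r2 = s_xy ** 2 / (s_xx * s_yy)     # coefficient of determination
y_hat_10 = b0 + b1 * 10            # prediction for 10 study hours

print(f"Sxy={s_xy}, Sxx={s_xx}, Syy={s_yy}")
print(f"b1={b1:.3f}, b0={b0:.2f}, R^2={r2:.3f}, prediction={y_hat_10:.1f}")
```

The prediction here uses unrounded coefficients, so it can differ in the last digit from a result computed with the rounded values 39.82 and 5.71.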
Diagnosing Problems: Real Residual Plots
In a real dataset predicting house prices from size, plotting residuals vs fitted values reveals a fan shape (residuals spread out as the fitted value increases), which is classic heteroscedasticity. Larger houses have more variable prices (a 3,000 sq ft house can range from $400K to $1.2M depending on neighbourhood). Solution: log-transform the price variable. After log transformation, residuals scatter randomly around zero with constant variance, so the assumption is met. The model becomes: log(price) = b₀ + b₁×size, using the natural log. The slope b₁ is now interpreted in relative terms: each additional square foot changes price by approximately 100×b₁ percent when b₁ is small (exactly (e^b₁ − 1)×100 percent), a more natural economic interpretation.
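The percentage interpretation of a slope in a log(price) model can be illustrated as follows. The slope value 0.0004 is a hypothetical number chosen for illustration, not an estimate from the dataset described above, and the natural log is assumed.

```python
import math


def percent_change(b1, delta_x=1.0):
    """Exact percentage change in price when x increases by delta_x,
    for a model of the form ln(price) = b0 + b1 * x."""
    return (math.exp(b1 * delta_x) - 1) * 100


b1 = 0.0004  # hypothetical slope per square foot

per_sqft_exact = percent_change(b1)           # effect of one extra square foot
per_sqft_approx = 100 * b1                    # small-slope approximation
per_500_sqft_exact = percent_change(b1, 500)  # effect of 500 extra square feet
per_500_sqft_approx = 100 * b1 * 500

print(round(per_sqft_exact, 4), round(per_sqft_approx, 4))
print(round(per_500_sqft_exact, 2), round(per_500_sqft_approx, 2))
```

For a single square foot the approximation and the exact value are almost identical; for a 500 sq ft increase they diverge noticeably, which is why the exact formula is safer for large changes in X.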
Deep Dive: Linear Regression Theory, Assumptions, and Best Practices
This section takes a closer look at linear regression, covering the mathematical foundation, assumption checking, effect size reporting, common mistakes, and reporting conventions that go beyond introductory coverage.
Mathematical Foundation
Every statistical procedure rests on a mathematical model of how data is generated. Linear regression assumes specific data-generating conditions that, when satisfied, ensure that tests on its coefficients have the stated Type I error rate and power. Understanding these foundations helps you know when results are trustworthy and when to seek alternatives.
Assumptions and Diagnostics
Before interpreting any result, verify all assumptions are satisfied. Common assumption violations and their remedies:
- Non-normality: For small samples, use non-parametric alternatives or bootstrap methods. For large samples, the Central Limit Theorem typically provides robustness.
- Outliers: Identify using IQR fence or modified z-scores. Investigate each outlier — correct data errors, but do not delete genuine extreme observations without disclosure.
- Independence violations: Clustered or longitudinal data requires mixed models or GEE rather than standard methods assuming independence.
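As one way to apply the IQR fence mentioned above, the sketch below flags values more than 1.5 × IQR beyond the quartiles. It relies on `statistics.quantiles` from the Python standard library with its default method; other quartile conventions can give slightly different fences. The values and the helper name `iqr_outliers` are invented for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one extreme observation.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(values)
print(outliers)
```

A flagged value is a prompt to investigate, not a licence to delete: the same function can be applied to regression residuals to find observations the line fits poorly.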
Interpreting Your Results Completely
A complete interpretation always includes: (1) the test statistic value, (2) degrees of freedom, (3) exact p-value, (4) confidence interval for the parameter of interest, (5) effect size with interpretation, and (6) a plain-language conclusion. Never report just a p-value — it communicates only one dimension of a multi-dimensional result.
Effect Size and Practical Significance
Statistical significance tells you that an effect is detectable; effect size tells you whether it matters. For every test, compute and report the appropriate effect size measure alongside the p-value. Use field-specific benchmarks (not just Cohen's generic small/medium/large) to evaluate practical significance.
Common Errors and How to Avoid Them
- Multiple testing without correction: Apply Bonferroni, Holm, or FDR corrections whenever running more than one test on the same dataset.
- Confusing statistical and practical significance: Always ask "is this large enough to matter?" not just "is this detectable?"
- p-hacking: Pre-register hypotheses, analysis plans, and significance thresholds before seeing data.
- Overlooking assumptions: Verify independence, normality (or large n), and homogeneity of variance before applying parametric tests.
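The Bonferroni and Holm corrections named in the first item can be sketched in plain Python. The p-values below are hypothetical, and the function names are assumptions; each function returns one reject/keep decision per test, in the original order.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare the sorted p-values against
    alpha/m, alpha/(m-1), ... and stop at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Hypothetical p-values from four tests on the same dataset.
p_values = [0.001, 0.015, 0.04, 0.30]
bonferroni = bonferroni_reject(p_values)
holm = holm_reject(p_values)
print(bonferroni)
print(holm)
```

In this made-up example Holm rejects one more hypothesis than Bonferroni, which illustrates why Holm is usually preferred: it controls the same family-wise error rate while being less conservative.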
When Linear Regression Is Not Appropriate
Linear regression has boundaries of appropriate application. A binary outcome calls for logistic regression, a curved relationship calls for polynomial terms or transformed variables, and clustered or longitudinal data call for mixed models or GEE. Using the wrong model produces incorrect Type I error rates and power, even if the computation is done correctly.
Reporting in Academic and Professional Contexts
Follow APA 7th edition reporting format for academic publications: report the test statistic with its symbol (t, F, χ², z), degrees of freedom in parentheses, exact p-value to two or three decimal places, and confidence intervals. Example: "A one-sample t-test indicated that study time significantly exceeded the 10-hour benchmark, t(23) = 2.84, p = .009, d = 0.58, 95% CI [10.7, 13.2]."