Linear Regression: Complete Guide with Examples & Formula

Linear regression is one of the most powerful and widely used tools in data analysis. It models the relationship between variables and enables predictions. This guide explains everything from the basic equation to assumptions, interpretation, and practical use.

What is Linear Regression?

Linear regression fits the best straight line through your data, describing how the dependent variable (Y) changes as the independent variable (X) changes. The line is called the regression line or line of best fit.

ŷ = a + bx

Where: ŷ = predicted Y value, a = intercept (Y when X=0), b = slope (change in Y per unit change in X).

Interpreting the Slope and Intercept

Slope (b): For every 1-unit increase in X, predicted Y changes by b units. For example, b = 2.5 means Y increases by 2.5 for each unit increase in X. A negative b means a negative relationship: Y decreases as X increases.

Intercept (a): The predicted value of Y when X = 0. Sometimes not meaningful (e.g. predicted height when age = 0).
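As a small illustration of the slope and intercept interpretation above, the sketch below evaluates ŷ = a + bx. The intercept a = 10 is a hypothetical value chosen for the example; the slope b = 2.5 is the value used in the text.

```python
def predict(x, a, b):
    """Return the predicted value y-hat = a + b*x."""
    return a + b * x

# Hypothetical coefficients, chosen only for illustration.
a, b = 10.0, 2.5

y_at_4 = predict(4, a, b)   # 10 + 2.5*4 = 20.0
y_at_5 = predict(5, a, b)   # 10 + 2.5*5 = 22.5

# A 1-unit increase in x changes the prediction by exactly b.
print(y_at_5 - y_at_4)      # 2.5
# The intercept is the prediction at x = 0.
print(predict(0, a, b))     # 10.0
```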

How the Line is Found — Least Squares

The regression line minimises the sum of squared residuals: SSresid = Σ(yᵢ − ŷᵢ)². Each residual (eᵢ = yᵢ − ŷᵢ) is the signed vertical distance from a data point to the line. Squaring stops positive and negative residuals from cancelling and penalises large errors more heavily; the line with the smallest sum is the least-squares line.
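To make the least-squares criterion concrete, here is a minimal sketch that compares the sum of squared residuals for two candidate lines. The four data points are made up for illustration; for this data the least-squares line works out to ŷ = 1.9x.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between each y and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

ss_ols = sum_squared_residuals(xs, ys, a=0.0, b=1.9)    # least-squares line for this data
ss_other = sum_squared_residuals(xs, ys, a=0.0, b=2.0)  # a nearby alternative line

print(round(ss_ols, 2), round(ss_other, 2))  # 0.7 1.0
```

The alternative line passes exactly through three of the four points, yet its total squared error is larger because squaring penalises its single big miss.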

Understanding R² (Coefficient of Determination)

R² tells you how well your regression line fits the data: what percentage of the variation in Y is explained by X.

R²	Interpretation	Strength of fit
R² = 0.95	95% of variance in Y explained by X	Strong fit
R² = 0.60	60% explained	Moderate fit
R² = 0.20	20% explained	Weak fit
R² = 0.00	X explains nothing about Y	No linear relationship

What counts as "good" R² depends on the field. In physics, R² > 0.99 is expected. In social sciences, R² = 0.40 can be a strong result because human behaviour is inherently variable.

Assumptions of Linear Regression

Linearity: The relationship between X and Y is linear. Check with scatter plot.
Independence: Observations are independent. Often violated in time series (autocorrelation).
Homoscedasticity: Residuals have constant variance across all X values. Check residuals vs fitted plot.
Normality of residuals: Residuals are approximately normally distributed. Check Q-Q plot.
No multicollinearity: For multiple regression — predictors should not be highly correlated with each other.

Making Predictions with Regression

After fitting the line ŷ = a + bx, plug in any X value to get a predicted Y:

Interpolation: Predicting within the range of your data — generally reliable.

Extrapolation: Predicting outside the range of your data — risky. The linear relationship may not hold beyond your data range.
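The distinction between interpolation and extrapolation can be checked in code. The sketch below is one possible approach, not a standard API: the function name, the coefficients, and the data range are all hypothetical.

```python
def predict_with_range_check(x_new, a, b, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line a + b*x.

    x_min and x_max are the smallest and largest x values in the fitted data.
    """
    is_extrapolation = x_new < x_min or x_new > x_max
    return a + b * x_new, is_extrapolation

# Hypothetical fitted line and data range, for illustration only.
a, b = 10.0, 2.5
x_min, x_max = 1.0, 20.0

inside = predict_with_range_check(12.0, a, b, x_min, x_max)
outside = predict_with_range_check(30.0, a, b, x_min, x_max)
print(inside)   # (40.0, False)
print(outside)  # (85.0, True)
```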

When to Use Linear Regression

You have a continuous outcome variable
You want to quantify the relationship between X and Y
You want to make predictions from new X values
The relationship appears linear on a scatter plot

When the outcome is binary (yes/no), use logistic regression. When the relationship is curved, use polynomial regression or transform the variables.

The Intuition Behind Linear Regression

Linear regression finds the line that best describes the relationship between two variables by minimising prediction errors. Imagine plotting study hours (x) against exam scores (y) for 50 students. A line through the cloud of points lets you predict a new student's score from their study hours. The "best" line minimises the total squared vertical distance from each point to the line. This is the Ordinary Least Squares (OLS) criterion.

The Simple Linear Regression Model

The model is: y = β₀ + β₁x + ε, where β₀ is the y-intercept (predicted y when x=0), β₁ is the slope (change in y per unit increase in x), and ε is the error term (random noise). The OLS estimates are:

b₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = Cov(X,Y) / Var(X)

b₀ = ȳ − b₁x̄
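The two OLS formulas above translate directly into code. This is a minimal sketch in plain Python; the function name fit_simple_ols and the data are assumptions made for the example.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = s_xy / s_xx            # slope: Sxy / Sxx
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x, so the fit should recover 1 and 2.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)  # 1.0 2.0
```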

Interpreting R² (Coefficient of Determination)

R² measures the proportion of variance in y explained by x. R² = 1 − (SS_residual / SS_total). An R² of 0.72 means 72% of the variation in y is explained by the linear relationship with x. The remaining 28% is unexplained by the model.
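The formula R² = 1 − (SS_residual / SS_total) can be sketched as follows. The observations and predictions are made up; the predictions come from the line ŷ = 1.9x evaluated at x = 1, 2, 3, 4.

```python
def r_squared(ys, y_hats):
    """R-squared = 1 - SS_residual / SS_total."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Made-up observations and the predictions of a fitted line.
ys = [2, 4, 5, 8]
y_hats = [1.9, 3.8, 5.7, 7.6]
print(round(r_squared(ys, y_hats), 3))  # 0.963
```

Here SS_residual is 0.70 and SS_total is 18.75, so about 96% of the variation in y is accounted for by the line.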

R² ranges from 0 to 1 for a least-squares fit that includes an intercept. Higher is better, but context matters enormously. In social sciences, R² = 0.30 might be excellent. In engineering physics, R² = 0.99 might be expected. Never evaluate a regression solely on R²: examine residuals and consider the practical significance of the relationship.

Assumptions of OLS Regression

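OLS regression is usually taught with five key assumptions. They overlap with, but are not identical to, the Gauss-Markov conditions: Gauss-Markov does not require normal residuals, which matter only for exact tests and confidence intervals.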

Linearity: The true relationship between X and Y is linear
Independence: Observations are independent of each other
Homoscedasticity: Variance of residuals is constant across all values of X
Normality: Residuals are normally distributed (needed for inference)
No multicollinearity: (For multiple regression) predictors are not highly correlated with each other

Diagnosing Regression with Residual Plots

Residuals (observed − predicted) contain diagnostic information. Plot residuals against fitted values to check linearity and homoscedasticity: a random scatter with no pattern suggests both assumptions hold. A funnel shape indicates heteroscedasticity; a curved pattern indicates non-linearity. A normal Q-Q plot of residuals checks normality. Points with high leverage or a large Cook's distance may disproportionately influence the regression line.
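A curved residual pattern can be seen even without a plot. The sketch below fits a straight line to deliberately curved, made-up data (y = x²) and prints the sign of each residual in order of x. In practice you would draw the residuals vs fitted plot; this is only a rough text stand-in.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Made-up curved data (y = x squared), deliberately violating linearity.
xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Listed in order of x (and so of fitted value, since the slope is positive).
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # +----+
```

Positive residuals at both ends and negative ones in the middle are the curved pattern that signals non-linearity; residuals from a well-specified model would show no such run of signs.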

Multiple Linear Regression

Multiple regression extends simple regression to include multiple predictors: y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε. Each coefficient βᵢ represents the change in y per unit increase in xᵢ, holding all other predictors constant. This "all else equal" interpretation is powerful but requires careful consideration of multicollinearity.

Model selection involves choosing which predictors to include. Forward selection adds predictors one at a time; backward elimination starts with all predictors and removes them one at a time. Stepwise methods combine both. Modern approaches use regularisation (Ridge and Lasso regression), which shrinks coefficients toward zero and reduces overfitting, especially in high-dimensional data.
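The multiple regression model y = β₀ + β₁x₁ + ... + βₚxₚ + ε can be fitted by solving the normal equations (XᵀX)b = Xᵀy. The sketch below does this in plain Python with a small Gaussian elimination routine. It is an illustration under simplifying assumptions, not a recommended production method: statistical libraries use more numerically stable decompositions, and the function names and data here are made up.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | b] so row operations apply to both.
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (predictors may be perfectly collinear)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

def fit_multiple_ols(rows, ys):
    """Return [b0, b1, ..., bp] solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in rows]   # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_multiple_ols(rows, ys)
print([round(c, 6) for c in coefs])  # [1.0, 2.0, 3.0]
```

The singular-matrix error corresponds to perfect multicollinearity: if one predictor is an exact linear combination of the others, the coefficients are not uniquely determined.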

Practical Example: Housing Price Prediction

A real estate analyst fits a multiple regression predicting house price (in thousands) from size (sq ft), bedrooms, and distance to city centre (km):

Price = 50 + 0.12×Size + 15×Bedrooms − 8×Distance

Interpretation: Each additional square foot adds $120 (controlling for bedrooms and distance). Each additional bedroom adds $15,000. Each additional kilometre from the city reduces price by $8,000. A 2,000 sq ft, 3-bedroom house 5 km from the city is predicted at: 50 + 0.12×2000 + 15×3 − 8×5 = 50 + 240 + 45 − 40 = $295,000.
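The fitted housing equation can be wrapped in a small function to reproduce the prediction and the "holding other predictors constant" reading of a coefficient. The function name is an assumption for the example; the coefficients are those given above.

```python
def predict_price_thousands(size_sqft, bedrooms, distance_km):
    """Predicted price in thousands of dollars from the fitted equation."""
    return 50 + 0.12 * size_sqft + 15 * bedrooms - 8 * distance_km

price = predict_price_thousands(2000, 3, 5)
print(round(price, 2))  # 295.0, i.e. $295,000

# One extra bedroom with size and distance held fixed changes the
# prediction by the bedroom coefficient (15, i.e. $15,000).
extra_bedroom = predict_price_thousands(2000, 4, 5) - price
print(round(extra_bedroom, 2))  # 15.0
```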

Complete Worked Example: Predicting Student Exam Scores

A professor collects data on study hours (x) and final exam scores (y) for 8 students: (2,50), (3,55), (4,65), (5,70), (6,75), (7,80), (8,85), (9,90).

x̄ = 5.5, ȳ = 71.25.

Sxy = Σ(xᵢ−x̄)(yᵢ−ȳ) = (2−5.5)(50−71.25) + ... + (9−5.5)(90−71.25)
= (−3.5)(−21.25) + (−2.5)(−16.25) + (−1.5)(−6.25) + (−0.5)(−1.25) + (0.5)(3.75) + (1.5)(8.75) + (2.5)(13.75) + (3.5)(18.75)
= 74.375 + 40.625 + 9.375 + 0.625 + 1.875 + 13.125 + 34.375 + 65.625 = 240.

Sxx = Σ(xᵢ−x̄)² = 12.25 + 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 + 12.25 = 42.

b₁ = 240/42 = 5.714. b₀ = 71.25 − 5.714×5.5 = 71.25 − 31.43 = 39.82. Equation: ŷ = 39.82 + 5.71x.

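Prediction: a student studying 10 hours → ŷ = 39.82 + 5.714×10 ≈ 97.0. Because 10 hours lies just outside the observed range of 2 to 9 hours, this is a mild extrapolation. R² = Sxy²/(Sxx×Syy) = 240²/(42×1387.5) ≈ 0.988, so study hours explain 98.8% of the variance in exam scores. This remarkably high R² reflects the clean, near-linear simulated relationship.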
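The worked example can be checked with a few lines of Python. This sketch carries full floating-point precision rather than rounding the coefficients first, so its figures can differ in the last digit from hand-rounded values.

```python
# Study hours and exam scores for the 8 students in the example.
xs = [2, 3, 4, 5, 6, 7, 8, 9]
ys = [50, 55, 65, 70, 75, 80, 85, 90]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r2 = s_xy ** 2 / (s_xx * s_yy)

print(x_bar, y_bar)                              # 5.5 71.25
print(s_xy, s_xx, s_yy)                          # 240.0 42.0 1387.5
print(round(b1, 3), round(b0, 2), round(r2, 3))  # 5.714 39.82 0.988
print(round(b0 + b1 * 10, 1))                    # 97.0
```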

Diagnosing Problems: Real Residual Plots

In a real dataset predicting house prices from size, plotting residuals against fitted values reveals a fan shape (residuals spread out as the fitted value increases), which is classic heteroscedasticity. Larger houses have more variable prices (a 3,000 sq ft house can range from $400K to $1.2M depending on neighbourhood). Solution: log-transform the price variable. After log transformation, residuals scatter randomly around zero with constant variance, so the assumption is met. The model becomes: log(price) = b₀ + b₁×size. With a natural log, 100×b₁ is approximately the percentage change in price per additional square foot (exactly, 100×(e^b₁ − 1)), a more natural economic interpretation.
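The percentage interpretation of a log-price model can be made concrete. The sketch below assumes a natural-log model, ln(price) = b₀ + b₁×size, and a hypothetical slope b₁ = 0.0004 that does not come from any real dataset. It compares the exact percentage change with the common 100×b₁ approximation.

```python
import math

def percent_change(b1, delta=1.0):
    """Exact percentage change in price when size increases by `delta`,
    for the model ln(price) = b0 + b1 * size (natural log assumed)."""
    return (math.exp(b1 * delta) - 1) * 100

# Hypothetical slope, for illustration only.
b1 = 0.0004

exact_100 = percent_change(b1, delta=100)   # effect of 100 extra sq ft
approx_100 = 100 * b1 * 100                 # linear approximation

print(round(exact_100, 2), round(approx_100, 2))  # 4.08 4.0
```

For a single square foot the two figures are nearly identical; the approximation drifts as b₁ × delta grows, which is why the exact form is safer for larger changes.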
