Outlier Detection Methods — IQR, Z-Score & When to Use

An outlier is a data point that differs significantly from other observations. Outliers can be errors, rare genuine events, or the most interesting data points in your dataset. How you handle them dramatically affects your analysis — so detection must come before the decision to remove.

Why Outliers Matter

Distort the mean — even a single outlier can shift the mean dramatically
Inflate standard deviation — outliers increase variability estimates
Affect regression lines — high-leverage outliers can rotate the entire regression line
Violate normality assumptions — outliers cause non-normal residuals in regression and t-tests

Method 1: IQR Fence Method (Tukey's Method)

The most widely used and intuitive outlier detection method:

Lower fence = Q1 − 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

Any value below the lower fence or above the upper fence is a potential outlier. Using 3×IQR gives extreme outlier fences.

Example: Data: 10, 12, 14, 15, 17, 18, 19, 20, 65. Q1=13, Q3=19, IQR=6. Lower fence = 13 − 9 = 4. Upper fence = 19 + 9 = 28. Value 65 > 28 → outlier detected.

Advantages: Robust (uses median-based quartiles, not the mean). Works well for skewed data. Standard in box plots.

Method 2: Z-Score Method

z = (x − x̄) / s

Values with |z| > 2 are potential outliers. Values with |z| > 3 are extreme outliers (occur only 0.3% of the time in normal data).

Limitations: The z-score method uses the mean and SD — which are themselves affected by outliers. In small samples, extreme values cannot achieve |z| > 3 mathematically. Better for large, approximately normal datasets.

Method 3: Modified Z-Score (Iglewicz-Hoaglin)

M = 0.6745 × (xᵢ − median) / MAD

Where MAD = Median Absolute Deviation. More robust than standard z-score because it uses the median instead of the mean. Values with |M| > 3.5 are outliers.

Method 4: Grubbs' Test

A formal statistical test for whether the most extreme value in a dataset is a statistical outlier. Tests one outlier at a time. Assumes approximately normal data. Available in many statistical software packages.

Should You Remove Outliers?

This is the most important and most mishandled question. The answer depends entirely on WHY the outlier exists:

Remove when:

The outlier is due to a clear data entry error or measurement error
The outlier belongs to a different population than your study group
Equipment malfunction caused the extreme value

Keep when:

The outlier is a genuine, valid observation — even if unusual
Removing it would distort the true picture (e.g. in fraud detection, outliers ARE the signal)
You are studying rare events where outliers are the point

Never:

Remove outliers just to get a significant p-value
Remove outliers without documenting and justifying the decision
Automatically remove all values beyond 2 SDs without investigation

Use our free Outlier Detection Calculator to identify outliers using both IQR and Z-score methods simultaneously.

What is an Outlier?

An outlier is an observation that lies an unusual distance from other values in the dataset. The challenging part is that "unusual" depends on context and the definition you apply. Outliers can arise from measurement error (transcription mistakes, equipment malfunction), data entry errors, genuine extreme values from the natural distribution, or observations from a different population that were mistakenly included. The correct response to an outlier depends entirely on its cause.

The IQR Fence Method (Tukey's Method)

The most widely used outlier detection rule defines fences using the interquartile range. The lower fence = Q1 − 1.5×IQR and upper fence = Q3 + 1.5×IQR. Values beyond these fences are flagged as potential outliers. Values beyond Q1 − 3×IQR or Q3 + 3×IQR are "far outliers." This method is robust — it uses medians and quartiles, which are themselves not influenced by outliers — and works for any distribution shape.

The Z-Score Method

For approximately normal data, compute z-scores: z = (x − x̄)/s. Values with |z| > 3 are often flagged as outliers (only 0.3% of normal data lies beyond 3σ). The limitation is that the mean and standard deviation are themselves influenced by outliers, potentially masking extreme values. Modified z-scores using the median absolute deviation (MAD) are more robust: modified z = 0.6745(x − median)/MAD.

Isolation Forest for High-Dimensional Data

Classical univariate methods fail in high-dimensional data where outliers may only be unusual in combination of features, not in any single dimension. The Isolation Forest algorithm detects outliers by randomly partitioning data and measuring how quickly each point is isolated — outliers are isolated in fewer partitions. This machine learning approach scales well and requires no distributional assumptions, making it powerful for fraud detection, network intrusion detection, and sensor anomaly detection.

DBSCAN: Density-Based Outlier Detection

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters of high-density regions and labels points in low-density regions as outliers. Points that cannot be assigned to any cluster are "noise points" — effectively outliers. This approach handles arbitrary cluster shapes and is effective for spatial data, manufacturing sensor data, and any application where outliers are isolated points in otherwise dense regions.

What to Do With Outliers

Detection is only the first step. The appropriate response depends on the outlier's cause: data entry errors should be corrected or removed; measurement errors should be investigated and corrected; genuine extreme values should be retained (they are real observations); observations from a different population may be removed with clear justification and transparent reporting.

Never silently remove outliers. Always document which observations were removed, why, and conduct sensitivity analyses showing results both with and without outliers. An analysis that only "works" when outliers are removed is fragile and potentially misleading.

What is an Outlier?

The IQR Fence Method (Tukey's Method)

The Z-Score Method

Isolation Forest for High-Dimensional Data

DBSCAN: Density-Based Outlier Detection

What to Do With Outliers

Complete Example: Detecting Outliers in Manufacturing Data

A factory measures the diameter of steel rods (in mm). Sample of 20 measurements: 50.1, 50.2, 49.9, 50.0, 50.3, 50.1, 49.8, 50.2, 50.0, 50.1, 50.2, 49.9, 50.0, 50.1, 50.3, 50.0, 50.1, 50.2, 49.9, 57.3.

Z-score method: Mean = 50.54, SD = 1.61. For 57.3: z = (57.3 − 50.54)/1.61 = 4.20. Since |z| > 3, flagged as outlier. But notice the mean and SD are distorted by the outlier itself.

Modified Z-score (robust): Median = 50.10, MAD = 0.10. Modified z = 0.6745 × (57.3 − 50.10)/0.10 = 48.6. Massively flagged. This robust method is not fooled by the outlier contaminating its own detection statistics.

IQR method: Q1 = 49.95, Q3 = 50.20, IQR = 0.25. Upper fence = 50.20 + 1.5×0.25 = 50.575. Since 57.3 > 50.575, flagged. Investigation reveals this rod was measured after a calibration error. The value is corrected, not deleted, preserving data integrity.

Outliers in Regression: Leverage and Influence

In regression analysis, outliers have different impacts depending on their location. A point with high leverage is far from the mean of X — it has the potential to strongly influence the regression line but may not actually do so if it falls on the line. An influential point (high Cook's distance) actually changes the regression coefficients substantially when removed. High leverage + poor fit = high influence. Always plot Cook's distance and studentised residuals to identify influential observations. A single influential outlier can completely change the slope of a regression line, leading to entirely wrong predictions for most of the data range.

Handling Outliers in Time Series

Time series data presents unique outlier challenges. Additive outliers affect a single point (a flash crash in stock prices). Innovation outliers propagate effects forward through the series (a factory shutdown affecting all subsequent production). Level shifts permanently alter the series mean (a policy change). Seasonal outliers affect only specific calendar periods (holiday sales spikes). Each type requires different treatment: additive outliers can be interpolated, level shifts need dummy variables, innovation outliers may indicate model misspecification. The tsoutliers package in R provides automated detection and classification for time series outliers.

Calculate Instantly — 100% Free

45 statistics calculators with step-by-step solutions, interactive charts, and PDF export. No sign-up needed.

▶ Open Free Statistics Calculator

🔗 Related Resources

Data Analysis Outlier Detection Calculator → Data Analysis Descriptive Statistics Calculator → Data Analysis Five Number Summary Calculator → All Articles Browse All Statistics Articles →

Outlier Detection Methods in Statistics

Why Outliers Matter

Method 1: IQR Fence Method (Tukey's Method)

Method 2: Z-Score Method

Method 3: Modified Z-Score (Iglewicz-Hoaglin)

Method 4: Grubbs' Test

Should You Remove Outliers?

Remove when:

Keep when:

Never:

What is an Outlier?

The IQR Fence Method (Tukey's Method)

The Z-Score Method

Isolation Forest for High-Dimensional Data

DBSCAN: Density-Based Outlier Detection

What to Do With Outliers

What is an Outlier?

The IQR Fence Method (Tukey's Method)

The Z-Score Method

Isolation Forest for High-Dimensional Data

DBSCAN: Density-Based Outlier Detection

What to Do With Outliers

Complete Example: Detecting Outliers in Manufacturing Data

Outliers in Regression: Leverage and Influence

Handling Outliers in Time Series

Calculate Instantly — 100% Free

Deep Dive: Outlier Detection Methods — Theory, Assumptions, and Best Practices

Mathematical Foundation

Assumptions and Diagnostics

Interpreting Your Results Completely

Effect Size and Practical Significance

Common Errors and How to Avoid Them

When This Test Is Not Appropriate

Reporting in Academic and Professional Contexts