
Outliers

1. Why this matters

A handful of extreme rows can:

  • Shift the mean by 50% (one 5M income vs 50k median).
  • Make MinMaxScaler compress everyone else to a sliver.
  • Tank a linear regression's R² by 30%.
  • Distort visualizations.

Tree-based models are largely immune. Linear, distance-based (KNN, K-means), and neural network models are not.
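To make the first two effects concrete, here is a minimal sketch on made-up numbers: 99 incomes spread evenly between 30k and 70k, plus one 5M entry. The data are purely illustrative. The scaling is done by hand with the same formula MinMaxScaler applies for its default range.

```python
import numpy as np

# Illustrative data (an assumption, not a real dataset):
# 99 incomes evenly spread over 30k..70k, plus one extreme 5M entry.
typical = np.linspace(30_000, 70_000, 99)
incomes = np.append(typical, 5_000_000.0)

mean_without = typical.mean()
mean_with = incomes.mean()
median_with = np.median(incomes)

# Min-max scaling: (x - min) / (max - min)
scaled = (incomes - incomes.min()) / (incomes.max() - incomes.min())

print(f"mean without extreme row: {mean_without:,.0f}")
print(f"mean with extreme row:    {mean_with:,.0f}")
print(f"median with extreme row:  {median_with:,.0f}")
print(f"largest scaled value among typical rows: {scaled[:-1].max():.4f}")
```

On these numbers the single extreme row roughly doubles the mean while the median barely moves, and every typical row lands below 0.01 after scaling.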

2. Mental model

Outliers come in three flavors:

| Type | What | Example |
| --- | --- | --- |
| Univariate | Extreme in ONE column | age = 250 |
| Multivariate | Unusual combination | age = 5 AND salary = $200k |
| Contextual | Extreme only in context | 30 °C: fine in summer, extreme in winter |

flowchart LR
    A[Detect] --> B{Investigate}
    B -->|"data error / typo"| C[Fix or remove]
    B -->|"genuine extreme value"| D[Decide: keep, cap, transform]
    B -->|"the most interesting case"| E["Keep — but model robustly"]

The default reflex of "drop all outliers" is usually wrong. Investigate first.
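The contextual row in the table can be sketched with a few made-up temperature readings. This is only an illustration: the `iqr_flags` helper and the data are assumptions, and the fences are the IQR rule from section 4. A single set of fences over the whole column misses the 30 °C winter reading, while fences computed per season catch it.

```python
import pandas as pd

# Made-up readings: 30 °C appears once in winter.
temps = pd.DataFrame({
    "season": ["summer"] * 6 + ["winter"] * 6,
    "temp_c": [28, 30, 31, 29, 30, 32, 2, 0, -1, 3, 1, 30],
})

def iqr_flags(s, k=1.5):
    """True where a value lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Ignoring context: one set of fences for the whole column.
global_flag = iqr_flags(temps["temp_c"])

# Using context: separate fences per season.
contextual_flag = (
    temps.groupby("season")["temp_c"].transform(iqr_flags).astype(bool)
)

print("flagged globally:", int(global_flag.sum()))
print(temps[contextual_flag])
```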

3. Method 1: Z-score (assumes Gaussian)

For each value, how many standard deviations from the mean?

z = (x - mean) / std

Flag |z| > 3: about 99.7% of a Gaussian lies within 3 standard deviations of the mean, so only ~0.3% of values should fall beyond.

import pandas as pd
import numpy as np
from scipy import stats

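# Per-column z-scores (missing values stay NaN with nan_policy="omit")
z = np.abs(stats.zscore(df["income"], nan_policy="omit"))
outliers = df[z > 3]
clean = df[~(z > 3)]                 # keeps rows with missing income; z <= 3 would silently drop them

# Multiple columns at once
num = df.select_dtypes("number")
z_all = np.abs(stats.zscore(num, nan_policy="omit"))
mask = ~(z_all > 3).any(axis=1)      # keep rows with no |z| > 3 in any numeric column
clean = df[mask]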

Pros: Simple, fast. Cons: Assumes ~Gaussian distribution. Bad for skewed data (mean and std are themselves outlier-sensitive!).

Use when: Column is roughly normally distributed.
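The outlier-sensitivity of the mean and std is easy to see on a tiny array. The sketch below uses five made-up values and compares the z-score rule with the IQR fences. With so few points, the extreme value inflates the std so much that its own z-score stays below 3.

```python
import numpy as np

a = np.array([10, 12, 11, 13, 1000], dtype=float)

# Z-score: the extreme value inflates both the mean and the std.
z = np.abs(a - a.mean()) / a.std()
z_flags = z > 3

# IQR fences for comparison (the quartiles barely move).
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
iqr_flags = (a < q1 - 1.5 * iqr) | (a > q3 + 1.5 * iqr)

print("z-score of 1000:", round(float(z[-1]), 2))
print("flagged by |z| > 3:", int(z_flags.sum()))
print("flagged by IQR:", int(iqr_flags.sum()))
```

Here the z-score of 1000 comes out at about 2, so the |z| > 3 rule flags nothing, while the IQR rule flags the one obvious value.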

4. Method 2: IQR (interquartile range) — robust default

Q1 = 25th percentile,  Q3 = 75th percentile,  IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

Anything outside [lower, upper] is an outlier.

def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lo, hi = iqr_bounds(df["income"])
outliers = df[(df["income"] < lo) | (df["income"] > hi)]
clean    = df[(df["income"] >= lo) & (df["income"] <= hi)]

Pros: Robust. Uses quartiles rather than the mean and std, so a few extreme values barely move the bounds. Cons: With heavy-tailed data, can flag too many points.

Use when: Distribution is unknown or skewed (most real data). This is the recommended default.
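The `k` argument of `iqr_bounds` controls how wide the fences are. As a rough sketch on made-up values, the conventional k=1.5 can be compared with the wider k=3 that is often used to single out only "extreme" outliers:

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: 30 is mildly high, 60 is far out.
s = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 60], dtype=float)

results = {}
for k in (1.5, 3.0):
    lo, hi = iqr_bounds(s, k=k)
    results[k] = s[(s < lo) | (s > hi)].tolist()
    print(f"k={k}: bounds=({lo:.2f}, {hi:.2f}), flagged={results[k]}")
```

With k=1.5 both 30 and 60 fall outside the fences; with k=3 only 60 does.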

5. Method 3: Percentile / Winsorization

Cap values at the 1st and 99th percentiles instead of removing them — preserves row count.

def winsorize(s, lower=0.01, upper=0.99):
    lo, hi = s.quantile([lower, upper])
    return s.clip(lo, hi)

df["income_w"] = winsorize(df["income"])

Or with scipy:

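from scipy.stats import mstats       # module import avoids shadowing the winsorize() helper above
# limits = fraction to cap in each tail (lower, upper); the result is a masked array
df["income_w"] = np.asarray(mstats.winsorize(df["income"].to_numpy(), limits=[0.01, 0.01]))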

Pros: No data loss, just clipped. Cons: Creates artificial mass at the cap values — distorts distribution shape.

Use when: You want to retain all rows but limit extreme influence; common in pricing/financial data.
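The "artificial mass at the cap values" can be seen directly. This sketch applies the pandas helper from above to a made-up column holding the values 1 to 1000, then counts how many rows end up sitting exactly on each cap:

```python
import numpy as np
import pandas as pd

def winsorize(s, lower=0.01, upper=0.99):
    lo, hi = s.quantile([lower, upper])
    return s.clip(lo, hi)

# Made-up column: the values 1..1000, each appearing once.
s = pd.Series(np.arange(1, 1001, dtype=float))
w = winsorize(s)

at_lower_cap = int((w == w.min()).sum())
at_upper_cap = int((w == w.max()).sum())

print("rows before/after:", len(s), len(w))
print("new min/max:", w.min(), w.max())
print("rows sitting exactly at the lower cap:", at_lower_cap)
print("rows sitting exactly at the upper cap:", at_upper_cap)
```

No rows are lost, but ten previously distinct values now share each cap, which is the spike that shows up in a histogram after winsorizing.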

6. Method 4: Log / Power transform (often the real answer)

Many "outliers" are just heavy right tails — a log or Yeo-Johnson transform pulls them in:

import numpy as np
from sklearn.preprocessing import PowerTransformer

df["income_log"] = np.log1p(df["income"])           # log(1+x), handles zeros

pt = PowerTransformer(method="yeo-johnson")
df["income_pt"] = pt.fit_transform(df[["income"]]).ravel()   # 2-D output flattened to one column

After a log transform, the IQR check often flags far fewer points, sometimes none. This is often a cleaner fix than removal.
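A quick way to check this on your own column is to count IQR flags before and after the transform. The sketch below does so on synthetic log-normal "income"; the distribution parameters and the `iqr_outlier_count` helper are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "income" (log-normal); parameters are arbitrary.
income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)

def iqr_outlier_count(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(((x < q1 - k * iqr) | (x > q3 + k * iqr)).sum())

raw_count = iqr_outlier_count(income)
log_count = iqr_outlier_count(np.log1p(income))

print("flagged on raw scale:", raw_count)
print("flagged after log1p: ", log_count)
```

On data like this the log-scale count should be a small fraction of the raw count, though it need not be exactly zero.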

7. Multivariate detection — IsolationForest

When unusualness is about combinations, not single columns:

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
iso.fit(X_train)
outlier_flag = iso.predict(X_train) == -1     # -1 = outlier

contamination is your prior on the proportion of outliers (1% is conservative). It sets the score threshold, so roughly that share of the training rows gets flagged.

For density-based detection, consider LocalOutlierFactor (also unsupervised). OneClassSVM is another option; it learns a boundary around the bulk of the data.
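To illustrate "unusual in combination", the sketch below plants one row in synthetic, strongly correlated data. Each coordinate of the planted row is ordinary on its own, so per-column IQR fences do not flag it. LocalOutlierFactor is used here as the multivariate detector because its behaviour on this toy case is easy to reason about; the data, the planted row, and the `outside_iqr` helper are all assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data: two strongly correlated features.
x = rng.normal(size=300)
y = x + rng.normal(scale=0.1, size=300)
cloud = np.column_stack([x, y])

# One planted row: each coordinate is ordinary on its own,
# but the combination (high x, low y) breaks the correlation.
odd_row = np.array([[2.0, -2.0]])
X = np.vstack([cloud, odd_row])

def outside_iqr(col, value, k=1.5):
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return bool(value < q1 - k * iqr or value > q3 + k * iqr)

flagged_by_column = [outside_iqr(X[:, j], X[-1, j]) for j in range(2)]

lof = LocalOutlierFactor(n_neighbors=20)
lof_flags = lof.fit_predict(X)          # -1 = outlier, 1 = inlier

print("per-column IQR flags for the planted row:", flagged_by_column)
print("LOF label for the planted row:", lof_flags[-1])
```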

8. Code — end-to-end pattern

Robust pre-modeling cleanup:

import pandas as pd
import numpy as np

def clean_outliers_iqr(df, cols, k=1.5):
    """Drop rows outside Q1-k*IQR .. Q3+k*IQR for any specified column."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[c].between(q1 - k*iqr, q3 + k*iqr)
    return df[mask]

# 1. Inspect first
df["income"].describe()
df.boxplot(column="income")

# 2. Decide: investigate the extremes
top_outliers = df.nlargest(10, "income")
# eyeball — are these real? data errors? important customers?

# 3. Apply
clean = clean_outliers_iqr(df, cols=["income", "monthly_charges"])

# 4. Or transform instead of remove
df["income"] = np.log1p(df["income"])

Inside a pipeline (rare — usually outlier handling happens before the pipeline):

from sklearn.preprocessing import FunctionTransformer

def winsorize_array(X, lower=0.01, upper=0.99):
    X = np.asarray(X, dtype=float)
    lo = np.nanquantile(X, lower, axis=0)
    hi = np.nanquantile(X, upper, axis=0)
    return np.clip(X, lo, hi)

winsorizer = FunctionTransformer(winsorize_array, validate=False)
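One caveat with a stateless FunctionTransformer: the quantiles are recomputed from whatever array is passed at transform time, so new data would be clipped against its own quantiles rather than the training ones. A possible alternative is a small stateful transformer that learns the caps in `fit` and reuses them in `transform`. The class name `QuantileClipper` is hypothetical, and this is a sketch rather than a hardened implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn per-column caps in fit, reuse them in transform."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lo_ = np.nanquantile(X, self.lower, axis=0)
        self.hi_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lo_, self.hi_)

# Toy data: caps are learned on 1..100, then applied to unseen values.
X_train = np.arange(1, 101, dtype=float).reshape(-1, 1)
X_new = np.array([[-50.0], [50.0], [10_000.0]])

clipper = QuantileClipper().fit(X_train)
clipped_new = clipper.transform(X_new).ravel()

print("learned caps:", clipper.lo_, clipper.hi_)
print("clipped new data:", clipped_new)
```

Because the caps live on the fitted object, the same bounds are applied to train, validation, and production data.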

9. Common pitfalls

  • Delete-first-ask-later. A 5σ income could be a billionaire CEO or a typo — both demand investigation, not silent deletion.
  • Z-score on skewed data. Mean + std are themselves dominated by outliers → z-scores become meaningless. Use IQR.
  • Removing outliers from the TEST set. Production won't have that luxury. Apply cleanup logic to train only OR cap to a fixed threshold learned on train.
  • Outlier removal AFTER scaling. Order matters: outlier handling → scaling, not the reverse.
  • Treating every tail point as an outlier. Power-law distributions (income, network hops, file sizes) naturally have heavy tails. Log/power transform fixes them; deletion damages them.
  • Outliers in the target y. Removing them is essentially refusing to predict hard cases. Often the bigger win is a robust loss (e.g., Huber for regression).
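The last pitfall mentions a robust loss as an alternative to deleting extreme targets. As a rough sketch, scikit-learn's HuberRegressor can be compared with ordinary least squares on synthetic data where three targets are corrupted but kept. The data-generating line y = 2x + 1 and the size of the corruption are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
# Synthetic data: y = 2x + 1 plus small noise, with three corrupted targets.
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)
y[-3:] += 300.0     # extreme values in y, kept in the data

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

print("true slope:  2.0")
print("OLS slope:  ", round(float(ols.coef_[0]), 3))
print("Huber slope:", round(float(huber.coef_[0]), 3))
```

The least-squares slope is pulled toward the corrupted points, while the Huber fit should stay much closer to the slope that generated the bulk of the data.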

10. When to use vs not use

| Method | When |
| --- | --- |
| IQR removal | Default. Robust, simple, makes no Gaussian assumption. |
| Z-score | Approximately Gaussian columns. |
| Winsorization | Want to keep row count but cap extremes (common in finance). |
| Log / PowerTransformer | Right-skewed data. Usually a cleaner fix than removing. |
| IsolationForest / LOF | Need to flag UNUSUAL COMBINATIONS, not single-column extremes. |
| Leave them alone | Tree-based models (RF, GBM, XGBoost). They handle outliers fine. |
| Investigate, don't remove | Anytime "outliers" are < 0.5% of data: likely meaningful. |

11. Cheatsheet

# Quick descriptions
df.describe()
df.boxplot(column="x")
df["x"].quantile([0.01, 0.05, 0.95, 0.99])

# Z-score (Gaussian assumption)
from scipy import stats
z = np.abs(stats.zscore(df["x"], nan_policy="omit"))
df_clean = df[z < 3]

# IQR (robust default)
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df["x"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)]

# Winsorization
df["x_w"] = df["x"].clip(*df["x"].quantile([0.01, 0.99]))

# Log / power transform
df["x_log"] = np.log1p(df["x"])
from sklearn.preprocessing import PowerTransformer
df["x_pt"] = PowerTransformer().fit_transform(df[["x"]]).ravel()

# Multivariate
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
iso_flags = IsolationForest(contamination=0.01, random_state=42).fit_predict(X)   # -1 = outlier
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(X)                      # -1 = outlier

12. Q&A — recall test

  • Q: Why is IQR more reliable than Z-score for outlier detection? A: IQR is built from quartiles, which are robust to extreme values. Z-score uses mean and std, which are themselves pulled by the outliers you're trying to detect.

  • Q: When is removal NOT the right answer? A: When the "outlier" is real (a genuine extreme case) or carries the signal you care about (fraud detection thrives on outliers!). Investigate first.

  • Q: A column has a heavy right tail and many "outliers" by IQR. What's a smarter move than deletion? A: Apply a log or Yeo-Johnson power transform. The tail is pulled into the normal range, the IQR rule flags far fewer points (often none), and no data is lost.

  • Q: Should you apply outlier removal to the test set? A: No — production won't have that luxury. Either cap to a fixed bound learned on train, or accept the model's behavior on extremes.

  • Q: Tree-based models and outliers? A: Generally indifferent to extreme feature values. A split on a single feature depends on the ordering of values, not on how far out an extreme value is, so that value just ends up in its own region. Don't bother removing for RF / XGBoost / LightGBM.

  • Q: Difference between IsolationForest and IQR? A: IQR is per-column (univariate). IsolationForest considers feature interactions (multivariate) — can flag a row that's normal in every column but unusual in combination.
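The tree answer above can be checked with a small experiment. The sketch fits the same shallow decision tree twice on made-up data, once with a mildly extreme feature value and once with a wildly extreme one. Because only the ordering of the feature matters for where a split can go, the predictions should not change.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up feature with one extreme value in the last position.
x_mild = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
x_wild = x_mild.copy()
x_wild[-1] = 1_000_000.0            # same ordering, far more extreme
y = np.array([1, 1, 1, 1, 1, 5, 5, 5, 5, 5], dtype=float)

tree_mild = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_wild = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_mild.fit(x_mild.reshape(-1, 1), y)
tree_wild.fit(x_wild.reshape(-1, 1), y)

pred_mild = tree_mild.predict(x_mild.reshape(-1, 1))
pred_wild = tree_wild.predict(x_wild.reshape(-1, 1))

print(pred_mild)
print(pred_wild)
print("same predictions:", np.array_equal(pred_mild, pred_wild))
```

This only covers extreme feature values; extreme targets can still move leaf averages.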

Practice

What does this print?

Expected: 1

import numpy as np
a = np.array([10, 12, 11, 13, 1000])
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
print(((a < q1 - 1.5*iqr) | (a > q3 + 1.5*iqr)).sum())   # 1000 is an outlier

Fix the bug: clip outliers to the IQR bounds (instead of removing them)

Expected: True

import numpy as np
a = np.array([10, 12, 11, 13, 1000])
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
clipped = a                          # bug: not actually clipping — use np.clip
print(clipped.max() <= hi)

Quiz — Quick check

What you remember

Q1. What does the IQR-based outlier rule define as an outlier?

  • Anything more than 2 std from the mean
  • Anything outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR]
  • The top and bottom 1%
  • Any value > 1000

Why: Tukey's fences. Robust to non-normal distributions, doesn't require assuming a Gaussian. The 1.5 factor is convention; 3 gives "extreme" outliers.

Q2. When should you DELETE outliers vs CLIP them?

  • Always delete
  • Delete when they're data errors (impossible values); clip when they're real but extreme
  • Always clip
  • Outliers shouldn't be touched

Why: A negative age is a data error → delete. A genuinely very high income is a real outlier → clip or use a robust model. Don't lose valid data; just bound it.

Q3. Which model is least sensitive to outliers?

  • Linear Regression
  • Logistic Regression
  • Tree-based models (Random Forest, XGBoost)
  • k-NN

Why: Trees split on individual feature values — outliers end up in their own leaf and don't pollute predictions for other rows. Linear models can have their coefficients dragged dramatically by even a few extreme points.

Common doubts

Should I always remove outliers?

No. Many "outliers" are real, valuable signals (fraud detection literally targets them). Remove only when you're confident they're data quality issues. For robust modeling, use models that handle outliers well (trees, RANSAC) rather than removing them upfront.

How is IsolationForest different from IQR?

IQR works per column (univariate). IsolationForest considers all features together (multivariate). A row could be normal on every individual feature but unusual in combination — IQR misses these; IsolationForest catches them.

Why does my linear regression score drop after removing outliers?

Possibly because removing outliers also removed informative variance, or because the "outliers" were valid points that the model should learn. Try RobustScaler + a regularized linear model (Ridge) instead of removal.