Outliers
1. Why this matters
A handful of extreme rows can:
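- Shift the mean by about 50%: among 200 incomes near 50k, a single 5M row lifts the mean from 50k to roughly 75k, while the median barely moves.
- Make MinMaxScaler compress everyone else to a sliver.
- Lower a linear regression's R² substantially (how much depends on the data and on where the extreme rows sit).
- Distort visualizations.
Tree-based models are largely immune to extreme feature values. Linear, distance-based (KNN, K-means), and neural network models are not.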
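A minimal sketch of the first two effects, using synthetic incomes. The values and the 200-row size are assumptions chosen for illustration, not data from this guide:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 199 ordinary incomes around 50k plus one 5M outlier.
incomes = np.append(np.linspace(40_000, 60_000, 199), 5_000_000.0)

mean_without = incomes[:-1].mean()   # mean of the ordinary rows only
mean_with = incomes.mean()           # mean once the outlier is included
median_with = np.median(incomes)     # the median barely reacts

# Min-max scaling maps the outlier to 1.0 and squeezes everyone else near 0.
scaled = MinMaxScaler().fit_transform(incomes.reshape(-1, 1)).ravel()
largest_ordinary_scaled = scaled[:-1].max()

print(f"mean without outlier: {mean_without:,.0f}")
print(f"mean with outlier:    {mean_with:,.0f}")
print(f"median with outlier:  {median_with:,.0f}")
print(f"largest scaled non-outlier: {largest_ordinary_scaled:.4f}")
```

With these made-up numbers every ordinary income ends up in the bottom half-percent of the scaled range, which is the "sliver" described above.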
2. Mental model
Outliers come in three flavors:
| Type | What | Example |
|---|---|---|
| Univariate | Extreme in ONE column | age = 250 |
| Multivariate | Unusual combination | age = 5 AND salary = $200k |
| Contextual | Extreme only in context | 30 °C — fine in summer, extreme in winter |
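The contextual row can be illustrated with a small sketch. The temperature readings are invented, and the check used is the 1.5 × IQR rule covered in Method 2 of this guide; the helper name `iqr_flags` is hypothetical:

```python
import pandas as pd

# Hypothetical readings in °C; the winter reading of 30 is the planted oddity.
temps = pd.DataFrame({
    "season": ["summer"] * 8 + ["winter"] * 8,
    "temp_c": [28, 29, 30, 31, 30, 29, 32, 30,
               0, 1, 2, -1, 1, 0, 3, 30],
})

def iqr_flags(values, q1, q3, k=1.5):
    """Flag values outside the fences built from q1 and q3."""
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

col = temps["temp_c"]

# Global check: one pair of fences for the whole column.
global_flags = iqr_flags(col, col.quantile(0.25), col.quantile(0.75))

# Contextual check: separate fences per season, broadcast back to each row.
by_season = temps.groupby("season")["temp_c"]
q1 = by_season.transform(lambda s: s.quantile(0.25))
q3 = by_season.transform(lambda s: s.quantile(0.75))
contextual_flags = iqr_flags(col, q1, q3)

print("flagged globally:  ", int(global_flags.sum()))
print("flagged per season:", int(contextual_flags.sum()))
print(temps[contextual_flags])
```

On this toy data the global check sees nothing unusual, because 30 °C is a common value overall; only the per-season check singles out the winter reading.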
flowchart LR
A[Detect] --> B{Investigate}
B -->|"data error / typo"| C[Fix or remove]
B -->|"genuine extreme value"| D[Decide: keep, cap, transform]
B -->|"the most interesting case"| E["Keep — but model robustly"]
The default reflex of "drop all outliers" is usually wrong. Investigate first.
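3. Method 1: Z-score (assumes Gaussian)
For each value, how many standard deviations is it from the mean? With column mean μ and standard deviation σ, z = (x − μ) / σ.
Flag |z| > 3. For a Gaussian, 99.7% of values fall within 3 standard deviations, so only about 0.3% are flagged by chance.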
import pandas as pd
import numpy as np
from scipy import stats
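# Per-column z-scores (nan_policy="omit" ignores NaNs when computing mean/std)
z = np.abs(stats.zscore(df["income"], nan_policy="omit"))
outliers = df[z > 3]
clean = df[z <= 3]  # note: rows with NaN income have z = NaN, so they land in neither subset
# Multiple columns at once
num = df.select_dtypes("number")
z_all = np.abs(stats.zscore(num, nan_policy="omit"))
mask = (z_all <= 3).all(axis=1)  # keep rows where every numeric column is within 3 std
clean = df[mask]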
Pros: Simple, fast. Cons: Assumes ~Gaussian distribution. Bad for skewed data (mean and std are themselves outlier-sensitive!).
Use when: Column is roughly normally distributed.
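The weakness noted under Cons can be made concrete with a small sample. The sketch below uses invented numbers and plain NumPy. With n values and the population standard deviation, no single |z| can exceed sqrt(n − 1), so in a sample of nine even a gross error stays under the threshold of 3:

```python
import numpy as np

# Hypothetical sample: eight ordinary values and one gross error.
x = np.array([48, 49, 50, 51, 52, 47, 53, 50, 5000], dtype=float)

# Z-scores with the population standard deviation (ddof=0, NumPy's default).
z = np.abs((x - x.mean()) / x.std())
z_flags = z > 3

# IQR fences on the same data, for comparison.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("largest |z|:", z.max())
print("flagged by z-score:", int(z_flags.sum()))
print("flagged by IQR:    ", int(iqr_flags.sum()))
```

The extreme value inflates both the mean and the standard deviation it is measured against, so the z-score rule misses it while the IQR fences catch it.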
4. Method 2: IQR (interquartile range), the robust default
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
Anything outside [lower, upper] is an outlier.
def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
lo, hi = iqr_bounds(df["income"])
outliers = df[(df["income"] < lo) | (df["income"] > hi)]
clean = df[(df["income"] >= lo) & (df["income"] <= hi)]
Pros: Robust. It uses quartiles rather than the mean and std, so a few extreme values barely move the fences. Cons: With heavy-tailed data, can flag too many points.
Use when: Distribution is unknown or skewed (most real data). This is the recommended default.
5. Method 3: Percentile / Winsorization
Cap values at the 1st and 99th percentiles instead of removing them — preserves row count.
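def winsorize(s, lower=0.01, upper=0.99):
    lo, hi = s.quantile([lower, upper])
    return s.clip(lo, hi)
df["income_w"] = winsorize(df["income"])
Or with scipy:
# Imported under an alias so it does not shadow the winsorize() helper defined above
from scipy.stats.mstats import winsorize as scipy_winsorize
# limits are the fractions to cap at each end: 1% at the bottom, 1% at the top
df["income_w"] = scipy_winsorize(df["income"], limits=[0.01, 0.01])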
Pros: No data loss, just clipped. Cons: Creates artificial mass at the cap values — distorts distribution shape.
Use when: You want to retain all rows but limit extreme influence; common in pricing/financial data.
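The "artificial mass at the cap values" mentioned under Cons can be seen directly. A minimal sketch on an invented column of 1,000 distinct values, where capping at the 1st and 99th percentiles leaves every row in place but stacks the clipped rows on two values:

```python
import numpy as np
import pandas as pd

# Hypothetical column: 1,000 distinct values with a long right tail.
s = pd.Series(np.arange(1000, dtype=float) ** 2)

lo, hi = s.quantile([0.01, 0.99])
capped = s.clip(lo, hi)

# Rows that were clipped now sit exactly on the caps.
n_at_lower_cap = int((capped == lo).sum())
n_at_upper_cap = int((capped == hi).sum())

print("rows before / after:", len(s), len(capped))
print("rows sitting on the lower cap:", n_at_lower_cap)
print("rows sitting on the upper cap:", n_at_upper_cap)
```

A histogram of `capped` would show a spike at each end, which is the distortion to keep in mind before feeding winsorized columns to anything that relies on distribution shape.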
6. Method 4: Log / Power transform (often the real answer)
Many "outliers" are just heavy right tails. A log or Yeo-Johnson transform pulls them in:
import numpy as np
from sklearn.preprocessing import PowerTransformer
df["income_log"] = np.log1p(df["income"])  # log(1 + x): handles zeros, but requires x > -1
pt = PowerTransformer(method="yeo-johnson")  # Yeo-Johnson also accepts zero and negative values
df["income_pt"] = pt.fit_transform(df[["income"]])
After a log transform, the IQR check often flags far fewer points, sometimes none. This is often a cleaner fix than removal.
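A quick way to see this effect is to count IQR flags before and after the transform. The sketch below draws a synthetic log-normal "income" column; the distribution, its parameters, and the helper name `iqr_outlier_share` are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed incomes (log-normal by construction).
income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)

def iqr_outlier_share(x, k=1.5):
    """Fraction of values outside the Q1 - k*IQR .. Q3 + k*IQR fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return float(np.mean((x < q1 - k * iqr) | (x > q3 + k * iqr)))

share_raw = iqr_outlier_share(income)
share_log = iqr_outlier_share(np.log1p(income))

print(f"flagged on raw scale: {share_raw:.1%}")
print(f"flagged after log1p:  {share_log:.1%}")
```

On data like this the raw column has a noticeable share of flagged rows purely because of the tail, and most of those flags disappear on the log scale without dropping a single row.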
7. Multivariate detection: IsolationForest
When unusualness is about combinations, not single columns:
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=42)
iso.fit(X_train)
outlier_flag = iso.predict(X_train) == -1 # -1 = outlier
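contamination is your prior on the proportion of outliers; the model flags roughly that fraction of the training rows (1% is conservative).
Alternatives: LocalOutlierFactor (density-based) and OneClassSVM (learns a boundary around the bulk of the data). Both are unsupervised.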
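To illustrate "combinations, not single columns", the sketch below plants one row that sits inside the per-column IQR fences on both features but far from the pattern the other rows follow. The data and the helper name `within_iqr_fences` are invented for this example:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 points close to the line y = x ...
x = np.linspace(-3, 3, 200)
y = x + 0.1 * np.sin(7 * x)
# ... plus one planted row whose coordinates are individually unremarkable.
X = np.vstack([np.column_stack([x, y]), [2.5, -2.5]])

def within_iqr_fences(col, value, k=1.5):
    """True if value lies inside the column's Q1 - k*IQR .. Q3 + k*IQR range."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return bool(q1 - k * iqr <= value <= q3 + k * iqr)

planted_ok_per_column = [within_iqr_fences(X[:, j], X[-1, j]) for j in range(2)]

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
flags = iso.predict(X)           # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)    # lower score = more anomalous

print("planted row passes per-column IQR:", planted_ok_per_column)
print("planted row flag:", flags[-1])
print("planted row score vs median score:", scores[-1], np.median(scores))
```

On data shaped like this the planted row would be expected to receive one of the lowest scores, although the exact flags depend on the random trees and on the contamination setting.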
8. Code: end-to-end pattern
Robust pre-modeling cleanup:
import pandas as pd
import numpy as np
def clean_outliers_iqr(df, cols, k=1.5):
    """Drop rows outside Q1-k*IQR .. Q3+k*IQR for any specified column."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[c].between(q1 - k*iqr, q3 + k*iqr)
    return df[mask]
# 1. Inspect first
df["income"].describe()
df.boxplot(column="income")
# 2. Decide: investigate the extremes
top_outliers = df.nlargest(10, "income")
# eyeball — are these real? data errors? important customers?
# 3. Apply
clean = clean_outliers_iqr(df, cols=["income", "monthly_charges"])
# 4. Or transform instead of remove
df["income"] = np.log1p(df["income"])
Inside a pipeline (rare; usually outlier handling happens before the pipeline):
from sklearn.preprocessing import FunctionTransformer
def winsorize_array(X, lower=0.01, upper=0.99):
    # Caveat: this is stateless. The quantiles are recomputed from whatever X is
    # passed in, so test data gets capped at its own quantiles, not at bounds learned on train.
    X = np.asarray(X, dtype=float)
    lo = np.nanquantile(X, lower, axis=0)
    hi = np.nanquantile(X, upper, axis=0)
    return np.clip(X, lo, hi)
winsorizer = FunctionTransformer(winsorize_array, validate=False)
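If the caps should be learned on the training data and then reused unchanged on new data, one option is a small custom transformer. This is a sketch under that assumption; `QuantileClipper` is a hypothetical name, not a scikit-learn class:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical helper: learn per-column caps in fit, reuse them in transform."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Caps are computed once, from the data seen during fit.
        self.lo_ = np.nanquantile(X, self.lower, axis=0)
        self.hi_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # New data is clipped to the stored caps, never to its own quantiles.
        return np.clip(X, self.lo_, self.hi_)

# Toy usage: caps learned on values 0..100, then applied to unseen rows.
X_train = np.arange(101, dtype=float).reshape(-1, 1)
X_test = np.array([[-50.0], [50.0], [500.0]])

clipper = QuantileClipper().fit(X_train)
X_test_capped = clipper.transform(X_test)
print(clipper.lo_, clipper.hi_)
print(X_test_capped.ravel())
```

Because the caps live in fitted attributes, the same object can be placed in a `Pipeline` step and will behave consistently between `fit` and later `transform` or `predict` calls.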
9. Common pitfalls
- ❗ Delete-first-ask-later. A 5σ income could be a billionaire CEO or a typo — both demand investigation, not silent deletion.
- ❗ Z-score on skewed data. Mean + std are themselves dominated by outliers → z-scores become meaningless. Use IQR.
- ❗ Removing outliers from the TEST set. Production won't have that luxury. Apply cleanup logic to train only OR cap to a fixed threshold learned on train.
- ❗ Outlier removal AFTER scaling. Order matters: outlier handling → scaling, not the reverse.
- ❗ Treating every tail point as an outlier. Power-law distributions (income, network hops, file sizes) naturally have heavy tails. Log/power transform fixes them; deletion damages them.
- ❗ Outliers in the target y. Removing them is essentially refusing to predict hard cases. Often the bigger win is a robust loss (e.g., Huber for regression).
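The last pitfall mentions a robust loss as an alternative to deleting target outliers. A minimal sketch comparing ordinary least squares with scikit-learn's `HuberRegressor` on synthetic data; the true slope of 2 and the five corrupted targets are assumptions of this example:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # true slope is 2
y[-5:] += 30.0                                           # five corrupted targets
X = x.reshape(-1, 1)

# Squared loss: the corrupted targets pull the fitted line toward themselves.
ols_slope = LinearRegression().fit(X, y).coef_[0]
# Huber loss: large residuals are penalized linearly, limiting their pull.
huber_slope = HuberRegressor().fit(X, y).coef_[0]

print(f"OLS slope:   {ols_slope:.2f}")
print(f"Huber slope: {huber_slope:.2f}")
```

All 50 rows stay in the training set in both fits; only the loss changes how much the corrupted targets are allowed to influence the line.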
10. When to use vs not use
| Method | When |
|---|---|
| IQR removal | Default. Robust, simple, makes no Gaussian assumption (but can over-flag heavy-tailed columns). |
| Z-score | Approximately Gaussian columns. |
| Winsorization | Want to keep row count but cap extremes (common in finance). |
| Log / PowerTransformer | Right-skewed data — usually a cleaner fix than removing. |
| IsolationForest / LOF | Need to flag UNUSUAL COMBINATIONS, not single-column extremes. |
| Leave them alone | Tree-based models (RF, GBM, XGBoost) — they handle outliers fine. |
| Investigate, don't remove | Anytime "outliers" are < 0.5% of data — likely meaningful. |
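The "leave them alone" row for tree-based models can be checked with a small experiment. The sketch below fits the same decision tree twice on invented data, once as is and once with the largest feature value replaced by an absurdly large one. Tree splits depend on the ordering of feature values, so the fitted predictions for the training rows should not change:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Same data, but the largest feature value becomes an extreme outlier.
x_extreme = x.copy()
x_extreme[-1] = 1_000_000.0

def fitted_predictions(feature):
    """Fit a small tree on one feature and predict the training rows."""
    X = feature.reshape(-1, 1)
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
    return tree.predict(X)

pred_original = fitted_predictions(x)
pred_extreme = fitted_predictions(x_extreme)

print("max prediction difference:", np.abs(pred_original - pred_extreme).max())
```

This covers extreme feature values only. Extreme target values still shift leaf averages, which is why the pitfall about outliers in y applies to trees as well.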
11. Cheatsheet
import numpy as np
import pandas as pd
# Quick descriptions
df.describe()
df.boxplot(column="x")
df["x"].quantile([0.01, 0.05, 0.95, 0.99])
# Z-score (Gaussian assumption)
from scipy import stats
z = np.abs(stats.zscore(df["x"], nan_policy="omit"))
df_clean = df[z < 3]
# IQR (robust default)
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df["x"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)]
# Winsorization
df["x_w"] = df["x"].clip(*df["x"].quantile([0.01, 0.99]))
# Log / power transform
df["x_log"] = np.log1p(df["x"])
from sklearn.preprocessing import PowerTransformer
df["x_pt"] = PowerTransformer().fit_transform(df[["x"]])
# Multivariate (fit_predict returns -1 for outliers, 1 for inliers)
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
iso_flags = IsolationForest(contamination=0.01).fit_predict(X)
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
12. Q&A: recall test
- Q: Why is IQR more reliable than Z-score for outlier detection? A: IQR is built from quartiles, which are robust to extreme values. Z-score uses mean and std, which are themselves pulled by the outliers you're trying to detect.
- Q: When is removal NOT the right answer? A: When the "outlier" is real (a genuine extreme case) or carries the signal you care about (fraud detection thrives on outliers!). Investigate first.
- Q: A column has a heavy right tail and many "outliers" by IQR. What's a smarter move than deletion? A: Apply a log or Yeo-Johnson power transform. The tail is pulled into the normal range, IQR flags far fewer points (often none), and no data is lost.
- Q: Should you apply outlier removal to the test set? A: No. Production won't have that luxury. Either cap to a fixed bound learned on train, or accept the model's behavior on extremes.
- Q: Tree-based models and outliers? A: Generally indifferent to extreme feature values. A split on a single feature isn't affected by how far away an extreme value is; it just ends up in its own region. Don't bother removing for RF / XGBoost / LightGBM.
- Q: Difference between IsolationForest and IQR? A: IQR is per-column (univariate). IsolationForest considers feature interactions (multivariate), so it can flag a row that's normal in every column but unusual in combination.
Practice
Clip outliers to the IQR bounds (instead of removing them)
Expected: True
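One possible approach to the clipping exercise, as a sketch on an invented Series (the original exercise data is not shown here, so the values below are assumptions):

```python
import pandas as pd

# Hypothetical values with one high and one low extreme.
s = pd.Series([10, 12, 11, 13, 12, 11, 300, -200])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip instead of dropping: every row is kept, extremes are pulled to the fences.
clipped = s.clip(lower, upper)

all_within_bounds = bool(clipped.between(lower, upper).all())
print(all_within_bounds)  # True
```

The fences are computed from the original values and then applied once; recomputing them on the clipped data would give different bounds.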
Quiz: Quick check
What you remember
Q1. What does the IQR-based outlier rule define as an outlier?
- Anything more than 2 std from the mean
- Anything outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR]
- The top and bottom 1%
- Any value > 1000
Why: Tukey's fences. Robust to non-normal distributions, doesn't require assuming a Gaussian. The 1.5 factor is convention; 3 gives "extreme" outliers.
Q2. When should you DELETE outliers vs CLIP them?
- Always delete
- Delete when they're data errors (impossible values); clip when they're real but extreme
- Always clip
- Outliers shouldn't be touched
Why: A negative age is a data error → delete. A genuinely very high income is a real outlier → clip or use a robust model. Don't lose valid data; just bound it.
Q3. Which model is least sensitive to outliers?
- Linear Regression
- Logistic Regression
- Tree-based models (Random Forest, XGBoost)
- k-NN
Why: Trees split on individual feature values — outliers end up in their own leaf and don't pollute predictions for other rows. Linear models can have their coefficients dragged dramatically by even a few extreme points.
Common doubts
Should I always remove outliers?
No. Many "outliers" are real, valuable signals (fraud detection literally targets them). Remove only when you're confident they're data quality issues. For robust modeling, use models that handle outliers well (trees, RANSAC) rather than removing them upfront.
How is IsolationForest different from IQR?
IQR works per column (univariate). IsolationForest considers all features together (multivariate). A row could be normal on every individual feature but unusual in combination — IQR misses these; IsolationForest catches them.
Why does my linear regression score drop after removing outliers?
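Possibly because removing outliers also removed informative variance, or because the "outliers" were valid points that the model should learn. Instead of removal, try RobustScaler with a linear model that uses a robust loss (for example HuberRegressor); a regularized model such as Ridge limits coefficient size but still uses squared loss, so it remains sensitive to extreme targets.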