Missing Data
1. Why this matters
Real data has gaps. Customers leave fields blank, sensors drop readings, dates are unrecorded. Most sklearn models can't handle NaN — you must decide what to do before training:
- Drop rows → smaller dataset, biased if missingness isn't random.
- Drop columns → throws away signal.
- Impute → keeps everything; choice of imputer affects accuracy.
The right strategy depends on why values are missing (MCAR, MAR, MNAR) and how much (5%? 50%?).
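Before choosing a strategy it helps to measure how much is actually missing. The following is a minimal sketch on a toy frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan],
    "income": [52000, 61000, np.nan, 48000, 75000],
    "country": ["DE", "US", None, "US", "FR"],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that would survive a blanket dropna().
complete_rows = len(df.dropna())
print(f"complete rows: {complete_rows} of {len(df)}")
```

Here each column is only 20-40% missing, yet a blanket `dropna()` keeps just 2 of 5 rows, which is why the per-column share alone can understate the cost of dropping rows.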
2. Mental model: three flavors of missingness
| Type | Meaning | Example | Action |
|---|---|---|---|
| MCAR — Missing Completely At Random | Missingness independent of everything | Sensor randomly dropped readings | Drop or impute, both fine |
| MAR — Missing At Random | Missingness depends on OTHER observed features | Older users less likely to fill "income" | Impute, ideally model-based |
| MNAR — Missing Not At Random | Missingness depends on the missing value itself | High earners refuse to disclose income | Imputation biases; add an indicator column |
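The three mechanisms can be simulated to see how each one distorts a simple statistic. This is a sketch on synthetic data: the age/income relationship and the missingness probabilities are assumptions chosen for illustration, not values from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: income rises with age plus noise (illustrative only).
age = rng.uniform(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)

masks = {
    # MCAR: every income has the same 30% chance of being missing.
    "MCAR": rng.random(n) < 0.30,
    # MAR: missingness depends on an observed feature (age), not on income itself.
    "MAR": rng.random(n) < np.where(age > 50, 0.50, 0.10),
    # MNAR: missingness depends on the hidden value (high incomes withheld).
    "MNAR": rng.random(n) < np.where(income > np.quantile(income, 0.80), 0.70, 0.10),
}

true_mean = income.mean()
# Bias of the mean computed from the observed (non-missing) incomes only.
bias = {name: income[~mask].mean() - true_mean for name, mask in masks.items()}
for name, value in bias.items():
    print(f"{name}: observed mean minus true mean = {value:,.0f}")
```

Under MCAR the observed mean stays close to the truth. Under MAR and MNAR it drifts downward in this setup; the MAR drift can in principle be corrected by conditioning on age, while the MNAR drift cannot be recovered from the observed columns alone.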
flowchart TD
A[Column has missing values] --> P[How much missing?]
P -->|> 60%| D[Drop the column]
P -->|< 5% and MCAR| R[Drop those rows]
P -->|moderate, numeric| I1[SimpleImputer median<br/>+ MissingIndicator]
P -->|moderate, categorical| I2[SimpleImputer most_frequent<br/>or sentinel 'missing']
P -->|sophisticated| I3[KNNImputer / IterativeImputer]
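The flowchart's branches can be written down as a small function. `suggest_strategy` is a hypothetical helper, not part of pandas or sklearn, and `assume_mcar` is a judgment the caller supplies, since the data alone cannot confirm MCAR.

```python
import pandas as pd

def suggest_strategy(col: pd.Series, assume_mcar: bool = False) -> str:
    """Hypothetical helper mirroring the flowchart's thresholds (60% and 5%)."""
    share = col.isna().mean()
    if share == 0:
        return "nothing to do"
    if share > 0.60:
        return "drop column"
    if share < 0.05 and assume_mcar:
        return "drop rows"
    if pd.api.types.is_numeric_dtype(col):
        return "median + indicator"
    return "most_frequent or 'missing' sentinel"

print(suggest_strategy(pd.Series([1.0, 2.0, None, 4.0])))
```

The KNNImputer / IterativeImputer branch is left out on purpose: whether a more elaborate imputer pays off is something to check by validation score, not by a threshold.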
3. Strategy 1: Drop (rare)
Drop rows with any missing value (Complete Case Analysis):
df.dropna() # drop rows with ANY NaN
df.dropna(subset=["age", "income"]) # drop rows where age or income is NaN
df.dropna(thresh=8) # keep rows with >= 8 non-null values
Drop columns with too many missing:
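A minimal sketch of column dropping, assuming the 60% threshold used elsewhere in this section; the frame and the `fax_number` column are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, 40],
    # Hypothetical sparse column: 4 of 5 values missing.
    "fax_number": [np.nan, np.nan, np.nan, "555-0100", np.nan],
})

max_missing = 0.60  # assumed cut-off; tune per project
missing_share = df.isna().mean()

# Keep only the columns whose missing share is at or below the cut-off.
df_reduced = df.loc[:, missing_share <= max_missing]
print(df_reduced.columns.tolist())
```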
Use when:
- Missing rate < 5% AND data is MCAR.
- Column missingness > 60% AND it's not the target.

Don't use when:
- You'd lose > 10-20% of rows.
- Missingness is informative (MNAR).
4. Strategy 2: SimpleImputer
The bread-and-butter approach.
Numeric — median (robust to outliers):
from sklearn.impute import SimpleImputer
import numpy as np
num_imputer = SimpleImputer(strategy="median")
# fit on train ONLY
num_imputer.fit(X_train[["age", "income"]])
X_train[["age", "income"]] = num_imputer.transform(X_train[["age", "income"]])
X_test[["age", "income"]] = num_imputer.transform(X_test[["age", "income"]])
# Inspect learned medians
print(num_imputer.statistics_)
Strategies:
- "mean" — sensitive to outliers; OK for symmetric distributions.
- "median": a sensible default choice for numeric data, robust to outliers (SimpleImputer's own default is "mean", so set the strategy explicitly).
- "most_frequent" — works on any dtype; OK for low-cardinality categorical.
- "constant" + fill_value=0 — explicit sentinel.
Categorical — most frequent OR explicit "missing":
cat_imputer = SimpleImputer(strategy="most_frequent")
# Or, more informative — make "missing" a category of its own
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
Why a "missing" sentinel can beat "most_frequent": it preserves the signal that the value was missing — sometimes informative (MNAR).
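The difference between the two categorical options is easy to see on a tiny column. This sketch uses a made-up `plan_type`-style column stored as an object array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One categorical column with two missing entries (values are illustrative).
X = np.array([["basic"], ["pro"], [np.nan], ["basic"], [np.nan]], dtype=object)

mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
sentinel_filled = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X)

print(mode_filled.ravel().tolist())      # missing rows become "basic"
print(sentinel_filled.ravel().tolist())  # missing rows become their own category
```

After `most_frequent`, the two originally missing rows are indistinguishable from real "basic" customers; with the sentinel, a downstream encoder can give them their own column.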
5. Strategy 3: Missing-indicator
Add a binary column flagging which values were missing. Pairs well with any imputer.
from sklearn.impute import MissingIndicator
mi = MissingIndicator(features="all") # one binary col per feature
indicators = mi.fit_transform(X_train)
# Now concatenate to the imputed X
Inside a pipeline:
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator
# Stack imputed values and missingness flags side by side
num_with_indicator = FeatureUnion([
    ("impute", SimpleImputer(strategy="median")),
    ("indicator", MissingIndicator(missing_values=np.nan)),
])
Or more cleanly, use SimpleImputer(add_indicator=True):
SimpleImputer(strategy="median", add_indicator=True)
# Output has the imputed columns + one binary indicator column per feature that had missing values during fit
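To make the output layout concrete, here is a small sketch with two numeric columns where only the first contains a gap; the numbers are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],   # first column missing in this row
    [41.0, 58_000.0],
    [33.0, 71_000.0],
])

imp = SimpleImputer(strategy="median", add_indicator=True)
X_out = imp.fit_transform(X_train)

print(X_out.shape)                # two imputed columns plus one indicator column
print(X_out[1])                   # imputed row: median of 25, 41, 33 is 33
print(imp.indicator_.features_)   # indices of features that received an indicator
```

Only the first column had missing values during fit, so only one indicator column is appended. A column that is complete in training gets no flag, even if it has gaps at prediction time.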
6. Strategy 4: KNNImputer
Imputes each missing value using the average of its k nearest neighbors (in feature space).
from sklearn.impute import KNNImputer
knn_imp = KNNImputer(
    n_neighbors=5,
    weights="distance", # "uniform" or "distance"
)
X_train_imp = knn_imp.fit_transform(X_train)
X_test_imp = knn_imp.transform(X_test)
Pros: captures relationships between features; often better than a column-wise mean.
Cons: slow on big data (O(n²) distances); needs all features scaled first.

Use when:
- Mid-size dataset (< 50K rows).
- Features have inter-correlations.
- You already scaled your data.
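One way to guarantee the "scale first" requirement is to chain the scaler and the imputer. The sketch below assumes StandardScaler as the scaler (it ignores NaN when fitting and passes NaN through when transforming); the data is made up.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (age-like and income-like values).
X_train = np.array([
    [25.0, 40_000.0],
    [30.0, 52_000.0],
    [45.0, np.nan],
    [50.0, 90_000.0],
    [np.nan, 61_000.0],
])

knn_pipe = Pipeline([
    ("scale", StandardScaler()),            # NaN ignored in fit, kept in transform
    ("impute", KNNImputer(n_neighbors=2)),  # neighbors found in scaled space
])
X_scaled_imp = knn_pipe.fit_transform(X_train)

# Optional: map the result back to the original units.
X_back = knn_pipe.named_steps["scale"].inverse_transform(X_scaled_imp)
print(X_back)
```

The imputed values come out in scaled units, so the `inverse_transform` step is only needed when the filled values themselves must be reported.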
7. Strategy 5: IterativeImputer (MICE-style)
Iteratively predicts each missing feature using a model trained on the others — like a mini supervised problem per feature, repeated until convergence.
from sklearn.experimental import enable_iterative_imputer # required import
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
it_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10),
    max_iter=10,
    random_state=42,
)
X_train_imp = it_imp.fit_transform(X_train)
Pros: often the highest-quality imputation of the options here; handles multiple missing patterns.
Cons: slow; can overfit; "experimental" status (API may change).

Use when:
- Accuracy matters more than speed.
- Several columns have missing values that correlate.
- You've already exhausted simpler imputers.
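The benefit of model-based imputation shows up when features are strongly related. This sketch uses synthetic data where the second column is roughly twice the first (an assumption made for illustration) and the imputer's default estimator, then compares the error against a plain column-mean fill.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)  # x2 is almost determined by x1
X = np.column_stack([x1, x2])

# Hide about 20% of the second column.
X_missing = X.copy()
holes = rng.random(200) < 0.2
X_missing[holes, 1] = np.nan

imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp.fit_transform(X_missing)

# Mean absolute error on the hidden cells: model-based fill vs. column-mean fill.
iter_err = np.abs(X_filled[holes, 1] - X[holes, 1]).mean()
mean_err = np.abs(np.nanmean(X_missing[:, 1]) - X[holes, 1]).mean()
print(f"iterative: {iter_err:.3f}  column mean: {mean_err:.3f}")
```

When features are unrelated, the gap between the two errors shrinks, which is a reasonable argument for trying SimpleImputer first.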
8. Putting it together: production-shape
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
num_cols = ["age", "income", "tenure_months"]
cat_cols = ["country", "plan_type"]
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])
clf.fit(X_train, y_train)
Imputation now lives inside the pipeline — leak-free, refit per CV fold, saves as one artifact.
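The "refit per CV fold" and "one artifact" points can be sketched end to end. The data below is synthetic and the label rule is invented; `pickle` stands in for whatever serializer a project uses (joblib is another common choice).

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; columns and label rule are hypothetical.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "income": rng.normal(50_000, 15_000, n),
})
X["plan_type"] = pd.Series(rng.choice(["basic", "pro"], n), dtype=object)
y = (X["age"] > 45).astype(int).to_numpy()

# Punch roughly 10% holes into one numeric and one categorical column.
X.loc[rng.random(n) < 0.10, "age"] = np.nan
X.loc[rng.random(n) < 0.10, "plan_type"] = np.nan

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", num_pipe, ["age", "income"]),
    ("cat", cat_pipe, ["plan_type"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

# Each fold refits the imputers on that fold's training rows only.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.round(3))

# Imputers, encoder, scaler, and model travel together as one object.
clf.fit(X, y)
restored = pickle.loads(pickle.dumps(clf))
same_predictions = np.array_equal(restored.predict(X), clf.predict(X))
print(same_predictions)
```

Because the raw frame with NaN goes straight into `cross_val_score`, there is no step where a statistic from held-out rows could reach the imputers.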
9. Common pitfalls
- ❗ Imputing on the FULL dataset before splitting. Train+test stats leak into the imputer. Always inside a pipeline.
- ❗ Using mean on skewed/outlier-heavy data. median is safer for numeric defaults.
- ❗ most_frequent for high-cardinality categoricals. Most rows get the modal value; loses signal. Prefer a "missing" sentinel.
- ❗ KNNImputer without scaling. Distances are dominated by the largest-range feature. Scale first.
- ❗ Imputing the target. If y has NaN, drop those rows entirely; never impute the label.
- ❗ Forgetting the missing-indicator column. Sometimes missingness IS the signal (income not reported = MNAR). Use add_indicator=True or MissingIndicator.
- ❗ Imputing categorical as numeric (-1) then forgetting the encoder treats it as a real category. Either fill before encoding with a sentinel string, or use an encoder that handles unknowns.
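Two of these pitfalls (imputing the target, and fitting statistics on the full dataset) can be avoided with a short, ordered routine. The frame below is invented for illustration, and plain pandas is used so each step is visible; in practice the fill step would live inside a pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: gaps in both a feature and the label.
df = pd.DataFrame({
    "income": [40_000, np.nan, 55_000, 61_000, np.nan, 48_000, 250_000, 52_000],
    "churned": [0, 1, np.nan, 0, 1, 0, 1, np.nan],
})

# 1. Never impute the label: drop rows where the target is missing.
labeled = df.dropna(subset=["churned"])
X = labeled[["income"]]
y = labeled["churned"].astype(int)

# 2. Split BEFORE computing any fill statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# 3. Learn the median from train only, then reuse it for test.
train_median = X_train["income"].median()
X_train_f = X_train.fillna(train_median)
X_test_f = X_test.fillna(train_median)
print(len(labeled), train_median)
```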
10. When to use vs not use
| Strategy | When |
|---|---|
| Drop rows | < 5% missing AND data is MCAR. |
| Drop columns | > 60% missing AND not predictive. |
| SimpleImputer(median) | Default for numeric. Fast, simple, hard to beat. |
| SimpleImputer(constant, fill_value="missing") | Categorical: preserves missingness as a category. |
| add_indicator=True | When missingness might be informative. |
| KNNImputer | Mid-size data with correlated features; already scaled. |
| IterativeImputer | Multiple missing columns with strong inter-relationships, willing to pay compute. |
| Just leave NaN | Tree-based models like XGBoost / LightGBM accept NaN natively. Test it! |
11. Cheatsheet
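from sklearn.impute import (
    SimpleImputer,     # mean / median / most_frequent / constant
    KNNImputer,        # k-nearest neighbors average
    MissingIndicator,  # binary "was missing" mask
)
from sklearn.experimental import enable_iterative_imputer  # must precede the IterativeImputer import
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge  # estimator used in the IterativeImputer pattern below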
# Common patterns
SimpleImputer(strategy="median") # numeric default
SimpleImputer(strategy="most_frequent") # cat default
SimpleImputer(strategy="constant", fill_value="missing") # explicit sentinel
SimpleImputer(strategy="median", add_indicator=True) # + missingness flag
KNNImputer(n_neighbors=5, weights="distance")
IterativeImputer(estimator=BayesianRidge(), max_iter=10)
# Pandas quick fills (NOT for production — leakage risk)
df.fillna(df.median(numeric_only=True))
df["col"].fillna(df["col"].mode()[0])
df.ffill() # forward-fill (time series)
df.interpolate(method="linear") # interpolate
# Check missingness
df.isna().sum()
df.isna().mean() # proportion per column
df.isna().sum(axis=1).hist() # rows by # of missing values
# Tree models that handle NaN natively
import xgboost as xgb
xgb.XGBClassifier() # NaN-aware
# lightgbm.LGBMClassifier() also NaN-aware
12. Q&A: recall test
- Q: Three types of missingness?
  A: MCAR (random, unrelated to anything), MAR (depends on other observed features), MNAR (depends on the missing value itself). MNAR is the trickiest; add a missing-indicator.
- Q: Default numeric imputer?
  A: SimpleImputer(strategy="median"): robust to outliers, simple, hard to beat as a baseline.
- Q: Why a "missing" sentinel over most_frequent for categoricals?
  A: It preserves the information that the value was missing, which is sometimes predictive. most_frequent collapses missing rows into the modal class, losing signal.
- Q: Why must KNNImputer be paired with scaling?
  A: Distances are scale-sensitive. Without scaling, the largest-range feature dominates the neighbor computation; the imputation becomes near-meaningless.
- Q: What does add_indicator=True do on a SimpleImputer?
  A: Adds a binary column per imputed feature flagging which rows were originally missing, capturing the missingness signal alongside the imputed value.
- Q: Should you impute target labels?
  A: No. Drop those rows entirely. Imputing y invents labels and corrupts training.
Practice
What does this print?
Expected: 2.5
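One snippet consistent with that expected output, offered as an illustrative assumption and not as the exercise's own code: fit a median imputer on a column holding 1, 2, 3, 4 and one gap, then read the learned statistic.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical exercise data: one column, one missing value.
X = np.array([[1.0], [2.0], [np.nan], [3.0], [4.0]])

imp = SimpleImputer(strategy="median").fit(X)
learned_median = imp.statistics_[0]  # median of the observed values 1, 2, 3, 4
print(learned_median)  # 2.5
```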
Impute medians on TRAIN only, then transform test (no fit on test)
Expected: True
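A possible solution sketch for this exercise, again with made-up data: the imputer is fitted on train only, and the check confirms that the test gap was filled with the train median, not with anything learned from test.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split; the test set contains an extreme value on purpose.
X_train = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_test = np.array([[np.nan], [1000.0]])

imp = SimpleImputer(strategy="median")
imp.fit(X_train)                    # statistics come from train only
X_test_imp = imp.transform(X_test)  # transform, never fit, on test

filled_with_train_median = X_test_imp[0, 0] == np.nanmedian(X_train)
print(filled_with_train_median)  # True
```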
Quiz: Quick check
What you remember
Q1. Which imputation strategy is most robust to outliers?
- mean
- median
- zero
- mode
Why: Outliers heavily skew the mean. The median is largely unaffected. For categorical features use strategy="most_frequent" (the mode).
Q2. Should you ever IMPUTE the target variable y?
- No — drop those rows entirely
- Yes, with the mean
- Yes, with the median
- Sometimes
Why: Imputing the target invents labels. You'd be teaching the model on fabricated truth — a recipe for inflated training metrics and a broken model.
Q3. What does KNNImputer do?
- Drops k nearest neighbors
- Fills missing values using the average of the k nearest neighbors (in feature space)
- Clusters by k
- Same as SimpleImputer
Why: KNNImputer is smarter than mean/median because it uses similar rows' values. It is slow on big data; IterativeImputer is an alternative when imputation quality matters.
Common doubts
Should I always impute, or sometimes drop?
Drop rows when (a) y is missing, (b) <1% missing in a critical feature, or (c) random missingness with abundant data. Impute when (a) missingness is meaningful, (b) you'd lose too much data by dropping, or (c) the feature is critical and missing rarely.
Is 'missingness' itself a useful feature?
Yes — often. Add a boolean indicator: df["x_missing"] = df["x"].isna(). The fact that a value is missing can be predictive (lazy users, broken sensors, etc.). Then impute x separately.
How does XGBoost handle missing values natively?
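XGBoost and LightGBM handle NaN natively by learning, at each split, the best direction to send missing values. CatBoost also accepts NaN in numeric features and places them according to its nan_mode setting. No imputation is needed: just pass NaN through. This is one advantage of gradient boosting libraries over sklearn's classical models; sklearn's own HistGradientBoostingClassifier and HistGradientBoostingRegressor accept NaN as well.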