Missing Data

1. Why this matters

Real data has gaps. Customers leave fields blank, sensors drop readings, dates go unrecorded. Most scikit-learn estimators raise an error when they see NaN, so you must decide what to do before training:

Drop rows → smaller dataset, biased if missingness isn't random.
Drop columns → throws away signal.
Impute → keeps everything; choice of imputer affects accuracy.

The right strategy depends on why values are missing (MCAR, MAR, MNAR) and how much (5%? 50%?).

2. Mental model: three flavors of missingness

| Type | Meaning | Example | Action |
| --- | --- | --- | --- |
| MCAR (Missing Completely At Random) | Missingness independent of everything | Sensor randomly dropped readings | Drop or impute, both fine |
| MAR (Missing At Random) | Missingness depends on OTHER observed features | Older users less likely to fill "income" | Impute, ideally model-based |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | High earners refuse to disclose income | Imputation biases; add an indicator column |
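To make the three mechanisms concrete, here is a small simulation sketch on synthetic data. The income model, the missingness probabilities, and the thresholds are all illustrative assumptions, not values from a real dataset. It compares the mean of the values that remain observed with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 70, size=n)
# Synthetic income that rises with age, plus noise (illustration only)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: older users (an OBSERVED feature) skip the field more often
mar_mask = rng.random(n) < np.where(age > 50, 0.6, 0.1)
# MNAR: high earners (the MISSING value itself) skip the field more often
mnar_mask = rng.random(n) < np.where(income > 80_000, 0.6, 0.1)

true_mean = income.mean()
bias = {
    "MCAR": income[~mcar_mask].mean() - true_mean,
    "MAR": income[~mar_mask].mean() - true_mean,
    "MNAR": income[~mnar_mask].mean() - true_mean,
}
for name, b in bias.items():
    print(f"{name}: observed mean is off by {b:+.0f}")
```

Under these assumptions the MCAR estimate stays close to the truth, while the MAR and MNAR estimates are pulled downward. In the MAR case the cause (age) is observed, so a model-based imputer can correct for it; in the MNAR case the cause is hidden in the missing values themselves.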

flowchart TD
    A[Column has missing values] --> P[How much missing?]
    P -->|> 60%| D[Drop the column]
    P -->|< 5% and MCAR| R[Drop those rows]
    P -->|moderate, numeric| I1[SimpleImputer median<br/>+ MissingIndicator]
    P -->|moderate, categorical| I2[SimpleImputer most_frequent<br/>or sentinel 'missing']
    P -->|sophisticated| I3[KNNImputer / IterativeImputer]
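The flowchart's rules of thumb can be written as a tiny helper. `suggest_strategy` is a hypothetical function introduced here for illustration; the 60% and 5% cut-offs are the ones from the flowchart and are guidelines, not hard limits.

```python
def suggest_strategy(missing_frac: float, is_numeric: bool, is_mcar: bool = False) -> str:
    """Hypothetical helper mirroring the flowchart's rules of thumb."""
    if missing_frac > 0.60:
        return "drop column"
    if missing_frac < 0.05 and is_mcar:
        return "drop rows"
    if is_numeric:
        return "SimpleImputer(strategy='median', add_indicator=True)"
    return "SimpleImputer(strategy='constant', fill_value='missing')"


print(suggest_strategy(0.70, is_numeric=True))
print(suggest_strategy(0.02, is_numeric=True, is_mcar=True))
print(suggest_strategy(0.20, is_numeric=False))
```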

3. Strategy 1: Drop (rare)

Drop rows with any missing value (Complete Case Analysis):

df.dropna()                          # drop rows with ANY NaN
df.dropna(subset=["age", "income"])  # drop rows where age OR income is NaN
df.dropna(thresh=8)                  # keep rows with >= 8 non-null values

Drop columns with too many missing:

missing_pct = df.isna().mean()
df = df.loc[:, missing_pct < 0.6]   # keep cols with < 60% missing

Use when:
- Missing rate < 5% AND data is MCAR.
- Column missingness > 60% AND it's not the target.

Don't use when:
- You'd lose > 10-20% of rows.
- Missingness is informative (MNAR).
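Before dropping anything, it helps to measure what you would lose. The sketch below uses a made-up six-row frame (the column names and values are assumptions) to apply the 60% column rule and then check the row loss from `dropna()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, 58, 47],
    "income": [50_000, 62_000, np.nan, 48_000, 75_000, 66_000],
    "notes": [np.nan, np.nan, np.nan, np.nan, "vip", np.nan],
})

col_missing = df.isna().mean()                 # fraction missing per column
kept = df.loc[:, col_missing < 0.6]            # drops "notes" (5 of 6 missing)
row_loss = 1 - len(kept.dropna()) / len(kept)  # fraction of rows dropna() would remove

print(col_missing.round(2).to_dict())
print(f"dropping incomplete rows would lose {row_loss:.0%} of the data")
```

Here dropping rows would discard a third of the data, well past the 10-20% guideline, so imputation is the better choice for this frame.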

4. Strategy 2: SimpleImputer

The bread-and-butter approach.

Numeric — median (robust to outliers):

from sklearn.impute import SimpleImputer
import numpy as np

num_imputer = SimpleImputer(strategy="median")
# fit on train ONLY
num_imputer.fit(X_train[["age", "income"]])
X_train[["age", "income"]] = num_imputer.transform(X_train[["age", "income"]])
X_test [["age", "income"]] = num_imputer.transform(X_test [["age", "income"]])

# Inspect learned medians
print(num_imputer.statistics_)

Strategies:
- "mean": sensitive to outliers; OK for symmetric distributions. This is SimpleImputer's own default if you pass no strategy.
- "median": the recommended choice for numeric features; robust to outliers.
- "most_frequent": works on any dtype; OK for low-cardinality categoricals.
- "constant" with fill_value=0 (or any value you choose): an explicit sentinel.
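A quick way to see the difference between "mean" and "median" is to fit both on a column with one extreme value. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One extreme value (100.0) and one missing entry
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit(X).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(X).statistics_[0]

print(mean_fill)    # 22.0: dragged upward by the single outlier
print(median_fill)  # 3.0: stays with the bulk of the data
```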

Categorical — most frequent OR explicit "missing":

cat_imputer = SimpleImputer(strategy="most_frequent")
# Or, more informative — make "missing" a category of its own
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

Why a "missing" sentinel can beat "most_frequent": it preserves the signal that the value was missing — sometimes informative (MNAR).
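The two categorical options can be compared side by side on a toy column. The `plan_type` values are assumptions; the column is built with an explicit object dtype so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

plans = pd.DataFrame({
    "plan_type": pd.Series(["basic", np.nan, "pro", "basic", np.nan], dtype=object)
})

mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(plans)
sentinel_filled = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(plans)

print(mode_filled.ravel().tolist())      # missing rows become "basic"
print(sentinel_filled.ravel().tolist())  # missing rows become "missing"
```

After mode filling, the two originally missing rows are indistinguishable from genuine "basic" customers; with the sentinel they remain a separate group that a downstream encoder can turn into its own column.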

5. Strategy 3: Missing-indicator

Add a binary column flagging which values were missing. Pairs well with any imputer.

from sklearn.impute import MissingIndicator

mi = MissingIndicator(features="all")    # one binary col per feature
indicators = mi.fit_transform(X_train)
# Now concatenate to the imputed X

Inside a pipeline:

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator

num_with_indicator = FeatureUnion([
    ("impute",    SimpleImputer(strategy="median")),
    ("indicator", MissingIndicator(missing_values=np.nan)),
])

Or more cleanly, use SimpleImputer(add_indicator=True):

SimpleImputer(strategy="median", add_indicator=True)
# Output has the imputed columns plus one binary indicator column
# for each feature that had missing values at fit time
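A small worked example shows the shape of the output. The array is made up: feature 0 has one missing value and feature 1 has none, so with the default settings only feature 0 gets an indicator column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(X)

# Columns: imputed feature 0, feature 1, "feature 0 was missing" flag
print(out)
print(imp.indicator_.features_)  # indices of the features that received a flag
```

The second row becomes `[2.0, 20.0, 1.0]`: the median of 1 and 3 fills the gap, and the trailing 1 records that the value was originally missing.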

6. Strategy 4: KNNImputer

Imputes each missing value using the average of its k nearest neighbors (in feature space).

from sklearn.impute import KNNImputer

knn_imp = KNNImputer(
    n_neighbors=5,
    weights="distance",         # "uniform" or "distance"
)
X_train_imp = knn_imp.fit_transform(X_train)
X_test_imp  = knn_imp.transform(X_test)

Pros: captures relationships between features; usually better than a column-wise mean.
Cons: slow on big data (pairwise distances grow as O(n²)); features must be on comparable scales first.

Use when:
- Mid-size dataset (< 50K rows).
- Features have inter-correlations.
- You already scaled your data.
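One way to satisfy the scaling requirement is to put a scaler in front of the imputer. The sketch below relies on StandardScaler ignoring NaN when fitting and passing NaN through on transform; the age/income values are made up.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: age in years, income in dollars (very different ranges)
X_train = np.array([
    [25.0, 40_000.0],
    [30.0, np.nan],
    [35.0, 60_000.0],
    [np.nan, 80_000.0],
    [50.0, 90_000.0],
    [28.0, 45_000.0],
])

scale_then_knn = Pipeline([
    ("scale", StandardScaler()),  # skips NaN when fitting, keeps NaN in the output
    ("impute", KNNImputer(n_neighbors=2, weights="distance")),
])
X_scaled_imp = scale_then_knn.fit_transform(X_train)

# Map back to the original units for inspection
X_imp = scale_then_knn.named_steps["scale"].inverse_transform(X_scaled_imp)
print(np.round(X_imp, 1))
```

The inverse transform is only there to make the result readable; a downstream model would normally consume the scaled output directly.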

7. Strategy 5: IterativeImputer (MICE-style)

Iteratively predicts each missing feature using a model trained on the others — like a mini supervised problem per feature, repeated until convergence.

from sklearn.experimental import enable_iterative_imputer    # required import
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

it_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10),
    max_iter=10,
    random_state=42,
)
X_train_imp = it_imp.fit_transform(X_train)

Pros: often the highest-quality imputation of the options here; handles several missingness patterns at once.
Cons: slow; can overfit; still marked experimental in scikit-learn (the API may change).

Use when:
- Accuracy matters more than speed.
- Several columns have missing values that correlate.
- You've already exhausted simpler imputers.
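To see when the extra cost pays off, the sketch below hides values in a column that is almost fully determined by another column, then compares the reconstruction error of a median fill with IterativeImputer using its default estimator. The data is synthetic and deliberately favorable to the model-based approach.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=300)  # x2 is almost determined by x1
X_true = np.column_stack([x1, x2])

# Hide about 20% of x2 completely at random
mask = rng.random(300) < 0.2
X_missing = X_true.copy()
X_missing[mask, 1] = np.nan


def rmse_on_hidden(imputer):
    """Root-mean-squared error on the entries that were hidden."""
    X_filled = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_filled[mask, 1] - X_true[mask, 1]) ** 2)))


rmse_median = rmse_on_hidden(SimpleImputer(strategy="median"))
rmse_iterative = rmse_on_hidden(IterativeImputer(max_iter=10, random_state=0))
print(f"median RMSE:    {rmse_median:.3f}")
print(f"iterative RMSE: {rmse_iterative:.3f}")
```

On real data the gap is usually smaller, and with weakly correlated features the two can perform about the same, so it is worth measuring on your own columns.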

8. Putting it together: production shape

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

num_cols = ["age", "income", "tenure_months"]
cat_cols = ["country", "plan_type"]

num_pipe = Pipeline([
    ("impute",   SimpleImputer(strategy="median", add_indicator=True)),
    ("scale",    StandardScaler()),
])
cat_pipe = Pipeline([
    ("impute",   SimpleImputer(strategy="constant", fill_value="missing")),
    ("ohe",      OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

clf = Pipeline([
    ("prep",  preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])
clf.fit(X_train, y_train)

Imputation now lives inside the pipeline — leak-free, refit per CV fold, saves as one artifact.
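As a sketch of what "refit per CV fold" looks like in practice, the block below rebuilds a reduced version of the pipeline on synthetic data and passes it to `cross_val_score`. The data, the label rule, and the 10% missing rate are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(18, 70, size=n),
    "income": rng.normal(60_000, 15_000, size=n),
    "country": pd.Series(rng.choice(["US", "DE", "IN"], size=n), dtype=object),
})
y = (X["age"] > 40).astype(int)  # synthetic label, computed before NaNs are injected

# Knock out roughly 10% of each feature at random
for col in X.columns:
    X.loc[rng.random(n) < 0.1, col] = np.nan

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["country"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the imputers on that fold's training rows only
scores = cross_val_score(clf, X, y, cv=3)
print(scores.round(2))
```

Because the whole pipeline is the estimator, the medians and one-hot categories are learned three times, once per training split, and never see the held-out rows.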

9. Common pitfalls

❗ Imputing on the FULL dataset before splitting. Test-set statistics leak into the imputer. Always impute inside a pipeline that is fitted on the training data only.
❗ Using mean on skewed/outlier-heavy data. median is safer for numeric defaults.
❗ most_frequent for high-cardinality categoricals. Every missing row gets the modal value, even though that value may cover only a small share of the data; this loses signal. Prefer a "missing" sentinel.
❗ KNNImputer without scaling. Distances are dominated by the largest-range feature. Scale first.
❗ Imputing the target. If y has NaN, drop those rows entirely — never impute the label.
❗ Forgetting the missing-indicator column. Sometimes missingness IS the signal (income not reported = MNAR). Use add_indicator=True or MissingIndicator.
❗ Imputing a categorical as a numeric code (-1) and then forgetting that the encoder treats -1 as a real category. Either fill with a sentinel string before encoding, or use an encoder that handles unknown values.

10. When to use vs not use

| Strategy | When |
| --- | --- |
| Drop rows | < 5% missing AND data is MCAR. |
| Drop columns | > 60% missing AND not predictive. |
| `SimpleImputer(median)` | Default for numeric. Fast, simple, hard to beat. |
| `SimpleImputer(constant, fill_value="missing")` | Categorical; preserves missingness as a category. |
| `add_indicator=True` | When missingness might be informative. |
| `KNNImputer` | Mid-size data with correlated features; already scaled. |
| `IterativeImputer` | Multiple missing columns with strong inter-relationships, willing to pay compute. |
| Just leave NaN | Tree-based models like XGBoost / LightGBM accept NaN natively. Test it! |

11. Cheatsheet

from sklearn.impute import (
    SimpleImputer,      # mean / median / most_frequent / constant
    KNNImputer,         # k-nearest neighbors average
    MissingIndicator,   # binary "was missing" mask
)
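from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge              # estimator used with IterativeImputer below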

# Common patterns
SimpleImputer(strategy="median")                            # numeric default
SimpleImputer(strategy="most_frequent")                     # cat default
SimpleImputer(strategy="constant", fill_value="missing")    # explicit sentinel
SimpleImputer(strategy="median", add_indicator=True)        # + missingness flag

KNNImputer(n_neighbors=5, weights="distance")
IterativeImputer(estimator=BayesianRidge(), max_iter=10)

# Pandas quick fills (NOT for production — leakage risk)
df.fillna(df.median(numeric_only=True))
df["col"].fillna(df["col"].mode()[0])
df.ffill()                                                   # forward-fill (time series)
df.interpolate(method="linear")                              # interpolate

# Check missingness
df.isna().sum()
df.isna().mean()                # proportion per column
df.isna().sum(axis=1).hist()    # rows by # of missing values

# Tree models that handle NaN natively
import xgboost as xgb
xgb.XGBClassifier()             # NaN-aware
# lightgbm.LGBMClassifier() also NaN-aware

12. Q&A: recall test

Q: Three types of missingness? A: MCAR (random, unrelated to anything), MAR (depends on other observed features), MNAR (depends on the missing value itself). MNAR is the trickiest — add a missing-indicator.
Q: Default numeric imputer? A: SimpleImputer(strategy="median") — robust to outliers, simple, hard to beat as a baseline.
Q: Why "missing" sentinel over most_frequent for categoricals? A: It preserves the information that the value was missing — sometimes predictive. most_frequent collapses missing rows into the modal class, losing signal.
Q: Why must KNNImputer be paired with scaling? A: Distances are scale-sensitive. Without scaling, the largest-range feature dominates the neighbor computation; the imputation becomes near-meaningless.
Q: What does add_indicator=True do on a SimpleImputer? A: It appends one binary column for each feature that had missing values during fit, flagging which rows were originally missing. This captures the missingness signal alongside the imputed value.
Q: Should you impute target labels? A: No. Drop those rows entirely. Imputing y invents labels and corrupts training.

Practice

What does this print?

Expected: 2.5

import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0], [2.0], [np.nan], [3.0], [4.0]])
imp = SimpleImputer(strategy="median").fit(X)
print(imp.transform([[np.nan]])[0, 0])    # median of [1, 2, 3, 4] = 2.5

Fix the leak: impute medians on TRAIN only, then transform test (no fit on test)

Expected: True

import numpy as np
from sklearn.impute import SimpleImputer
X_tr = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_te = np.array([[np.nan], [100.0]])
imp = SimpleImputer(strategy="median").fit(np.vstack([X_tr, X_te]))   # bug: leakage
print(imp.statistics_[0] < 5)
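One possible fix is sketched below. Note that the leaky version also prints True, because the median of [1, 2, 4, 100] is 3.0, so the `< 5` check alone does not reveal the leak; inspecting the learned statistic does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_tr = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_te = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy="median").fit(X_tr)  # fit on train only
X_te_imp = imp.transform(X_te)                    # reuse the train median on test

print(imp.statistics_[0])         # 2.0, the median of [1, 2, 4]
print(X_te_imp.ravel().tolist())  # [2.0, 100.0]
```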

Quiz: Quick check

What you remember

Q1. Which imputation strategy is most robust to outliers?

mean
median
zero
mode

Why: Outliers pull the mean toward them, while the median barely moves. For categorical features use strategy="most_frequent" (the mode).

Q2. Should you ever IMPUTE the target variable y?

No — drop those rows entirely
Yes, with the mean
Yes, with the median
Sometimes

Why: Imputing the target invents labels. You'd be teaching the model on fabricated truth — a recipe for inflated training metrics and a broken model.

Q3. What does KNNImputer do?

Drops k nearest neighbors
Fills missing values using the average of the k nearest neighbors (in feature space)
Clusters by k
Same as SimpleImputer

Why: KNNImputer is smarter than mean/median because it uses the values of similar rows. It is slow on big data; IterativeImputer is an alternative when imputation quality matters most.

Common doubts

Should I always impute, or sometimes drop?

Drop rows when (a) y is missing, (b) <1% missing in a critical feature, or (c) random missingness with abundant data. Impute when (a) missingness is meaningful, (b) you'd lose too much data by dropping, or (c) the feature is critical and missing rarely.

Is 'missingness' itself a useful feature?

Yes — often. Add a boolean indicator: df["x_missing"] = df["x"].isna(). The fact that a value is missing can be predictive (lazy users, broken sensors, etc.). Then impute x separately.

How does XGBoost handle missing values natively?

XGBoost and LightGBM handle NaN natively by learning, at each split, the best direction to send missing values. CatBoost also accepts NaN in numeric features; by default it treats them as smaller than every observed value. No imputation is needed: just pass NaN through. This is one advantage of gradient boosting libraries over most of sklearn's classical models.