Handling Missing Data¶
Real data has gaps. Customers leave fields blank, sensors fail, joins don't match. Pandas represents missing values as NaN (Not a Number) — a special float value.
How NaN shows up¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"age": [25, np.nan, 35, None],
"city": ["Mumbai", None, "Pune", "Delhi"],
"salary": [50000, 60000, np.nan, 90000],
})
print(df)
print()
print(df.dtypes)
None and np.nan both become NaN. For numeric columns, NaN is a float so the column type bumps from int to float.
Finding missing — .isna() / .notna()¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"age": [25, np.nan, 35, None],
"salary": [50000, 60000, np.nan, 90000],
})
print("Element-wise:")
print(df.isna())
print()
print("Count per column:")
print(df.isna().sum())
print()
print("Percent missing per column:")
print((df.isna().mean() * 100).round(1))
print()
print("Any missing in each row:")
print(df.isna().any(axis=1))
print()
print("Total missing cells:", df.isna().sum().sum())
isnull() is an alias of isna(). Use whichever feels natural.
Filtering by missingness¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"age": [25, np.nan, 35, np.nan],
"salary": [50000, 60000, np.nan, 90000],
})
# Rows where age is missing
print(df[df["age"].isna()])
print()
# Rows where age is NOT missing
print(df[df["age"].notna()])
print()
# Rows missing in ANY column
print(df[df.isna().any(axis=1)])
print()
# Rows missing in ALL columns
print(df[df.isna().all(axis=1)])
Strategy 1: Drop missing — .dropna()¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"age": [25, np.nan, 35, np.nan],
"salary": [50000, np.nan, np.nan, 90000],
})
# Drop rows with ANY missing
print(df.dropna())
print()
# Drop rows missing in SPECIFIC columns
print(df.dropna(subset=["age"]))
print()
# Drop rows with ALL columns missing
print(df.dropna(how="all"))
print()
# Drop COLUMNS with any missing
print(df.dropna(axis=1))
print()
# Keep rows with at least N non-null values
print(df.dropna(thresh=3))
Use sparingly — dropping rows discards data. Often a smarter fill is better.
Strategy 2: Fill with a value — .fillna()¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"age": [25, np.nan, 35, np.nan],
"salary": [50000, np.nan, 75000, 90000],
})
# Fill ALL NaN with 0
print(df.fillna(0))
print()
# Fill different cols with different values
print(df.fillna({"age": -1, "salary": 0}))
print()
# Fill with column mean (numeric)
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df)
Strategy 3: Forward / backward fill — ffill / bfill¶
For time series, "carry the last known value forward":
import pandas as pd
import numpy as np
df = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=6),
"price": [100, np.nan, np.nan, 105, np.nan, 110],
})
print("Original:")
print(df)
print()
df["price_ffill"] = df["price"].ffill() # carry forward
df["price_bfill"] = df["price"].bfill() # carry backward
print(df)
Strategy 4: Interpolate — fill the gap smoothly¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"temp": [20, np.nan, np.nan, 23, np.nan, 27],
})
df["linear"] = df["temp"].interpolate(method="linear")
df["quadratic"] = df["temp"].interpolate(method="quadratic")
print(df)
interpolate is great for time series and any monotonic-ish numeric data.
Strategy 5: Replace specific values that mean "missing"¶
Some datasets use -1, 999, "?", "N/A" as missing-value markers. Convert them:
import pandas as pd
df = pd.DataFrame({
"age": [25, -1, 35, -1, 45],
"rating": ["good", "?", "ok", "bad", "?"],
})
# Replace specific markers with NaN
df["age"] = df["age"].replace(-1, pd.NA)
df["rating"] = df["rating"].replace("?", pd.NA)
print(df)
print()
print(df.isna().sum())
You can also pass them to read_csv directly:
Multi-column smart fill — groupby + mean¶
Fill missing salary with the average salary of the same city:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"city": ["Mumbai","Delhi","Mumbai","Delhi","Mumbai","Pune"],
"salary": [60000, 70000, np.nan, 75000, 65000, np.nan],
})
# Fill NaN with the city's mean
df["salary"] = df.groupby("city")["salary"].transform(
lambda s: s.fillna(s.mean())
)
print(df)
This is a much smarter fill than a single global mean.
Boolean fills¶
For columns with True/False/NaN:
import pandas as pd
import numpy as np
df = pd.DataFrame({"agreed": [True, False, np.nan, True, np.nan]})
# Treat NaN as False
df["agreed"] = df["agreed"].fillna(False).astype(bool)
print(df)
Counting non-missing values¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
"a": [1, 2, np.nan, 4],
"b": [np.nan, np.nan, 3, 4],
})
print("Non-null per col :")
print(df.count())
print()
print("Total non-null cells:", df.count().sum())
Aggregations skip NaN by default¶
import pandas as pd
import numpy as np
s = pd.Series([1, 2, np.nan, 4, 5])
print("sum :", s.sum()) # 12 — NaN skipped
print("mean :", s.mean()) # 3.0 — divides by 4 non-NaN values
print("count:", s.count()) # 4 (excludes NaN)
print("size :", s.size) # 5 (everything)
If you want NaN to "infect" the result, pass skipna=False:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, np.nan, 4])
print(s.sum(skipna=False)) # NaN
A complete cleaning example¶
import pandas as pd
import numpy as np
# Messy dataset
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave","Eve","Frank","Grace","Henry"],
"age": [25, np.nan, 35, None, 45, -1, 28, 40],
"city": ["Mumbai","Delhi","Mumbai","?","Delhi","Mumbai","Pune","Pune"],
"salary": [50000, 60000, np.nan, 90000, 100000, 110000, np.nan, 65000],
})
print("Before:")
print(df)
print()
print("Missing:", df.isna().sum().sum())
# 1. Normalize markers
df["city"] = df["city"].replace("?", np.nan)
df.loc[df["age"] < 0, "age"] = np.nan
# 2. Drop rows missing the name
df = df.dropna(subset=["name"])
# 3. Fill age with median
df["age"] = df["age"].fillna(df["age"].median())
# 4. Fill salary with city-wise median
df["salary"] = df.groupby("city")["salary"].transform(
lambda s: s.fillna(s.median())
)
# 5. Drop any remaining missing
df = df.dropna()
print("\nAfter:")
print(df)
print("Missing:", df.isna().sum().sum())
This is the basic pipeline that runs at the start of nearly every ML project.
Cheatsheet¶
| Task | Code |
|---|---|
| Find missing cells | df.isna() |
| Count missing per col | df.isna().sum() |
| Drop missing rows | df.dropna() |
| Drop missing cols | df.dropna(axis=1) |
| Drop only if specific cols missing | df.dropna(subset=["a"]) |
| Fill with a value | df.fillna(0) |
| Per-column fill | df.fillna({"a": 0, "b": "?"}) |
| Forward fill | df.ffill() |
| Backward fill | df.bfill() |
| Interpolate | df.interpolate() |
| Replace markers with NaN | df.replace("?", np.nan) |
| Group-wise fill | df.groupby("g")["c"].transform(lambda s: s.fillna(s.mean())) |
Common pitfalls¶
- ❗
==doesn't work with NaN —s == np.nanis always False. Uses.isna(). - ❗
fillna(0)for categorical columns —0may be a real category. Use a sentinel string like"missing". - ❗ Mean of mostly-missing column — meaningless. If > 60% missing, consider dropping the column.
- ❗ NaN in
intcolumns — Pandas auto-converts tofloat. To stay integer, use the nullableInt64dtype:df["c"].astype("Int64"). - ❗
dropna()mutates only ifinplace=True— by default it returns a new DataFrame. - ❗ Forgetting to handle NaN before ML — most sklearn estimators raise. Either fill or drop before passing to a model.
Practice¶
What does this print?
Expected: 30.0
Replace -1 (used as a missing marker) with NaN before computing the mean
Expected: 30.0
Quiz — Quick check¶
What you remember
Q1. Why does s == np.nan return False for every element?
- NumPy bug
- By IEEE float standard,
NaNcompares unequal to everything — even itself - Only for floats
- You need
np.isnan
Why: Use
s.isna()(ornp.isnan(s)) to find missing values, nevers == np.nan.
Q2. What's the difference between s.count() and s.size?
- No difference
-
.count()is non-null count;.sizeis total element count (including NaN) -
.sizeis faster -
.count()works on Series only
Why:
countis data-aware;sizeis just the length. For[1, nan, 3]:countis 2,sizeis 3.
Q3. When should you use df.fillna(df.mean()) vs df.groupby("g")["v"].transform(lambda s: s.fillna(s.mean()))?
- Always use global mean — simpler
- Group-wise fill is smarter when the groups have different distributions
- They're equivalent
- Group fill doesn't work
Why: Filling salary with the global mean ignores city differences. Filling with the per-city mean preserves group structure — usually better for ML.
Common doubts¶
Why does my int column become float after I drop missing values?
Because NumPy's int dtype can't store NaN. The moment NaN appears, the column promotes to float. Pandas 1.0+ has nullable integer types — use df["c"].astype("Int64") (capital I) to keep ints with NaN-like missing values.
Should I always fill missing data?
No. Sometimes dropping is correct (genuinely bad records). Sometimes the model handles NaN itself (XGBoost, LightGBM). Sometimes "missing" is a feature ("this user didn't fill the field" can be predictive). Decide based on the data, not on reflex.
What's the difference between fillna(0) and replace(np.nan, 0)?
They're equivalent for filling NaN. fillna has more options (method="ffill", per-column dicts). replace is more general — can replace any value with any other value.