Handling Missing Data¶

Real data has gaps. Customers leave fields blank, sensors fail, joins don't match. Pandas represents missing values as NaN (Not a Number) — a special float value.

How NaN shows up¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave"],
    "age":    [25, np.nan, 35, None],
    "city":   ["Mumbai", None, "Pune", "Delhi"],
    "salary": [50000, 60000, np.nan, 90000],
})

print(df)
print()
print(df.dtypes)

None and np.nan both become NaN. For numeric columns, NaN is a float so the column type bumps from int to float.

Finding missing — `.isna()` / `.notna()`¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave"],
    "age":    [25, np.nan, 35, None],
    "salary": [50000, 60000, np.nan, 90000],
})

print("Element-wise:")
print(df.isna())
print()

print("Count per column:")
print(df.isna().sum())
print()

print("Percent missing per column:")
print((df.isna().mean() * 100).round(1))
print()

print("Any missing in each row:")
print(df.isna().any(axis=1))
print()

print("Total missing cells:", df.isna().sum().sum())

isnull() is an alias of isna(). Use whichever feels natural.

Filtering by missingness¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave"],
    "age":    [25, np.nan, 35, np.nan],
    "salary": [50000, 60000, np.nan, 90000],
})

# Rows where age is missing
print(df[df["age"].isna()])
print()

# Rows where age is NOT missing
print(df[df["age"].notna()])
print()

# Rows missing in ANY column
print(df[df.isna().any(axis=1)])
print()

# Rows missing in ALL columns
print(df[df.isna().all(axis=1)])

Strategy 1: Drop missing — `.dropna()`¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave"],
    "age":    [25, np.nan, 35, np.nan],
    "salary": [50000, np.nan, np.nan, 90000],
})

# Drop rows with ANY missing
print(df.dropna())
print()

# Drop rows missing in SPECIFIC columns
print(df.dropna(subset=["age"]))
print()

# Drop rows with ALL columns missing
print(df.dropna(how="all"))
print()

# Drop COLUMNS with any missing
print(df.dropna(axis=1))
print()

# Keep rows with at least N non-null values
print(df.dropna(thresh=3))

Use sparingly — dropping rows discards data. Often a smarter fill is better.

Strategy 2: Fill with a value — `.fillna()`¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave"],
    "age":    [25, np.nan, 35, np.nan],
    "salary": [50000, np.nan, 75000, 90000],
})

# Fill ALL NaN with 0
print(df.fillna(0))
print()

# Fill different cols with different values
print(df.fillna({"age": -1, "salary": 0}))
print()

# Fill with column mean (numeric)
df["age"]    = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df)

Strategy 3: Forward / backward fill — `ffill` / `bfill`¶

For time series, "carry the last known value forward":

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date":  pd.date_range("2025-01-01", periods=6),
    "price": [100, np.nan, np.nan, 105, np.nan, 110],
})

print("Original:")
print(df)
print()

df["price_ffill"] = df["price"].ffill()      # carry forward
df["price_bfill"] = df["price"].bfill()      # carry backward
print(df)

Strategy 4: Interpolate — fill the gap smoothly¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "temp": [20, np.nan, np.nan, 23, np.nan, 27],
})

df["linear"]    = df["temp"].interpolate(method="linear")
df["quadratic"] = df["temp"].interpolate(method="quadratic")
print(df)

interpolate is great for time series and any monotonic-ish numeric data.

Strategy 5: Replace specific values that mean "missing"¶

Some datasets use -1, 999, "?", "N/A" as missing-value markers. Convert them:

import pandas as pd

df = pd.DataFrame({
    "age":  [25, -1, 35, -1, 45],
    "rating": ["good", "?", "ok", "bad", "?"],
})

# Replace specific markers with NaN
df["age"]    = df["age"].replace(-1, pd.NA)
df["rating"] = df["rating"].replace("?", pd.NA)
print(df)
print()
print(df.isna().sum())

You can also pass them to read_csv directly:

df = pd.read_csv("data.csv", na_values=["?", "N/A", "-1", "9999"])

Multi-column smart fill — `groupby` + mean¶

Fill missing salary with the average salary of the same city:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "city":   ["Mumbai","Delhi","Mumbai","Delhi","Mumbai","Pune"],
    "salary": [60000, 70000, np.nan, 75000, 65000, np.nan],
})

# Fill NaN with the city's mean
df["salary"] = df.groupby("city")["salary"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)

This is a much smarter fill than a single global mean.

Boolean fills¶

For columns with True/False/NaN:

import pandas as pd
import numpy as np

df = pd.DataFrame({"agreed": [True, False, np.nan, True, np.nan]})

# Treat NaN as False
df["agreed"] = df["agreed"].fillna(False).astype(bool)
print(df)

Counting non-missing values¶

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [np.nan, np.nan, 3, 4],
})

print("Non-null per col :")
print(df.count())
print()

print("Total non-null cells:", df.count().sum())

Aggregations skip NaN by default¶

import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, 4, 5])

print("sum  :", s.sum())     # 12 — NaN skipped
print("mean :", s.mean())     # 3.0 — divides by 4 non-NaN values
print("count:", s.count())    # 4 (excludes NaN)
print("size :", s.size)       # 5 (everything)

If you want NaN to "infect" the result, pass skipna=False:

import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, 4])
print(s.sum(skipna=False))    # NaN

A complete cleaning example¶

import pandas as pd
import numpy as np

# Messy dataset
df = pd.DataFrame({
    "name":   ["Alice","Bob","Carol","Dave","Eve","Frank","Grace","Henry"],
    "age":    [25, np.nan, 35, None, 45, -1, 28, 40],
    "city":   ["Mumbai","Delhi","Mumbai","?","Delhi","Mumbai","Pune","Pune"],
    "salary": [50000, 60000, np.nan, 90000, 100000, 110000, np.nan, 65000],
})

print("Before:")
print(df)
print()
print("Missing:", df.isna().sum().sum())

# 1. Normalize markers
df["city"] = df["city"].replace("?", np.nan)
df.loc[df["age"] < 0, "age"] = np.nan

# 2. Drop rows missing the name
df = df.dropna(subset=["name"])

# 3. Fill age with median
df["age"] = df["age"].fillna(df["age"].median())

# 4. Fill salary with city-wise median
df["salary"] = df.groupby("city")["salary"].transform(
    lambda s: s.fillna(s.median())
)

# 5. Drop any remaining missing
df = df.dropna()

print("\nAfter:")
print(df)
print("Missing:", df.isna().sum().sum())

This is the basic pipeline that runs at the start of nearly every ML project.

Cheatsheet¶

Task	Code
Find missing cells	`df.isna()`
Count missing per col	`df.isna().sum()`
Drop missing rows	`df.dropna()`
Drop missing cols	`df.dropna(axis=1)`
Drop only if specific cols missing	`df.dropna(subset=["a"])`
Fill with a value	`df.fillna(0)`
Per-column fill	`df.fillna({"a": 0, "b": "?"})`
Forward fill	`df.ffill()`
Backward fill	`df.bfill()`
Interpolate	`df.interpolate()`
Replace markers with NaN	`df.replace("?", np.nan)`
Group-wise fill	`df.groupby("g")["c"].transform(lambda s: s.fillna(s.mean()))`

Common pitfalls¶

❗ == doesn't work with NaN — s == np.nan is always False. Use s.isna().
❗ fillna(0) for categorical columns — 0 may be a real category. Use a sentinel string like "missing".
❗ Mean of mostly-missing column — meaningless. If > 60% missing, consider dropping the column.
❗ NaN in int columns — Pandas auto-converts to float. To stay integer, use the nullable Int64 dtype: df["c"].astype("Int64").
❗ dropna() mutates only if inplace=True — by default it returns a new DataFrame.
❗ Forgetting to handle NaN before ML — most sklearn estimators raise. Either fill or drop before passing to a model.

Practice¶

What does this print?

Expected: 30.0

import pandas as pd
import numpy as np
s = pd.Series([20, np.nan, 40])
print(s.mean())          # mean skips NaN by default

Replace -1 (used as a missing marker) with NaN before computing the mean

Expected: 30.0

import pandas as pd
import numpy as np
s = pd.Series([20, -1, 40])
print(s.mean())          # bug: -1 is mixed into the mean — convert to NaN first

Quiz — Quick check¶

What you remember

Q1. Why does s == np.nan return False for every element?

NumPy bug
By IEEE float standard, NaN compares unequal to everything — even itself
Only for floats
You need np.isnan

Why: Use s.isna() (or np.isnan(s)) to find missing values, never s == np.nan.

Q2. What's the difference between s.count() and s.size?

No difference
.count() is non-null count; .size is total element count (including NaN)
.size is faster
.count() works on Series only

Why: count is data-aware; size is just the length. For [1, nan, 3]: count is 2, size is 3.

Q3. When should you use df.fillna(df.mean()) vs df.groupby("g")["v"].transform(lambda s: s.fillna(s.mean()))?

Always use global mean — simpler
Group-wise fill is smarter when the groups have different distributions
They're equivalent
Group fill doesn't work

Why: Filling salary with the global mean ignores city differences. Filling with the per-city mean preserves group structure — usually better for ML.

Common doubts¶

Why does my int column become float after I drop missing values?

Because NumPy's int dtype can't store NaN. The moment NaN appears, the column promotes to float. Pandas 1.0+ has nullable integer types — use df["c"].astype("Int64") (capital I) to keep ints with NaN-like missing values.

Should I always fill missing data?

No. Sometimes dropping is correct (genuinely bad records). Sometimes the model handles NaN itself (XGBoost, LightGBM). Sometimes "missing" is a feature ("this user didn't fill the field" can be predictive). Decide based on the data, not on reflex.

What's the difference between fillna(0) and replace(np.nan, 0)?

They're equivalent for filling NaN. fillna has more options (method="ffill", per-column dicts). replace is more general — can replace any value with any other value.

What's next¶

→ Sorting & Ranking