Regression Metrics¶

Once you've fit a model, how do you know if it's any good?

For regression, four metrics cover almost every case:

Metric	What it measures	Range	Lower or higher better?
MAE	Mean Absolute Error	`0 → ∞` (same units as y)	Lower
MSE	Mean Squared Error	`0 → ∞` (squared units)	Lower
RMSE	Root MSE	`0 → ∞` (same units as y)	Lower
R²	Proportion of variance explained	`-∞ → 1`	Higher

How to pick¶

MAE: "What's the average error in real units?" Less sensitive to outliers than MSE or RMSE.
RMSE: Same units as y, but penalizes large errors more (because the errors are squared before averaging). A common default for benchmarking.
R²: A unitless score that compares the model against always predicting the mean. R²=1 is perfect, R²=0 means "no better than predicting the mean," and negative values mean worse than that baseline.
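To make the definitions concrete, here is a minimal sketch that computes all four metrics in plain Python on a small made-up example. The helper names (`mae`, `mse`, `rmse`, `r2`) and the toy numbers are illustrative assumptions, not scikit-learn functions; in practice you would use the `sklearn.metrics` functions used in the next section.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, back in the units of y."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """R^2: 1 - (squared error of the model) / (squared error of predicting the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy data (made up for illustration)
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 10.0]

print(f"MAE : {mae(y_true, y_pred):.3f}")   # 1.000
print(f"MSE : {mse(y_true, y_pred):.3f}")   # 1.500
print(f"RMSE: {rmse(y_true, y_pred):.3f}")  # 1.225
print(f"R²  : {r2(y_true, y_pred):.3f}")    # 0.700
```

Note that RMSE (1.225) is larger than MAE (1.000) here because the single error of 2 is weighted more heavily once it is squared.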

Compute them all¶

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
)
import numpy as np

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = Pipeline([
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
]).fit(X_tr, y_tr)

y_pred = model.predict(X_te)

mae  = mean_absolute_error(y_te, y_pred)
mse  = mean_squared_error(y_te, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_te, y_pred)

print(f"MAE : {mae:.3f}")
print(f"MSE : {mse:.3f}")
print(f"RMSE: {rmse:.3f}")  # y is in units of $100k, so an RMSE near 0.75 is a typical error of roughly $75k
print(f"R²  : {r2:.3f}")

Expected output (approximate):

MAE : 0.533
MSE : 0.556
RMSE: 0.746
R²  : 0.576

R² of 0.58 means the model explains about 58% of the variance in median house values on the test set. Not bad for a simple linear model with no feature engineering.

Residual plot — always do this¶

A scatter plot of (predicted, actual − predicted) shows you where the model is breaking.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

X, y = fetch_california_housing(return_X_y=True, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
residuals = y_te - y_pred  # actual − predicted: positive means the model predicted too low

# Quick summary instead of a chart (browser plotting is heavier than we need here)
print(f"Mean residual    : {residuals.mean():.3f}  (should be ~0)")
print(f"Std of residuals : {residuals.std():.3f}")
print(f"Max undershoot   : {residuals.max():+.3f}  (largest under-prediction)")
print(f"Max overshoot    : {residuals.min():+.3f}  (largest over-prediction)")
print("Residuals histogram (rough; residuals outside [-2, 3) are not shown):")
for bucket_start in [-2, -1, 0, 1, 2]:
    count = ((residuals >= bucket_start) & (residuals < bucket_start + 1)).sum()
    bar = "█" * (count // 60)
    print(f"  [{bucket_start:+.0f}, {bucket_start+1:+.0f}) {bar} {count}")

A roughly bell-shaped histogram centered at 0 means the model is well-behaved. A skewed shape or a long tail signals something the model isn't capturing.
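One rough way to put a number on "skewed" is the sample skewness of the residuals: about 0 for a symmetric distribution, and positive when there is a long tail of large positive residuals. The sketch below uses plain Python and two made-up residual lists; the `skewness` helper is a hypothetical name used for illustration only.

```python
def skewness(values):
    """Sample skewness: third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (second central moment)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up residuals for illustration
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]    # balanced around 0
long_tail = [-1.0, -1.0, -1.0, -1.0, 4.0]  # one large positive residual

print(f"{skewness(symmetric):.2f}")  # 0.00
print(f"{skewness(long_tail):.2f}")  # 1.50
```

Both lists have a mean residual of 0, so the mean alone would not reveal the difference; the skewness does.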

What you learned¶

4 core regression metrics: MAE, MSE, RMSE, R².
RMSE is a common default for benchmarking.
R² is a relative score — "vs predicting the mean."
Always inspect residuals — a metric number alone hides a lot.

Practice¶

What does this print?

Expected: 1.0

from sklearn.metrics import r2_score
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 3, 4, 5]      # perfect predictions
print(r2_score(y_true, y_pred))

Use RMSE (square root of MSE) for interpretable units

Expected: True

import numpy as np
from sklearn.metrics import mean_squared_error
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 7.0, 8.5]
rmse = mean_squared_error(y_true, y_pred)      # bug: returns MSE, not RMSE
print(abs(rmse - 0.433) < 0.01)

Quiz — Quick check¶

What you remember

Q1. Which metric is most affected by outliers?

MAE (Mean Absolute Error)
MSE / RMSE (squared errors amplify large mistakes)
R²
All are equally affected

Why: MSE squares errors, so a single big mistake (e.g., 100 off) contributes 10000 to the loss while a small one (1 off) contributes 1. MAE treats all errors linearly — more robust to outliers.
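A small numeric sketch of this effect, assuming ten made-up errors where one is replaced by a large outlier. The helper names `mae_of` and `rmse_of` are hypothetical and operate directly on a list of errors.

```python
import math

def mae_of(errors):
    """Mean absolute error computed from a list of errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_of(errors):
    """Root mean squared error computed from a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

clean = [1.0] * 10                # every prediction is off by 1
with_outlier = [1.0] * 9 + [100.0]  # one prediction is off by 100

print(f"clean        MAE={mae_of(clean):.2f}  RMSE={rmse_of(clean):.2f}")
# clean        MAE=1.00  RMSE=1.00
print(f"with outlier MAE={mae_of(with_outlier):.2f}  RMSE={rmse_of(with_outlier):.2f}")
# with outlier MAE=10.90  RMSE=31.64
```

The single outlier raises MAE about 11-fold but RMSE about 32-fold.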

Q2. When R² is negative, what does it mean?

The model is worse than just predicting the mean of y
A bug in sklearn
Perfect predictions (inverted)
Impossible

Why: R² compares your model to the "predict-the-mean" baseline. R² = 0 means equal to the baseline. R² < 0 means worse: always predicting the mean of y would have given a smaller squared error than your model.
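As a quick illustration, the sketch below computes R² by hand for two sets of made-up predictions: one that always predicts the mean, and one that gets the trend backwards. The `r2` helper is a hypothetical plain-Python stand-in for the R² formula, not a library function.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_baseline = [3.0] * 5                # always predict the mean of y_true
reversed_trend = [5.0, 4.0, 3.0, 2.0, 1.0]  # predicts the opposite trend

print(f"{r2(y_true, mean_baseline):.3f}")   # 0.000
print(f"{r2(y_true, reversed_trend):.3f}")  # -3.000
```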

Q3. What does RMSE tell you that MSE doesn't?

RMSE is in the same units as y — directly interpretable (e.g., "off by $5,000")
RMSE is more accurate
MSE handles negatives
No difference

Why: Same ranking of models — both will pick the same best model. But "RMSE is $5,000" is easier to communicate than "MSE is 25,000,000".

Common doubts¶

Which metric should I optimize for my problem?

Depends on the cost of errors. If big errors are catastrophic (e.g., predicting insurance claims), use MSE/RMSE, which penalize them more. If outliers are real but should be tolerated, use MAE. If the relative error matters (e.g., predicting prices across $10 to $1M), use MAPE (mean absolute percentage error), keeping in mind that it is undefined when a true value is 0.
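The sketch below shows why relative error can matter, using two made-up prices that are each off by the same $5. The `mape` helper is a hypothetical plain-Python version written for illustration (recent scikit-learn releases also ship `sklearn.metrics.mean_absolute_percentage_error`); it assumes no true value is zero.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.25 means 25%)."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is 0")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices: a $10 item and a $1M item, each mispredicted by $5
y_true = [10.0, 1_000_000.0]
y_pred = [15.0, 1_000_005.0]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
print(abs_errors)                      # [5.0, 5.0]
print(f"{mape(y_true, y_pred):.2%}")   # 25.00%
```

MAE treats both mistakes identically, while MAPE is dominated by the 50% miss on the $10 item.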

What's adjusted R² and when does it matter?

Adjusted R² penalizes adding features that don't actually help. On the training data, regular R² never decreases when you add a feature to a linear model, so it can keep creeping up while adjusted R² drops, which signals that the extra features are fitting noise. For comparing models with different feature counts, use adjusted R² (or just hold out a test set).
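The usual formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. A minimal sketch follows; the `adjusted_r2` helper and the example numbers are assumptions for illustration, not a scikit-learn function.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R^2 of 0.80 on 50 samples, reached with 1 feature vs. 10 features
print(f"{adjusted_r2(0.80, n_samples=50, n_features=1):.3f}")   # 0.796
print(f"{adjusted_r2(0.80, n_samples=50, n_features=10):.3f}")  # 0.749
```

The model that needed ten features to reach the same R² gets the lower adjusted score.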

Should I report R² on training or test data?

Test data: that is the number that counts. Training R² tends to look optimistic because the model was fit to that data. Test R² tells you how the model generalizes. Always report test scores; training scores are mostly for diagnosing overfitting (if the training score is much higher than the test score, you're overfitting).
