Regression Metrics¶
Once you've fit a model, how do you know if it's any good?
For regression, four metrics cover almost every case:
| Metric | What it measures | Range | Lower or higher better? |
|---|---|---|---|
| MAE | Mean Absolute Error | 0 → ∞ (same units as y) | Lower |
| MSE | Mean Squared Error | 0 → ∞ (squared units) | Lower |
| RMSE | Root MSE | 0 → ∞ (same units as y) | Lower |
| R² | Proportion of variance explained | -∞ → 1 | Higher |
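To make the table concrete, here is a minimal sketch that computes all four metrics by hand with NumPy. The arrays `y_true` and `y_pred` are made-up toy values chosen for easy arithmetic, not numbers from the housing data used later.

```python
import numpy as np

# Toy values, illustrative only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                        # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))                   # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)                      # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                             # sqrt(1.5), about 1.225

# R² = 1 - (sum of squared errors) / (sum of squared deviations from the mean)
ss_res = np.sum(errors ** 2)                    # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean is 6: 9 + 1 + 1 + 9 = 20.0
r2 = 1 - ss_res / ss_tot                        # 1 - 6/20 = 0.7

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R²={r2:.3f}")
# MAE=1.000  MSE=1.500  RMSE=1.225  R²=0.700
```

Note that MAE and RMSE come out in the units of `y`, MSE in squared units, and R² has no units at all.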
How to pick¶
- MAE: "What's the average error in real units?" More robust to outliers than RMSE.
- RMSE: Same units as y, but penalizes large errors more (because of the square). Industry default for benchmarking.
- R²: A unitless score that tops out at 1, which makes it easier to compare across targets measured on different scales. R²=1 is perfect, R²=0 means "no better than predicting the mean," and a negative value means worse than that.
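The difference between MAE and RMSE is easiest to see with a single large error. The sketch below uses two hypothetical error vectors (five errors of size 1, then the same vector with one error replaced by 10) and small helper functions defined here for illustration.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a vector of errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error of a vector of errors."""
    return np.sqrt(np.mean(np.square(errors)))

clean = np.array([1.0, -1.0, 1.0, -1.0, 1.0])          # every error has size 1
with_outlier = np.array([1.0, -1.0, 1.0, -1.0, 10.0])  # one error is now 10

print(f"clean        : MAE={mae(clean):.2f}  RMSE={rmse(clean):.2f}")
print(f"with outlier : MAE={mae(with_outlier):.2f}  RMSE={rmse(with_outlier):.2f}")
# clean        : MAE=1.00  RMSE=1.00
# with outlier : MAE=2.80  RMSE=4.56
```

One bad prediction moves MAE from 1.0 to 2.8, but RMSE from 1.0 to about 4.56, because the outlier contributes 100 to the sum of squares.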
Compute them all¶
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
)
import numpy as np
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
]).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
mae = mean_absolute_error(y_te, y_pred)
mse = mean_squared_error(y_te, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_te, y_pred)
print(f"MAE : {mae:.3f}")
print(f"MSE : {mse:.3f}")
print(f"RMSE: {rmse:.3f} (target y is in $100k, so a typical error is roughly ${rmse * 100:.0f}k)")
print(f"R² : {r2:.3f}")
Expected output (approximate): R² comes out near 0.58, which means the model explains about 58% of the variance in test-set house prices. Not bad for a simple linear model with no feature engineering beyond scaling.
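The "fraction of variance explained" reading follows from how R² relates to MSE: the unexplained fraction is the model's MSE divided by the variance of the targets being scored. A minimal sketch on made-up numbers (the arrays are hypothetical, not housing data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.0, 2.5, 3.0, 3.5, 4.0])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)   # (1 + 0.25 + 0 + 0.25 + 1) / 5 = 0.5
variance = np.var(y_true)                  # (4 + 1 + 0 + 1 + 4) / 5 = 2.0

r2_by_hand = 1 - mse / variance            # 1 - 0.5 / 2.0 = 0.75
r2_sklearn = r2_score(y_true, y_pred)

print(f"unexplained fraction: {mse / variance:.2f}")
print(f"R² by hand: {r2_by_hand:.2f}, R² from sklearn: {r2_sklearn:.2f}")
# unexplained fraction: 0.25
# R² by hand: 0.75, R² from sklearn: 0.75
```

So an R² of 0.58 says the model's squared error is about 42% of what a constant "predict the mean" guess would have on the same data.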
Residual plot — always do this¶
A scatter plot of (predicted, actual − predicted) shows you where the model is breaking. With this sign convention, a positive residual means the model predicted too low.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
X, y = fetch_california_housing(return_X_y=True, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
residuals = y_te - y_pred
# Quick summary instead of a chart (browser plotting is heavier than we need here)
print(f"Mean residual : {residuals.mean():.3f} (should be ~0)")
print(f"Std of residuals : {residuals.std():.3f}")
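# residual = actual - predicted, so the largest positive residual is the worst under-prediction
print(f"Largest under-prediction : +{residuals.max():.3f}")
print(f"Largest over-prediction  : {residuals.min():.3f}")
print("Residuals histogram (rough):")
for bucket_start in [-2, -1, 0, 1, 2]:
    # Count residuals that fall in [bucket_start, bucket_start + 1)
    count = int(((residuals >= bucket_start) & (residuals < bucket_start + 1)).sum())
    bar = "█" * (count // 60)
    print(f" [{bucket_start:+.0f}, {bucket_start+1:+.0f}) {bar} {count}")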
A roughly bell-shaped histogram centered at 0 means the model is well-behaved. A skewed shape or a long tail signals something the model isn't capturing.
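A histogram alone can hide structure, because residuals can average to zero overall and still be systematically off in some regions. One text-only stand-in for the scatter plot is to average the residuals within bins of the predicted value. The sketch below does this on synthetic data (a curved relationship fit with a straight line); the data, bin count, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=2000)
y = x ** 2 + rng.normal(0, 0.5, size=x.size)   # curved relationship plus noise

# Fit a straight line, which cannot capture the curvature
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept
residuals = y - y_pred                         # actual - predicted

# Assign each prediction to one of four equal-count bins (quartiles of y_pred)
edges = np.quantile(y_pred, [0.25, 0.5, 0.75])
bin_ids = np.digitize(y_pred, edges)           # values 0, 1, 2, 3
bin_means = [float(residuals[bin_ids == b].mean()) for b in range(4)]

print(f"Overall mean residual: {residuals.mean():+.3f}")
for b, m in enumerate(bin_means):
    print(f"  prediction quartile {b + 1}: mean residual {m:+.2f}")
```

The overall mean residual is essentially zero, yet the per-bin means are positive at both ends and negative in the middle. That U-shape is the signature of a missing nonlinear term, and it is exactly what a residual scatter plot would show.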
What you learned¶
- 4 core regression metrics: MAE, MSE, RMSE, R².
- RMSE is the industry default for benchmarking.
- R² is a relative score — "vs predicting the mean."
- Always inspect residuals — a metric number alone hides a lot.
Practice¶
What does this print?
Expected: 1.0
Use RMSE (square root of MSE) for interpretable units
Expected: True
Quiz — Quick check¶
What you remember
Q1. Which metric is most affected by outliers?
- MAE (Mean Absolute Error)
- MSE / RMSE (squared errors amplify large mistakes)
- R²
- All are equally affected
Why: MSE squares errors, so a single big mistake (e.g., 100 off) contributes 10000 to the loss while a small one (1 off) contributes 1. MAE treats all errors linearly — more robust to outliers.
Q2. When R² is negative, what does it mean?
- The model is worse than just predicting the mean of y
- A bug in sklearn
- Perfect predictions (inverted)
- Impossible
Why: R² compares your model to the "predict-the-mean" baseline. R² = 0 means equal to baseline. R² < 0 means worse: your predictions have a larger squared error than a constant guess at the mean of y would.
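A tiny example makes the negative case tangible. The three-point target and the deliberately reversed predictions below are made up for illustration.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
mean_pred = [2.0, 2.0, 2.0]        # always predict the mean of y_true
reversed_pred = [3.0, 2.0, 1.0]    # predictions trend the wrong way

r2_mean = r2_score(y_true, mean_pred)          # 1 - 2/2 = 0.0
r2_reversed = r2_score(y_true, reversed_pred)  # 1 - 8/2 = -3.0

print(f"predict-the-mean baseline: R² = {r2_mean:.1f}")
print(f"reversed predictions     : R² = {r2_reversed:.1f}")
# predict-the-mean baseline: R² = 0.0
# reversed predictions     : R² = -3.0
```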
Q3. What does RMSE tell you that MSE doesn't?
- RMSE is in the same units as y — directly interpretable (e.g., "off by $5,000")
- RMSE is more accurate
- MSE handles negatives
- No difference
Why: Same ranking of models — both will pick the same best model. But "RMSE is $5,000" is easier to communicate than "MSE is 25,000,000".
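Because the square root is monotonic, sorting models by MSE or by RMSE gives the same order. A quick check with hypothetical MSE values for three made-up models (the names and numbers are placeholders):

```python
import math

# Hypothetical test-set MSEs in squared dollars
mse_by_model = {
    "model_a": 25_000_000.0,
    "model_b": 16_000_000.0,
    "model_c": 36_000_000.0,
}
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

best_by_mse = min(mse_by_model, key=mse_by_model.get)
best_by_rmse = min(rmse_by_model, key=rmse_by_model.get)

for name in mse_by_model:
    print(f"{name}: MSE={mse_by_model[name]:,.0f}  RMSE=${rmse_by_model[name]:,.0f}")
print(f"best by MSE: {best_by_mse}, best by RMSE: {best_by_rmse}")
# model_a: MSE=25,000,000  RMSE=$5,000
# model_b: MSE=16,000,000  RMSE=$4,000
# model_c: MSE=36,000,000  RMSE=$6,000
# best by MSE: model_b, best by RMSE: model_b
```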
Common doubts¶
Which metric should I optimize for my problem?
Depends on the cost of errors. If big errors are catastrophic (e.g., predicting insurance claims): MSE/RMSE (penalizes them more). If outliers are real but should be tolerated: MAE. If the relative error matters (e.g., predicting prices across $10 to $1M): MAPE (mean absolute percentage error).
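To see why relative error can matter, consider two hypothetical items priced at $10 and $1M, each mispredicted by $5. The sketch below computes MAE and MAPE by hand with NumPy; scikit-learn also provides `mean_absolute_percentage_error` in `sklearn.metrics`, which likewise returns a fraction rather than a percentage.

```python
import numpy as np

# Hypothetical prices: one cheap item, one expensive item, both off by $5
y_true = np.array([10.0, 1_000_000.0])
y_pred = np.array([15.0, 1_000_005.0])

abs_err = np.abs(y_true - y_pred)             # [5, 5]
rel_err = abs_err / np.abs(y_true)            # [0.5, 0.000005]

mae = abs_err.mean()                          # 5.0
mape = rel_err.mean()                         # (0.5 + 0.000005) / 2

print(f"MAE : ${mae:.2f}")
print(f"MAPE: {mape:.1%}")
# MAE : $5.00
# MAPE: 25.0%
```

MAE treats the two mistakes as identical, while MAPE is dominated by the 50% miss on the cheap item. MAPE is undefined when a true value is 0, so it only suits strictly nonzero targets.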
What's adjusted R² and when does it matter?
Adjusted R² penalizes adding features that don't actually help. On the training data, regular R² for a least-squares fit never decreases when you add a feature, even a useless one, so it can keep creeping up while adjusted R² drops, which signals overfitting. For comparing models with different feature counts, use adjusted R² (or just hold out a test set).
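The usual formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. A minimal sketch follows; the helper name `adjusted_r2` and the example numbers are hypothetical, chosen only to show the penalty at work.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Hypothetical scenario: 100 samples, R² rises slightly after adding 42 features
small = adjusted_r2(r2=0.58, n_samples=100, n_features=8)
large = adjusted_r2(r2=0.60, n_samples=100, n_features=50)

print(f"8 features : R²=0.58  adjusted R²={small:.3f}")
print(f"50 features: R²=0.60  adjusted R²={large:.3f}")
# 8 features : R²=0.58  adjusted R²=0.543
# 50 features: R²=0.60  adjusted R²=0.192
```

Plain R² went up, but adjusted R² fell sharply, because the small gain does not justify 42 extra features on only 100 samples.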
Should I report R² on training or test data?
Test data: that's what counts. Training R² is optimistic because the model was fit to that very data, so it is usually higher than test R². Test R² tells you how the model generalizes. Always report test scores; training scores are mostly for diagnosing overfitting (if train >> test, you're overfitting).
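The train/test gap is easy to reproduce with a model that can memorize. The sketch below fits an unrestricted decision tree to synthetic noisy data; the data-generating function, sample size, and split are illustrative assumptions rather than anything from the housing example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=400)   # signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# No depth limit, so the tree can memorize every training point, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)   # .score() returns R² for regressors
test_r2 = tree.score(X_te, y_te)

print(f"train R²: {train_r2:.3f}")
print(f"test  R²: {test_r2:.3f}")
```

The training score is essentially perfect because the tree has memorized the noise, while the test score is far lower. Reporting only the training number would badly overstate how well the model works.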