Worked Example — California Housing¶
Time to put it all together. We'll predict California house prices using all the features in one go.
The dataset has 20,640 districts, each with 8 numeric features (median income, house age, average rooms per household, latitude, longitude, etc.) and a target: the median house value, in units of $100,000.
End-to-end pipeline¶
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
# 1. Load
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print("Features:", list(X.columns))
# 2. Split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {X_tr.shape[0]} | Test: {X_te.shape[0]}")
# 3. Pipeline: scale + linear regression
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
])
# 4. Cross-validated R² on the training set (5 folds)
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="r2")
print(f"\nCV R² mean: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")
# 5. Final fit on full training set, evaluate on held-out test
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print(f"Test MAE: {mean_absolute_error(y_te, y_pred):.3f} (in $100k)")
print(f"Test R² : {r2_score(y_te, y_pred):.3f}")
Expected output (approximate):
Dataset: 20640 samples, 8 features
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude']
Train: 16512 | Test: 4128
CV R² mean: 0.598 (±0.013)
Test MAE: 0.533 (in $100k)
Test R² : 0.576
Inspect the coefficients¶
Which features matter most?
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X_tr, y_tr)
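coefs = pd.Series(pipe.named_steps["lr"].coef_, index=X.columns)
# Sort by magnitude but keep the sign, so both importance and direction are visible
order = coefs.abs().sort_values(ascending=False).index
print(coefs.reindex(order).round(3))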
You'll see MedInc (median income) and Latitude/Longitude (location) dominate. That matches intuition: house prices are mostly driven by income and location.
What did the model learn?¶
Some quick takeaways from the coefficients:
| Feature | Coef sign | Interpretation |
|---|---|---|
| MedInc | ➕ | Wealthier areas → higher prices. |
| Latitude | ➖ | Higher latitude (further north in CA) → lower prices. |
| AveRooms | ➖ | More rooms → lower prices? Counter-intuitive: confounded with rural areas. |
When coefficients are surprising, that's where feature engineering (and sometimes non-linear models) help. A simple linear model can't fully untangle confounders.
What you learned¶
- Full pipeline: split → scale + model → cross-validate → test.
- cross_val_score(model, X, y, cv=5, scoring=...) gives an honest estimate without touching test.
- Coefficients tell you which features the model relies on.
- 0.58 R² on raw features is a reasonable baseline. Tree-based models would do better (see Ensembles).
Practice¶
What does this print?
Expected: True
Use a train/test split for evaluation, not the whole dataset
Expected: True
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True)
model = LinearRegression().fit(X, y)
train_score = model.score(X, y) # bug: testing on the same data we trained on
print(train_score > 0.5)
Quiz — Quick check¶
What you remember
Q1. What's the right order of steps in an ML pipeline?
- Train → split → evaluate
- Split → fit preprocessing on train → train model → evaluate on test
- Evaluate first
- Order doesn't matter
Why: Splitting first ensures the test set is truly unseen. Fitting preprocessing on train only prevents data leakage. Evaluate on the held-out test ONCE.
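The order in Q1 can be sketched on synthetic data. This is a minimal illustration, not the California housing run: the array shapes, coefficients, and seeds are assumptions chosen for the example. It fits the scaler on the training rows only, then checks that a Pipeline gives the same test score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1. Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Fit preprocessing on the training rows only.
scaler = StandardScaler().fit(X_tr)

# 3. Train the model on the scaled training rows.
model = LinearRegression().fit(scaler.transform(X_tr), y_tr)

# 4. Evaluate once on the test rows, reusing the train-fitted scaler.
manual_r2 = model.score(scaler.transform(X_te), y_te)

# A Pipeline performs steps 2-4 in the same order automatically.
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe_r2 = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(np.isclose(manual_r2, pipe_r2))
```

Because the Pipeline refits the scaler on whatever data it is given in fit, it also keeps each cross-validation fold leak-free when passed to cross_val_score.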
Q2. Why is R² ≈ 0.6 on California housing not necessarily bad?
- Real-world tabular data often has irreducible noise — perfect prediction (R² = 1) is impossible
- R² is meaningless for housing
- sklearn computes R² wrong
- 0.6 is great accuracy
Why: House prices depend on many unmeasured factors (curb appeal, school district sentiment, recent news). No model with just 8 numerical features can capture all of it. 0.6 R² means we explain 60% of the variance — a reasonable baseline.
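To see why noise caps R², here is a small sketch on made-up data. The 60/40 split between measurable signal and unmeasured noise is an assumption picked to mirror the number above; it is not derived from the housing dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy setup: 60% of the target's variance comes from something we
# can measure, 40% from factors we cannot.
signal = rng.normal(scale=np.sqrt(0.6), size=n)
noise = rng.normal(scale=np.sqrt(0.4), size=n)
y = signal + noise

# Even when we predict with the exact signal, R² stays near the signal's
# share of the variance (about 0.6 here), not 1.
best_possible_r2 = r2_score(y, signal)
print(round(best_possible_r2, 2))
```

No model can recover the noise term, so in this setup a score close to 0.6 is the ceiling rather than a sign of a weak model.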
Q3. What would likely give better R² than linear regression on this dataset?
- Adding more features blindly
- Tree-based models (Random Forest, XGBoost) that capture non-linearities and feature interactions
- Removing features
- Using accuracy instead of R²
Why: Linear regression assumes a linear, additive relationship between the features and the target; real housing prices have non-linear patterns and interactions. Trees handle these natively without feature engineering. A Random Forest typically reaches about 0.8 R² on this dataset.
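The gap between the two model families is easy to reproduce on a toy problem. The sketch below uses an invented target with a squared term and an interaction (both assumptions made for illustration, not the housing data), so the exact scores say nothing about California housing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(2000, 3))
# Assumed toy target: a squared term plus an interaction, with mild noise.
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A straight-line fit cannot represent x0**2 or x1*x2.
lin_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# A forest approximates both patterns with piecewise-constant splits.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
rf_r2 = forest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear R²: {lin_r2:.3f} | Random Forest R²: {rf_r2:.3f}")
```

On this target the linear model scores near zero while the forest captures most of the variance, which is the same effect, in exaggerated form, that shows up on the housing data.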
Common doubts¶
Should I report MSE or RMSE for housing?
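RMSE. It's in the same units as the target (dollars), so it's directly interpretable. "Predictions are off by $40,000 RMSE" is clearer than "MSE is 1,600,000,000". Both rank models identically, because RMSE is just the square root of MSE. In this dataset the target is in units of $100k, so an RMSE of 0.4 would mean $40,000.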
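The arithmetic behind those two numbers, as a short sketch. The prices and both sets of predictions are hypothetical values invented for the example.

```python
import numpy as np

# Hypothetical house prices in dollars and two hypothetical models' predictions.
y_true = np.array([200_000, 350_000, 500_000, 275_000], dtype=float)
pred_a = np.array([240_000, 310_000, 540_000, 235_000], dtype=float)  # off by $40k on every row
pred_b = np.array([260_000, 290_000, 560_000, 215_000], dtype=float)  # off by $60k on every row

def mse(y, y_hat):
    """Mean squared error, in squared dollars."""
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root mean squared error, in dollars."""
    return float(np.sqrt(mse(y, y_hat)))

print(mse(y_true, pred_a), rmse(y_true, pred_a))  # 1600000000.0 40000.0
print(mse(y_true, pred_b), rmse(y_true, pred_b))  # 3600000000.0 60000.0
```

Model A wins under either metric; the square root changes the scale, not the ordering.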
Why does scaling matter for linear regression here?
It doesn't affect predictions (ordinary least squares with an intercept is invariant to feature scaling), but it makes coefficients comparable. Without scaling, a feature measured in small units, such as median income, can get a large coefficient simply because of its scale, which is misleading for interpretation.
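Both halves of that answer can be checked on toy data. In this sketch the two features, their ranges, and the true coefficients are assumptions chosen to put the features on very different scales; the model is fit and queried on the same rows only to compare predictions, not to evaluate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumed toy features on very different scales (names are illustrative only).
income = rng.uniform(1.0, 10.0, size=500)          # small numeric scale
population = rng.uniform(100.0, 5000.0, size=500)  # large numeric scale
X = np.column_stack([income, population])
y = 0.4 * income + 0.0002 * population + rng.normal(scale=0.1, size=500)

raw = LinearRegression().fit(X, y)
scaled = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X, y)

# Predictions agree up to floating-point error.
print(np.allclose(raw.predict(X), scaled.predict(X)))

# Raw coefficients are per original unit; scaled ones are per standard deviation.
print("raw   :", raw.coef_.round(4))
print("scaled:", scaled.named_steps["lr"].coef_.round(4))
```

The raw coefficients differ by orders of magnitude purely because of units, while the scaled coefficients (each equal to the raw coefficient times that feature's standard deviation) can be compared directly.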
Is the California housing dataset still relevant?
For learning, yes — it's a classic benchmark, well-understood, fast to load. For production, you'd use real-time MLS data and many more features. But the pipeline shape (load → split → preprocess → train → evaluate) is identical.