Worked Example: California Housing

Time to put it all together. We'll predict California house prices using all the features in one go.

The dataset has 20,640 districts (census block groups), each with 8 numeric features (median income, house age, average rooms per household, latitude, longitude, etc.) and a target: the median house value in units of $100,000.

End-to-end pipeline

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score

# 1. Load
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print("Features:", list(X.columns))

# 2. Split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {X_tr.shape[0]}  |  Test: {X_te.shape[0]}")

# 3. Pipeline: scale + linear regression
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
])

# 4. Cross-validated R² on the training set (5 folds)
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="r2")
print(f"\nCV R² mean: {cv_scores.mean():.3f}  (±{cv_scores.std():.3f})")

# 5. Final fit on full training set, evaluate on held-out test
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print(f"Test MAE: {mean_absolute_error(y_te, y_pred):.3f}  (in $100k)")
print(f"Test R² : {r2_score(y_te, y_pred):.3f}")

Expected output (approximate):

Dataset: 20640 samples, 8 features
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
           'AveOccup', 'Latitude', 'Longitude']
Train: 16512  |  Test: 4128

CV R² mean: 0.598  (±0.013)
Test MAE: 0.533  (in $100k)
Test R² : 0.576
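To make step 4 less of a black box, here is a minimal sketch of the loop that cross_val_score runs for you. It uses a synthetic dataset from make_regression as a stand-in (an assumption made to keep the example small and offline, not the housing data), and the names X_demo, y_demo and demo_pipe are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(demo_pipe)                          # fresh, unfitted copy per fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])   # scaler sees only this fold's training rows
    y_val_pred = fold_model.predict(X_demo[val_idx])       # score on the held-out fold
    manual_scores.append(r2_score(y_demo[val_idx], y_val_pred))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring="r2")

print("manual :", manual_scores.round(3))
print("library:", library_scores.round(3))
```

The two lines should agree, because for a regressor cv=5 means five consecutive, unshuffled folds. Passing the whole pipeline (rather than pre-scaled data) is what keeps the scaler from seeing each validation fold.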

Inspect the coefficients

Which features matter most?

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X_tr, y_tr)
coefs = pd.Series(pipe.named_steps["lr"].coef_, index=X.columns)
print(coefs.abs().sort_values(ascending=False).round(3))

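Because the features were standardized, the coefficient magnitudes are directly comparable. You'll see MedInc (median income) and Latitude/Longitude (location) dominate, which matches intuition: house prices are mostly driven by income and location. Note that .abs() hides the signs; print coefs directly to see them.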

What did the model learn?

Some quick takeaways from the coefficients:

Feature	Coef sign	Interpretation
`MedInc`	➕	Wealthier areas → higher prices.
`Latitude`	➖	Higher latitude (further north in CA) → lower prices.
`AveRooms`	➖	More rooms → lower prices? Counter-intuitive — confounded with rural areas.

Surprising coefficients are where feature engineering (and sometimes non-linear models) can help: a simple linear model can't fully untangle confounders.
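One way to see how a confounder can flip a coefficient's sign is a small synthetic experiment. Everything below is made up for illustration: the rural flag, the effect sizes (+0.5 for rooms, -3 for rural) and the variable names are assumptions, not estimates from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: 1 = rural district, 0 = urban district
rural = rng.integers(0, 2, size=n).astype(float)
# In this toy setup, rural districts have more rooms per household on average
rooms = 2.0 * rural + rng.normal(0.0, 0.5, size=n)
# True effect of rooms is +0.5, but being rural lowers the price by 3
price = 0.5 * rooms - 3.0 * rural + rng.normal(0.0, 0.5, size=n)

# Model A: rooms only, so the rural effect leaks into the rooms coefficient
coef_rooms_only = LinearRegression().fit(rooms.reshape(-1, 1), price).coef_[0]

# Model B: rooms and the confounder together
X_both = np.column_stack([rooms, rural])
coef_rooms_adjusted = LinearRegression().fit(X_both, price).coef_[0]

print(f"rooms coefficient, confounder omitted : {coef_rooms_only:+.2f}")
print(f"rooms coefficient, confounder included: {coef_rooms_adjusted:+.2f}")
```

With the confounder omitted the rooms coefficient comes out negative; once the rural flag is included it returns to roughly the true +0.5. This is the kind of feature engineering the paragraph above has in mind.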

What you learned

Full pipeline: split → scale + model → cross-validate → test.
cross_val_score(model, X, y, cv=5, scoring=...) gives an honest estimate without touching the test set.
Coefficients on standardized features tell you which features the model relies on most.
0.58 R² on raw features is a reasonable baseline. Tree-based models would do better (see Ensembles).

Practice

What does this print?

Expected: True

from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
print(X.shape[1] == 8)      # California Housing has 8 features

Fix the bug: evaluate on a held-out test split, not on the data the model was trained on

Expected: True

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True)
model = LinearRegression().fit(X, y)
train_score = model.score(X, y)        # bug: testing on the same data we trained on
print(train_score > 0.5)

Quiz: Quick check

What you remember

Q1. What's the right order of steps in an ML pipeline?

Train → split → evaluate
Split → fit preprocessing on train → train model → evaluate on test
Evaluate first
Order doesn't matter

Why: Splitting first ensures the test set is truly unseen. Fitting preprocessing on the training set only prevents data leakage. Evaluate on the held-out test set once, at the end.
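The leakage point can be checked directly. The sketch below uses synthetic data (an assumption, to keep it small) and shows that a Pipeline fitted on the training split learns its scaling statistics from those rows alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=200)

# Split first, so the test rows are never seen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe.fit(X_tr, y_tr)   # fits the scaler and the model on training rows only

scaler_means = pipe.named_steps["scale"].mean_
print("scaler means   :", scaler_means.round(3))
print("train-set means:", X_tr.mean(axis=0).round(3))
print("full-data means:", X_demo.mean(axis=0).round(3))

# At evaluation time the test rows are scaled with the training statistics
test_r2 = pipe.score(X_te, y_te)
print(f"test R²: {test_r2:.3f}")
```

The scaler's means match the training-set means, not the full-data means, which is exactly the behaviour you want.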

Q2. Why is R² ≈ 0.6 on California housing not necessarily bad?

Real-world tabular data often has irreducible noise — perfect prediction (R² = 1) is impossible
R² is meaningless for housing
sklearn computes R² wrong
0.6 is great accuracy

Why: House prices depend on many unmeasured factors (curb appeal, school district sentiment, recent news). No model with just 8 numerical features can capture all of it. 0.6 R² means we explain 60% of the variance — a reasonable baseline.
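A small simulation makes the noise ceiling concrete. The 60/40 split between explainable signal and noise below is an assumption chosen to mirror the number in the question, not a measured property of the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

signal = rng.normal(0.0, np.sqrt(0.6), size=n)   # explainable part, variance 0.6
noise = rng.normal(0.0, np.sqrt(0.4), size=n)    # unmeasured factors, variance 0.4
y = signal + noise

# "Oracle" prediction: the true signal itself, the best any model could hope to recover
ss_res = np.sum((y - signal) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
oracle_r2 = 1.0 - ss_res / ss_tot

print(f"R² of the oracle predictor: {oracle_r2:.3f}")
```

Even a predictor that knows the true signal lands near R² = 0.6 here, because the remaining 40% of the variance is noise that no feature explains.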

Q3. What would likely give better R² than linear regression on this dataset?

Adding more features blindly
Tree-based models (Random Forest, XGBoost) that capture non-linearities and feature interactions
Removing features
Using accuracy instead of R²

Why: Linear regression assumes each feature contributes linearly and additively; real housing prices have non-linear patterns and interactions. Trees handle these natively without feature engineering. Typically Random Forest gets ~0.8 R² on this dataset.
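To see why interactions matter, here is a rough sketch on synthetic data whose target is built from a product of two features and a squared term. The data-generating formula and the forest settings are assumptions chosen for illustration; the scores say nothing about the housing dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2.0, 2.0, size=(2000, 3))
# Target built from an interaction and a squared term, plus a little noise
y_demo = X_demo[:, 0] * X_demo[:, 1] + X_demo[:, 2] ** 2 + rng.normal(0.0, 0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# .score() returns R² for regressors
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear regression R²: {linear_r2:.3f}")
print(f"Random forest R²    : {forest_r2:.3f}")
```

The linear model has almost nothing to work with because neither term is linear in any single feature, while the forest picks up both without any manual feature engineering.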

Common doubts

Should I report MSE or RMSE for housing?

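RMSE. It is in the same units as the target, so it is directly interpretable: "Predictions are off by $40,000 RMSE" is clearer than "MSE is 1,600,000,000". In this tutorial the target is in units of $100k, so multiply the RMSE by 100,000 to get dollars. Because RMSE is just the square root of MSE, both rank models identically.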
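The arithmetic behind those two numbers can be checked in a few lines. The prices and errors below are hypothetical values picked so that every prediction is off by exactly $40,000.

```python
import numpy as np

# Hypothetical prices in dollars; every prediction is off by exactly $40,000
y_true = np.array([250_000.0, 310_000.0, 180_000.0, 420_000.0])
y_pred = y_true + np.array([40_000.0, -40_000.0, 40_000.0, -40_000.0])

mse = np.mean((y_true - y_pred) ** 2)   # units: dollars squared
rmse = np.sqrt(mse)                     # units: dollars

print(f"MSE : {mse:,.0f}")    # MSE : 1,600,000,000
print(f"RMSE: {rmse:,.0f}")   # RMSE: 40,000
```

Since the square root is monotonic, a model with a lower MSE always has a lower RMSE too, which is why the two metrics rank models the same way.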

Why does scaling matter for linear regression here?

Scaling doesn't change the predictions (ordinary least squares is invariant to rescaling the features), but it makes the coefficients comparable. Without scaling, a feature whose values are numerically small, such as median income, can get a large coefficient purely because of its scale, which is misleading for interpretation.
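Both halves of that answer can be checked with a quick sketch. The two synthetic features below only loosely imitate the scales of MedInc and Population; their distributions and the true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales
income = rng.normal(4.0, 2.0, size=n)             # small numbers, loosely like MedInc
population = rng.normal(1500.0, 1000.0, size=n)   # large numbers, loosely like Population
X_demo = np.column_stack([income, population])
y_demo = 0.4 * income + 0.0002 * population + rng.normal(0.0, 0.1, size=n)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())]).fit(X_demo, y_demo)

pred_raw = raw.predict(X_demo)
pred_scaled = scaled.predict(X_demo)

print("Same predictions:", np.allclose(pred_raw, pred_scaled))
print("Raw coefficients   :", raw.coef_)
print("Scaled coefficients:", scaled.named_steps["lr"].coef_)
```

The predictions agree, but the raw coefficient for the large-scale feature is tiny only because of its units. After standardization each coefficient reads as the change in the target per one standard deviation of that feature, so the two can be compared.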

Is the California housing dataset still relevant?

For learning, yes — it's a classic benchmark, well-understood, fast to load. For production, you'd use real-time MLS data and many more features. But the pipeline shape (load → split → preprocess → train → evaluate) is identical.
