Polynomial Regression & Regularization
1. Why this matters
Plain linear regression has two failure modes:
- Underfit: the relationship isn't linear. Adding polynomial features can fix this.
- Overfit: too many features (especially polynomial ones), coefficients explode, and the training fit looks great while the test fit is terrible. Regularization counters this.
Together: polynomial features + regularization = a flexible, well-behaved linear model.
2. Mental model
Polynomial features turn a curve into a hyperplane in a higher-dimensional space:
flowchart LR
A[Curved relationship<br/>y = f x non-linear] --> B[Add x², x³, x1·x2 features]
B --> C[Linear regression on enriched X]
C --> D[Effectively fits a curve in original space]
Regularization adds a penalty so coefficients can't grow unboundedly:
Loss_OLS = mean((y - ŷ)²)
Loss_Ridge = mean((y - ŷ)²) + α · Σ βᵢ² ← L2 penalty
Loss_Lasso = mean((y - ŷ)²) + α · Σ |βᵢ| ← L1 penalty
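Loss_Elastic = mean((y - ŷ)²) + α·(r·Σ|βᵢ| + (1-r)·Σβᵢ²) ← mix of both, r = l1_ratio
Higher α → stronger penalty → more shrinkage. The intercept is not penalized. These are simplified forms: scikit-learn's objectives carry extra constant factors (Ridge uses the sum of squared errors, Lasso and ElasticNet use 1/(2n) times it), so the same α value means different strengths in different estimators.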
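To make the four losses concrete, here is a minimal NumPy sketch that evaluates them for one fixed coefficient vector. The helper name `penalized_losses` and the toy numbers are illustrative assumptions, and the formulas are the simplified ones above rather than scikit-learn's exact objectives.

```python
import numpy as np

def penalized_losses(y, y_hat, beta, alpha=1.0, r=0.5):
    """Evaluate the simplified OLS / Ridge / Lasso / ElasticNet losses.

    Hypothetical helper for illustration; `beta` excludes the intercept.
    """
    mse = np.mean((y - y_hat) ** 2)
    l2 = np.sum(beta ** 2)       # Σ βᵢ²
    l1 = np.sum(np.abs(beta))    # Σ |βᵢ|
    return {
        "ols": mse,
        "ridge": mse + alpha * l2,
        "lasso": mse + alpha * l1,
        "elastic": mse + alpha * (r * l1 + (1 - r) * l2),
    }

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 4.0])   # one prediction is off by 1
beta = np.array([3.0, -4.0])        # toy coefficients: Σβ² = 25, Σ|β| = 7

losses = penalized_losses(y, y_hat, beta, alpha=0.1, r=0.5)
for name, value in losses.items():
    print(f"{name:8s} {value:.4f}")
```

The data-fit term is identical in all four; only the penalty on `beta` changes, which is why large coefficients cost far more under Ridge (squared) than under Lasso (absolute value).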
3. Polynomial Regression
Use sklearn's PolynomialFeatures to add x², x³, x₁·x₂, ... then fit any linear model.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
# Fake data: y = 2x² - 3x + 5 + noise
np.random.seed(42)
X = np.random.uniform(-3, 3, 100).reshape(-1, 1)
y = 2 * X.ravel()**2 - 3 * X.ravel() + 5 + np.random.normal(0, 2, 100)
# Plain linear regression misses the curvature of a parabola
plain = LinearRegression().fit(X, y)
print("Plain linear R²:", plain.score(X, y))  # clearly lower: only the -3x trend is captured
# Polynomial degree 2 matches the true functional form
poly_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
])
poly_pipe.fit(X, y)
print("Poly degree 2 R²:", poly_pipe.score(X, y))  # much higher; what remains is noise
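Key knob: degree. Higher = more flexible = more prone to overfit. In practice degree 2-3 is usually enough; treat anything higher with suspicion and confirm it with cross-validation.
PolynomialFeatures(degree=d, interaction_only=True) keeps the original features and the products of distinct features (x₁·x₂), skipping pure powers (x²).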
4. The overfitting problem (motivation for regularization)
from sklearn.model_selection import train_test_split
# Hold out a test set from the X, y generated in section 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for d in [1, 2, 5, 10]:
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("scale", StandardScaler()),
        ("lr", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    print(f"degree {d:2d} → train R²={pipe.score(X_train, y_train):.3f} test R²={pipe.score(X_test, y_test):.3f}")
Typical output: training R² keeps climbing, while test R² peaks around degree 2-3 and then degrades. Classic overfit.
5. Ridge Regression (L2)
Adds α · Σ βᵢ² to the loss. Shrinks all coefficients toward zero but rarely makes them exactly zero.
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Manual α
ridge = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_train, y_train)
# Built-in CV across α grid
ridge_cv = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])),
]).fit(X_train, y_train)
print("Best α:", ridge_cv.named_steps["ridge"].alpha_)
print("R² test:", ridge_cv.score(X_test, y_test))
Use Ridge when:
- Many features, possibly correlated.
- You don't need feature selection.
- Default when in doubt.
6. Lasso Regression (L1)
Adds α · Σ |βᵢ|. The absolute value pushes coefficients to exactly zero — automatic feature selection.
from sklearn.linear_model import Lasso, LassoCV
lasso_cv = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, max_iter=10_000)),  # α grid is chosen automatically
]).fit(X_train, y_train)
print("Best α:", lasso_cv.named_steps["lasso"].alpha_)
print("# non-zero coefs:", (lasso_cv.named_steps["lasso"].coef_ != 0).sum())
Use Lasso when:
- You suspect many features are irrelevant.
- You want a sparse, interpretable model.
- Combining feature selection + regression in one step.
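The "exactly zero" behavior can be seen in one dimension. Assuming a single coefficient with unpenalized estimate z, minimizing 0.5·(β − z)² + α·|β| gives the soft-thresholding rule, while the squared penalty 0.5·(β − z)² + α·β² only rescales z. The two helper functions below are illustrative sketches of these one-dimensional updates, not scikit-learn internals.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + alpha*|b| (1-D Lasso-style update)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_shrink(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + alpha*b**2 (1-D Ridge-style update)."""
    return z / (1.0 + 2.0 * alpha)

z = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])   # unpenalized estimates (toy values)
alpha = 0.5

lasso_b = soft_threshold(z, alpha)
ridge_b = ridge_shrink(z, alpha)
print("lasso:", lasso_b)   # entries with |z| <= alpha become exactly 0
print("ridge:", ridge_b)   # every entry is halved; none reaches 0 unless z was 0
```

Any estimate smaller in magnitude than α is snapped to zero by the L1 rule, which is the mechanism behind Lasso's feature selection.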
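7. ElasticNet
Combines the L1 and L2 penalties, with l1_ratio (r) setting the mix:
Loss_Elastic = mean((y - ŷ)²) + α·(r·Σ|βᵢ| + (1-r)·Σβᵢ²)
l1_ratio=0 → pure L2 penalty (Ridge-like). l1_ratio=1 → pure Lasso. 0.5 → balanced. For a pure L2 fit, use Ridge itself; scikit-learn's docs note that very small l1_ratio values are not reliable in ElasticNet unless you supply your own α sequence.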
from sklearn.linear_model import ElasticNet, ElasticNetCV
en_cv = Pipeline([
    ("scale", StandardScaler()),
    ("en", ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.99],
        cv=5, max_iter=10_000,  # α grid is chosen automatically per l1_ratio
    )),
]).fit(X_train, y_train)
print("Best α:", en_cv.named_steps["en"].alpha_)
print("Best l1_ratio:", en_cv.named_steps["en"].l1_ratio_)
Use ElasticNet when:
- Many features, some correlated (Lasso alone can arbitrarily pick one of a correlated pair; ElasticNet is more stable).
- You want feature selection but not as aggressive as pure Lasso.
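The correlated-pair behavior can be probed with a small synthetic experiment. This is a sketch under stated assumptions: the data are made up (two nearly identical columns plus one unrelated column), and α=0.1 with l1_ratio=0.5 are arbitrary choices. On data like this, Lasso tends to concentrate the weight on one of the twin columns while ElasticNet tends to spread it across both; exact numbers depend on the data and solver, so none are quoted here.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Columns 0 and 1 are near-duplicates; column 2 is unrelated noise (synthetic data)
X_corr = np.column_stack([x, x + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y_corr = 3.0 * x + 0.1 * rng.normal(size=n)

X_std = StandardScaler().fit_transform(X_corr)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X_std, y_corr)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X_std, y_corr)

print("Lasso coefs:     ", np.round(lasso.coef_, 3))
print("ElasticNet coefs:", np.round(enet.coef_, 3))
```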
8. Visualizing what α does
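import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Scale first so the penalty treats every feature equally
X_train_scaled = StandardScaler().fit_transform(X_train)
alphas = np.logspace(-3, 4, 50)
coefs = []
for a in alphas:
    r = Ridge(alpha=a).fit(X_train_scaled, y_train)
    coefs.append(r.coef_)
coefs = np.array(coefs)  # shape: (n_alphas, n_features)
plt.semilogx(alphas, coefs)  # one line per feature
plt.xlabel("α"); plt.ylabel("coefficient")
plt.title("Ridge: coefficients shrink as α increases")
plt.show()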
You'll see all coefficients smoothly approach zero. For Lasso, you'd see some hit zero at specific α values.
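For the Lasso side of that comparison, a minimal sketch is to count non-zero coefficients at a few α values. The dataset here is synthetic and assumed for illustration: five standardized features, of which only the first two influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise (synthetic data)
y_path = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.5 * rng.normal(size=200)
X_path = StandardScaler().fit_transform(X_raw)

alphas = [0.001, 0.1, 1.0, 10.0]
nonzero_counts = []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10_000).fit(X_path, y_path)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))
    print(f"α={a:<6} non-zero coefficients: {nonzero_counts[-1]}")
```

Once α exceeds the largest feature-target covariance, every coefficient is zero and only the intercept remains.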
9. Choosing α: always with CV
from sklearn.linear_model import Ridge, RidgeCV, LassoCV, ElasticNetCV
# RidgeCV uses efficient leave-one-out CV by default
RidgeCV(alphas=np.logspace(-4, 4, 50))
# LassoCV uses k-fold CV over an automatically chosen α grid (100 values by default)
LassoCV(cv=5)
# Or use GridSearchCV / RandomizedSearchCV for custom ranges
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("ridge", Ridge())]),
    param_grid={"ridge__alpha": np.logspace(-3, 3, 50)},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_train, y_train)
print(grid.best_params_)
10. Common pitfalls
- ❗ Forgetting to scale before regularization. The penalty depends on coefficient magnitudes; without scaling, features with bigger ranges need smaller coefficients and so are unfairly penalized less. Always pipe StandardScaler → Ridge/Lasso/ElasticNet.
- ❗ Hand-picking α without CV. α=1.0 is a defensible default, but the best α can differ by many orders of magnitude from one problem to the next. Always CV.
- ❗ Using PolynomialFeatures with degree > 3 by default. Combinatorial feature explosion: with 20 features, degree 3 already yields 1,770 polynomial features (1,771 with the bias column).
- ❗ Trusting Lasso to keep "the right" features when columns are highly correlated. Lasso arbitrarily picks one of a correlated pair. ElasticNet is more stable.
- ❗ Mixing scaled and unscaled features. If only some features are scaled, the penalty disproportionately hits the unscaled ones.
- ❗ Forgetting include_bias=False on PolynomialFeatures. The default generates a column of 1s that's redundant with the intercept: harmless but annoying.
- ❗ Comparing training R² across different α values during tuning. Training R² can only get worse as α grows, so it always favors the weakest penalty. Compare cross-validated scores instead (e.g. neg_root_mean_squared_error, neg_mean_absolute_error).
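The feature-explosion pitfall is easy to quantify: the number of monomials of total degree at most d in n variables is C(n + d, d), including the constant term. The helper `n_poly_features` below is a hypothetical name introduced for this sketch; the final lines cross-check it against PolynomialFeatures on a dummy input.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree, include_bias=False):
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)   # includes the constant term
    return total if include_bias else total - 1

for degree in [1, 2, 3, 4, 5]:
    print(f"20 features, degree {degree}: {n_poly_features(20, degree):>6,d} columns")

# Cross-check the formula against scikit-learn on a dummy one-row input
poly = PolynomialFeatures(degree=3, include_bias=False).fit(np.zeros((1, 20)))
print(poly.n_output_features_)   # 1770
```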
11. When to use what
| Model | When |
|---|---|
| Plain LinearRegression | Baseline. Always start here. |
| Ridge(alpha=...) | Default with regularization. Many correlated features. |
| Lasso(alpha=...) | Want sparse feature selection. Hundreds+ features, most irrelevant. |
| ElasticNet(alpha, l1_ratio) | Want sparsity but stable with correlated features. |
| PolynomialFeatures(degree=2-3) + regularized linear | Non-linear relationship, want interpretable model. |
| Switch to trees/GBM | Many interactions, can't enumerate them by hand. |
12. Cheatsheet
from sklearn.linear_model import (
LinearRegression,
Ridge, RidgeCV,
Lasso, LassoCV,
ElasticNet, ElasticNetCV,
)
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np
# Polynomial features
PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
# Regularized pipeline (the canonical pattern)
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])
# Built-in CV variants: easiest way to tune α
RidgeCV(alphas=np.logspace(-4, 4, 50))  # efficient leave-one-out CV by default (fast)
LassoCV(cv=5, max_iter=10_000)
ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)
# Custom grid search
grid = GridSearchCV(
    pipe,
    param_grid={
        "poly__degree": [1, 2, 3],
        "model__alpha": np.logspace(-3, 3, 30),
    },
    scoring="neg_root_mean_squared_error", cv=5,
).fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)
# Inspect coefficients of the refit best pipeline (GridSearchCV fits clones, so pipe itself stays unfitted)
best = grid.best_estimator_
coefs = best.named_steps["model"].coef_
feature_names = best.named_steps["poly"].get_feature_names_out()
sorted_by_mag = sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True)
for name, coef in sorted_by_mag[:10]:
    print(f"{name:30s} {coef:+.3f}")
13. Q&A: recall test
- Q: Difference between Ridge and Lasso? A: Ridge uses an L2 penalty (Σβ²): it shrinks all coefficients smoothly but rarely to exactly zero. Lasso uses an L1 penalty (Σ|β|): it shrinks AND sets some to exactly zero (feature selection).
- Q: Why must features be scaled before regularization? A: The penalty is on coefficient magnitudes. Unscaled features have wildly different ranges, so their coefficients do too, and the penalty hits some unfairly. Scaling makes the penalty uniform.
- Q: When does ElasticNet beat Lasso? A: When you have groups of correlated features. Lasso arbitrarily picks one from each group; ElasticNet (with l1_ratio < 1) tends to keep correlated features together with smaller coefficients.
- Q: How do you tune α? A: Cross-validation. Use RidgeCV/LassoCV/ElasticNetCV for built-in efficient CV, or GridSearchCV over a log-spaced grid (np.logspace(-4, 4, 50)).
- Q: Is polynomial degree a hyperparameter to tune? A: Yes. CV over degree ∈ {1, 2, 3} jointly with α. Higher degrees explode feature count and are rarely worth it.
- Q: What happens to predictions as α → ∞? A: All coefficients → 0; the model degenerates to predicting the mean of y for any input.
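The last answer can be checked numerically. This sketch fits Ridge at increasing α on made-up data (the coefficients 2, -1, 0.5 and the intercept 5 are arbitrary assumptions) and reports how far the predictions spread around a constant.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_big = rng.normal(size=(100, 3))
y_big = 5.0 + X_big @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

for a in [1e-2, 1e2, 1e6, 1e12]:
    model = Ridge(alpha=a).fit(X_big, y_big)
    spread = np.ptp(model.predict(X_big))   # max minus min of the predictions
    print(f"α={a:.0e}  max|coef|={np.abs(model.coef_).max():.2e}  prediction spread={spread:.2e}")

# With an enormous α, every prediction collapses onto the training mean of y
flat = Ridge(alpha=1e12).fit(X_big, y_big)
collapsed = np.allclose(flat.predict(X_big), y_big.mean())
print(collapsed)
```

The intercept is not penalized, which is why the limit is the mean of y rather than zero.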
Practice
Use Ridge regression instead of LinearRegression to handle multicollinearity
Expected: True
from sklearn.linear_model import LinearRegression, Ridge
import numpy as np
np.random.seed(0)
X = np.random.randn(20, 3)
X[:, 2] = X[:, 0] + 0.001 * np.random.randn(20) # near-duplicate of feature 0
y = X[:, 0] + np.random.randn(20) * 0.1
model = LinearRegression().fit(X, y) # bug: coefficients become huge/unstable
print(abs(model.coef_).max() < 100)
Quiz: Quick check
What you remember
Q1. What does L2 regularization (Ridge) do?
- Adds α × Σ(βᵢ²) to the loss: shrinks coefficients toward 0 but rarely makes them exactly 0
- Removes features completely
- Increases the learning rate
- Adds noise to the data
Why: L2 keeps all features but with smaller coefficients — stabilizes against multicollinearity. L1 (Lasso) can drive coefficients exactly to 0, effectively doing feature selection.
Q2. When does polynomial regression overfit?
- When degree is 1
- As degree increases, the model fits training noise perfectly but generalizes poorly
- Never
- Only with small datasets
Why: A degree-20 polynomial through 20 training points fits perfectly — and oscillates wildly between them. The training R² approaches 1; the test R² collapses. Use cross-validation to pick the right degree.
Q3. Why does L1 (Lasso) regularization do feature selection?
- The L1 penalty creates corners at zero in the loss surface, so the optimum often sits exactly at coefficient=0 for unimportant features
- It deletes columns
- It uses k-NN under the hood
- It's faster
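Why: Geometrically, the L1 "diamond" constraint tends to touch the loss contours at its corners, which lie on the axes. Mathematically, |β| is not differentiable at 0: its subgradient there is the whole interval [-1, 1], so β = 0 satisfies the optimality condition whenever the gradient of the data-fit term for that coefficient is no larger than α in magnitude, and the optimizer parks the coefficient there.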
Common doubts
Ridge or Lasso — which should I try first?
Try Ridge first — it's more stable and rarely worse. Try Lasso if you want automatic feature selection or have many features and suspect most are irrelevant. ElasticNet combines both — good default when unsure.
How do I choose α (the regularization strength)?
Cross-validation. Use RidgeCV or LassoCV which try a grid of α values automatically. Common range: 10^-4 to 10^4 on a log scale. Higher α = more regularization = simpler model.
Why does my model do worse with regularization?
Three common causes: (1) features aren't scaled — regularization penalizes large coefficients but doesn't know that "salary" is in a different scale than "age"; (2) α too high — over-regularized, underfit; (3) the model wasn't overfitting to begin with — regularization can only help if there's overfitting to fix.
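Cause (1) can be reproduced on synthetic data. In this sketch, which rests on made-up numbers, two features carry equal information but one is recorded in units a thousand times smaller, so it needs a coefficient a thousand times larger. Without scaling, Ridge shrinks that large coefficient heavily and the fit suffers; inside a StandardScaler pipeline both features are treated alike. The in-sample R² is used only to show the effect of the penalty, not as a tuning metric.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
x_unit = rng.normal(size=n)            # feature on a unit scale
x_tiny = 0.001 * rng.normal(size=n)    # equally informative, but in tiny units
X_mix = np.column_stack([x_unit, x_tiny])
# Both features contribute equally to y once units are accounted for
y_mix = x_unit + 1000.0 * x_tiny + 0.1 * rng.normal(size=n)

unscaled = Ridge(alpha=1.0).fit(X_mix, y_mix)
scaled = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_mix, y_mix)

r2_unscaled = unscaled.score(X_mix, y_mix)
r2_scaled = scaled.score(X_mix, y_mix)
print(f"R² without scaling: {r2_unscaled:.3f}")
print(f"R² with scaling:    {r2_scaled:.3f}")
```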