Polynomial Regression & Regularization

1. Why this matters

Plain linear regression has two failure modes:

Underfit: the relationship isn't linear. Adding polynomial features fixes this.
Overfit: too many features (especially polynomial ones) let coefficients explode, so training R² looks great while test R² is terrible. Regularization fixes this.

Together: polynomial features + regularization = a flexible, well-behaved linear model.

2. Mental model

Polynomial features turn a curve into a hyperplane in a higher-dimensional space:

flowchart LR
    A[Curved relationship<br/>y = f x non-linear] --> B[Add x², x³, x1·x2 features]
    B --> C[Linear regression on enriched X]
    C --> D[Effectively fits a curve in original space]
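To make the diagram concrete, here is a minimal sketch on noise-free synthetic data (the quadratic and its coefficients are illustrative assumptions, not a real dataset). An ordinary linear model fit on the columns [x, x²] recovers the parabola's coefficients, even though the relationship is curved in the original x.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative, noise-free data: y = 2x² - 3x + 5
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 * x.ravel() ** 2 - 3 * x.ravel() + 5

# Enrich X with a squared column: each row becomes [x, x²]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# The model is still linear, just in the enriched columns
model = LinearRegression().fit(X_poly, y)
print(model.coef_, model.intercept_)  # close to [-3, 2] and 5
```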

Regularization adds a penalty so coefficients can't grow unboundedly:

Loss_OLS    = mean((y - ŷ)²)
Loss_Ridge  = mean((y - ŷ)²) + α · Σ βᵢ²              ← L2 penalty
Loss_Lasso  = mean((y - ŷ)²) + α · Σ |βᵢ|             ← L1 penalty
Loss_Elastic= mean((y - ŷ)²) + α·(r·Σ|βᵢ| + (1-r)·Σβᵢ²)

Higher α → stronger penalty → more shrinkage.
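The four losses can be evaluated by hand for a given coefficient vector. The sketch below uses a hypothetical helper, `penalized_losses`, that follows the formulas exactly as written above; scikit-learn's own objectives use different scaling constants (for example, Lasso divides the squared error by 2n), so the numbers are for intuition only.

```python
import numpy as np

def penalized_losses(y, y_hat, beta, alpha=1.0, r=0.5):
    """Evaluate the four losses as written above (hypothetical helper)."""
    mse = np.mean((y - y_hat) ** 2)
    l1 = np.sum(np.abs(beta))   # Σ|βᵢ|
    l2 = np.sum(beta ** 2)      # Σβᵢ²
    return {
        "ols": mse,
        "ridge": mse + alpha * l2,
        "lasso": mse + alpha * l1,
        "elastic": mse + alpha * (r * l1 + (1 - r) * l2),
    }

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])   # one prediction is off by 2
beta = np.array([3.0, -4.0])        # Σ|β| = 7, Σβ² = 25

losses = penalized_losses(y, y_hat, beta, alpha=0.1, r=0.5)
for name, value in losses.items():
    print(f"{name:8s} {value:.4f}")
```

Because the L2 term squares each coefficient, the large coefficients here cost more under Ridge than under Lasso at the same α.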

3. Polynomial Regression

Use sklearn's PolynomialFeatures to add x², x³, x₁·x₂, ... then fit any linear model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Fake data: y = 2x² - 3x + 5 + noise
np.random.seed(42)
X = np.random.uniform(-3, 3, 100).reshape(-1, 1)
y = 2 * X.ravel()**2 - 3 * X.ravel() + 5 + np.random.normal(0, 2, 100)

# Plain linear underfits a parabola: it captures only the -3x term
plain = LinearRegression().fit(X, y)
print("Plain linear R²:", plain.score(X, y))           # roughly 0.4-0.5

# Polynomial degree 2 matches the true curve; only the noise is left unexplained
poly_pipe = Pipeline([
    ("poly",  PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
])
poly_pipe.fit(X, y)
print("Poly degree 2 R²:", poly_pipe.score(X, y))      # roughly 0.9, capped by the noise

Key knob: degree. Higher means more flexible and more prone to overfitting. In practice, degree 2 or 3 is usually enough.

PolynomialFeatures(degree=d, interaction_only=True) keeps the original features and their cross-terms (x₁·x₂) but skips pure powers (x²).

4. The overfitting problem (motivation for regularization)

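from sklearn.model_selection import train_test_split

# Hold out a test set from the X, y generated in section 3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)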

for d in [1, 2, 5, 10]:
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("scale", StandardScaler()),
        ("lr", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    print(f"degree {d:2d} → train R²={pipe.score(X_train,y_train):.3f}  test R²={pipe.score(X_test,y_test):.3f}")

Typical output: train R² keeps climbing with degree, while test R² peaks around degree 2-3 and then drops. Classic overfit.
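A single train/test split can be noisy, so the same comparison is often done with cross-validation. The sketch below uses `validation_curve` on a regenerated copy of the synthetic parabola (same recipe as section 3, but a different random generator, so the exact numbers will differ); the degree list and `cv=5` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic parabola, regenerated so this block runs on its own
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, 100).reshape(-1, 1)
y = 2 * X.ravel() ** 2 - 3 * X.ravel() + 5 + rng.normal(0, 2, 100)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
])

degrees = [1, 2, 5, 10]
train_scores, val_scores = validation_curve(
    pipe, X, y, param_name="poly__degree", param_range=degrees, cv=5,
)
train_mean = train_scores.mean(axis=1)   # average R² over the 5 folds
val_mean = val_scores.mean(axis=1)

for d, tr, va in zip(degrees, train_mean, val_mean):
    print(f"degree {d:2d} → train R²={tr:.3f}  CV R²={va:.3f}")
```

The degree with the best cross-validated score, not the best training score, is the one to keep.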

5. Ridge Regression (L2)

Adds α · Σ βᵢ² to the loss. Shrinks all coefficients toward zero but rarely makes them exactly zero.

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Manual α
ridge = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_train, y_train)

# Built-in CV across α grid
ridge_cv = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])),
]).fit(X_train, y_train)

print("Best α:", ridge_cv.named_steps["ridge"].alpha_)
print("R² test:", ridge_cv.score(X_test, y_test))

Use Ridge when:

- Many features, possibly correlated.
- You don't need feature selection.
- Default when in doubt.

6. Lasso Regression (L1)

Adds α · Σ |βᵢ|. The absolute-value penalty can push coefficients to exactly zero, which gives automatic feature selection.

from sklearn.linear_model import Lasso, LassoCV

lasso_cv = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, max_iter=10_000)),   # α grid is generated automatically
]).fit(X_train, y_train)

print("Best α:", lasso_cv.named_steps["lasso"].alpha_)
print("# non-zero coefs:", (lasso_cv.named_steps["lasso"].coef_ != 0).sum())

Use Lasso when:

- You suspect many features are irrelevant.
- You want a sparse, interpretable model.
- You want feature selection and regression in one step.
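The sparsity effect is easy to observe on synthetic data where most features are irrelevant by construction. This is a sketch under assumed settings (50 features, 5 informative, α = 1.0 for both models, all chosen for illustration); it counts how many coefficients each model leaves non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only 5 of which actually drive y
X, y = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))
print("non-zero coefficients, Lasso:", n_lasso)
print("non-zero coefficients, Ridge:", n_ridge)
```

Ridge keeps every coefficient non-zero, while Lasso drops a large share of them; how many survive depends on α.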

7. ElasticNet

Best of both — combines L1 and L2 penalties:

Loss = MSE + α · ( r·Σ|βᵢ| + (1-r)·Σβᵢ² )
                      ↑              ↑
                      L1 ratio      L2 ratio

l1_ratio=0 → pure Ridge. l1_ratio=1 → pure Lasso. 0.5 → balanced.

from sklearn.linear_model import ElasticNet, ElasticNetCV

en_cv = Pipeline([
    ("scale", StandardScaler()),
    ("en",    ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.99],
        cv=5, max_iter=10_000,   # α grid is generated automatically
    )),
]).fit(X_train, y_train)

print("Best α:", en_cv.named_steps["en"].alpha_)
print("Best l1_ratio:", en_cv.named_steps["en"].l1_ratio_)

Use ElasticNet when:

- Many features, some correlated (Lasso alone can arbitrarily pick one of a correlated pair; ElasticNet is more stable).
- You want feature selection, but less aggressive than pure Lasso.
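The stability difference shows up in the extreme case of two identical columns. The sketch below is a toy setup (the data, α = 0.1 and l1_ratio = 0.5 are illustrative assumptions): Lasso has no reason to prefer either column, while ElasticNet's L2 term makes splitting the weight evenly the cheaper solution.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])                  # two identical columns
y = 3 * x + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# Tight tolerance so coordinate descent fully converges on the duplicated columns
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10_000).fit(X, y)

print("Lasso coefs:     ", lasso.coef_)
print("ElasticNet coefs:", enet.coef_)
```

With real data the columns are only approximately equal, so the contrast is softer, but the tendency is the same.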

8. Visualizing what α does

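import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Expand and scale the training features once so every α sees the same inputs
# (degree 5 is an arbitrary choice that gives several coefficients to plot)
X_train_scaled = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False), StandardScaler()
).fit_transform(X_train)

alphas = np.logspace(-3, 4, 50)
coefs = []
for a in alphas:
    r = Ridge(alpha=a).fit(X_train_scaled, y_train)
    coefs.append(r.coef_)
coefs = np.array(coefs)          # shape: (n_alphas, n_features)

plt.semilogx(alphas, coefs)      # one line per coefficient
plt.xlabel("α"); plt.ylabel("coefficient")
plt.title("Ridge: coefficients shrink as α increases")
plt.show()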

You'll see all coefficients smoothly approach zero. For Lasso, you'd see some hit zero at specific α values.
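The Lasso behaviour can be checked numerically without a plot by counting non-zero coefficients at a few values of α. This is a sketch on synthetic data; the four α values are arbitrary, picked only to span a wide range.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)

alphas = [0.01, 1.0, 100.0, 10_000.0]
counts = []
for a in alphas:
    lasso = Lasso(alpha=a, max_iter=50_000).fit(X, y)
    counts.append(int(np.sum(lasso.coef_ != 0)))   # coefficients still active
    print(f"α={a:>8g} → {counts[-1]:2d} non-zero coefficients")
```

Once α exceeds the largest absolute correlation between any feature and the centered target, every coefficient is zero and only the intercept remains.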

9. Choosing α: always with CV

import numpy as np
from sklearn.linear_model import Ridge, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# RidgeCV uses efficient leave-one-out CV by default
RidgeCV(alphas=np.logspace(-4, 4, 50))

# LassoCV uses k-fold CV and builds its own grid of 100 α values by default
LassoCV(cv=5)

# Or use GridSearchCV / RandomizedSearchCV for custom ranges
grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("ridge", Ridge())]),
    param_grid={"ridge__alpha": np.logspace(-3, 3, 50)},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_train, y_train)
print(grid.best_params_)

10. Common pitfalls

❗ Forgetting to scale before regularization. The penalty depends on coefficient magnitudes; without scaling, features with bigger ranges are unfairly penalized less. Always pipe StandardScaler → Ridge/Lasso/ElasticNet.
❗ Hand-picking α without CV. α=1.0 is sklearn's default and a defensible starting point, but the optimal α can vary by 6+ orders of magnitude across problems. Always cross-validate.
❗ Using PolynomialFeatures with degree > 3 by default. Combinatorial feature explosion: the column count grows as C(n + d, d) − 1, so 20 features at degree 3 already produce 1,770 polynomial features.
❗ Trusting Lasso to keep "the right" features when columns are highly correlated. Lasso arbitrarily picks one of a correlated pair. ElasticNet is more stable.
❗ Mixing scaled and unscaled features. If only some features are scaled, the penalty disproportionately hits the unscaled ones.
❗ Forgetting include_bias=False on PolynomialFeatures. The default generates a column of 1s that is redundant with the model's intercept: harmless, but it clutters the coefficient list.
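❗ Comparing training R² across different α values during tuning. Training fit only gets worse as α grows, so it always favors the smallest α. Compare cross-validated scores instead (e.g. neg_root_mean_squared_error, neg_mean_absolute_error).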
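The feature explosion mentioned in the pitfalls can be computed exactly: the number of monomials of total degree at most d in n variables is C(n + d, d), including the constant term. The helper `n_poly_features` below is a hypothetical name for illustration; the last lines cross-check it against scikit-learn on a dummy row.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Number of monomials of total degree <= degree (hypothetical helper)."""
    total = comb(n_features + degree, degree)   # this count includes the constant term
    return total if include_bias else total - 1

for d in (2, 3, 4, 5):
    print(f"20 features, degree {d}: {n_poly_features(20, d):,} columns")

# Cross-check against sklearn on a dummy row of 20 features
pf = PolynomialFeatures(degree=3, include_bias=False).fit(np.zeros((1, 20)))
print(pf.n_output_features_)  # 1770
```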

11. When to use what

| Model | When |
|---|---|
| Plain `LinearRegression` | Baseline. Always start here. |
| `Ridge(alpha=...)` | Default with regularization. Many correlated features. |
| `Lasso(alpha=...)` | Want sparse feature selection. Hundreds+ features, most irrelevant. |
| `ElasticNet(alpha, l1_ratio)` | Want sparsity but stable with correlated features. |
| `PolynomialFeatures(degree=2-3)` + regularized linear | Non-linear relationship, want interpretable model. |
| Switch to trees/GBM | Many interactions, can't enumerate them by hand. |

12. Cheatsheet

from sklearn.linear_model import (
    LinearRegression,
    Ridge, RidgeCV,
    Lasso, LassoCV,
    ElasticNet, ElasticNetCV,
)
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

# Polynomial features
PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Regularized pipeline (the canonical pattern)
pipe = Pipeline([
    ("poly",  PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Built-in CV variants — easiest way to tune α
RidgeCV(alphas=np.logspace(-4, 4, 50))         # generalized cross-validation (fast)
LassoCV(cv=5, max_iter=10_000)                 # 100 α values by default
ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)

# Custom grid search
grid = GridSearchCV(
    pipe,
    param_grid={
        "poly__degree":  [1, 2, 3],
        "model__alpha":  np.logspace(-3, 3, 30),
    },
    scoring="neg_root_mean_squared_error", cv=5,
).fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)

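# Inspect coefficients of the refit best pipeline
# (pipe itself was never fit; GridSearchCV fits clones of it)
best = grid.best_estimator_
coefs = best.named_steps["model"].coef_
feature_names = best.named_steps["poly"].get_feature_names_out()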
sorted_by_mag = sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True)
for name, coef in sorted_by_mag[:10]:
    print(f"{name:30s} {coef:+.3f}")

13. Q&A: recall test

Q: Difference between Ridge and Lasso? A: Ridge uses L2 penalty (Σβ²) — shrinks all coefficients smoothly but rarely to exactly zero. Lasso uses L1 penalty (Σ|β|) — shrinks AND sets some to exactly zero (feature selection).
Q: Why must features be scaled before regularization? A: Penalty is on coefficient magnitudes. Unscaled features have wildly different ranges → coefficients to match → penalty hits some unfairly. Scaling makes the penalty uniform.
Q: When does ElasticNet beat Lasso? A: When you have groups of correlated features. Lasso arbitrarily picks one from each group; ElasticNet (with l1_ratio < 1) tends to keep correlated features together with smaller coefficients.
Q: How do you tune α? A: Cross-validation. Use RidgeCV / LassoCV / ElasticNetCV for built-in efficient CV, or GridSearchCV over a log-spaced grid (np.logspace(-4, 4, 50)).
Q: Is polynomial degree a hyperparameter to tune? A: Yes. CV over degree ∈ {1, 2, 3} jointly with α. Higher degrees explode feature count and are rarely worth it.
Q: What happens to predictions as α → ∞? A: All coefficients → 0; the model degenerates to predicting the mean of y for any input.
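The last answer can be checked directly. The sketch below fits Ridge on made-up data with increasingly large α (the values are illustrative); because the intercept is not penalized, the predictions collapse to the mean of y rather than to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

for alpha in (1.0, 1e3, 1e12):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"α={alpha:g}: max |coef| = {np.abs(model.coef_).max():.2e}, "
          f"intercept = {model.intercept_:.3f}")

# With an extreme α the model predicts (almost exactly) mean(y) everywhere
huge = Ridge(alpha=1e12).fit(X, y)
print(huge.predict(X[:3]), y.mean())
```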

Practice

What does this print?

Expected: 2

from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[2]])
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X).shape[1])
# One input feature at degree 2 without the bias column gives x and x², so 2 columns.

Use Ridge regression instead of LinearRegression to handle multicollinearity

Expected: True

from sklearn.linear_model import LinearRegression, Ridge
import numpy as np
np.random.seed(0)
X = np.random.randn(20, 3)
X[:, 2] = X[:, 0] + 0.001 * np.random.randn(20)   # near-duplicate of feature 0
y = X[:, 0] + np.random.randn(20) * 0.1
model = LinearRegression().fit(X, y)              # bug: coefficients become huge/unstable
print(abs(model.coef_).max() < 100)
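One possible fix, as a sketch: swap in Ridge and compare its coefficients with the unregularized ones. The value alpha=1.0 is an illustrative choice, not a tuned one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

np.random.seed(0)
X = np.random.randn(20, 3)
X[:, 2] = X[:, 0] + 0.001 * np.random.randn(20)   # near-duplicate of feature 0
y = X[:, 0] + np.random.randn(20) * 0.1

ols = LinearRegression().fit(X, y)                # unstable split between columns 0 and 2
ridge = Ridge(alpha=1.0).fit(X, y)                # L2 penalty keeps the coefficients bounded

print("OLS coefs:  ", ols.coef_)
print("Ridge coefs:", ridge.coef_)
print(abs(ridge.coef_).max() < 100)               # True
```

Ridge tends to share the weight between the two near-duplicate columns instead of letting them cancel each other with large opposite-signed values.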

Quiz: Quick check

What you remember

Q1. What does L2 regularization (Ridge) do?

Adds α × Σ(βᵢ²) to the loss — shrinks coefficients toward 0 but rarely makes them exactly 0
Removes features completely
Increases the learning rate
Adds noise to the data

Why: L2 keeps all features but with smaller coefficients — stabilizes against multicollinearity. L1 (Lasso) can drive coefficients exactly to 0, effectively doing feature selection.

Q2. When does polynomial regression overfit?

When degree is 1
As degree increases, the model fits training noise perfectly but generalizes poorly
Never
Only with small datasets

Why: A degree-20 polynomial through 20 training points fits perfectly — and oscillates wildly between them. The training R² approaches 1; the test R² collapses. Use cross-validation to pick the right degree.

Q3. Why does L1 (Lasso) regularization do feature selection?

The L1 penalty creates corners at zero in the loss surface, so the optimum often sits exactly at coefficient=0 for unimportant features
It deletes columns
It uses k-NN under the hood
It's faster

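Why: Geometrically, the L1 "diamond" constraint tends to touch the loss contours at its corners, which lie on the axes. Mathematically, the subdifferential of |β| at 0 is the whole interval [−1, 1], so the optimality condition can hold at exactly β = 0 whenever the loss gradient there is no larger than α in magnitude; the optimizer then parks the coefficient at zero.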
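For a single coefficient the effect has a closed form. Assuming a simplified one-dimensional problem with unpenalized estimate z (a textbook simplification, not what sklearn's solver does internally), the L1 solution is soft-thresholding and the L2 solution is proportional shrinkage. The function names below are hypothetical.

```python
import numpy as np

def lasso_1d(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + alpha*|b|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_1d(z, alpha):
    """Minimizer of 0.5*(b - z)**2 + 0.5*alpha*b**2: proportional shrinkage."""
    return z / (1.0 + alpha)

z = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(lasso_1d(z, 0.5))  # entries with |z| <= 0.5 become exactly 0
print(ridge_1d(z, 0.5))  # everything shrinks, but only z = 0 stays at 0
```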

Common doubts

Ridge or Lasso — which should I try first?

Try Ridge first — it's more stable and rarely worse. Try Lasso if you want automatic feature selection or have many features and suspect most are irrelevant. ElasticNet combines both — good default when unsure.

How do I choose α (the regularization strength)?

Cross-validation. Use RidgeCV or LassoCV, which try a grid of α values automatically. Common range: 10^-4 to 10^4 on a log scale. Higher α = more regularization = simpler model.

Why does my model do worse with regularization?

Three common causes: (1) features aren't scaled — regularization penalizes large coefficients but doesn't know that "salary" is in a different scale than "age"; (2) α too high — over-regularized, underfit; (3) the model wasn't overfitting to begin with — regularization can only help if there's overfitting to fix.
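The first cause is easy to reproduce. In this sketch (made-up data, α = 10 chosen for illustration) one column is re-expressed in units that make its values 1000 times larger. Unscaled Ridge would be expected to give different predictions for the two versions, while a pipeline with StandardScaler should be unaffected by the change of units.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Same information, but column 0 now has values 1000x larger
X_rescaled = X * np.array([1000.0, 1.0])

plain_a = Ridge(alpha=10.0).fit(X, y).predict(X)
plain_b = Ridge(alpha=10.0).fit(X_rescaled, y).predict(X_rescaled)

scaled_a = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y).predict(X)
scaled_b = (
    make_pipeline(StandardScaler(), Ridge(alpha=10.0))
    .fit(X_rescaled, y)
    .predict(X_rescaled)
)

print("unscaled predictions agree:", np.allclose(plain_a, plain_b))
print("scaled predictions agree:  ", np.allclose(scaled_a, scaled_b))
```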