Polynomial Regression & Regularization

1. Why this matters

Plain linear regression has two failure modes:

  1. Underfit — the relationship isn't linear. Adding polynomial features fixes this.
  2. Overfit: too many features (especially polynomial ones) let coefficients explode, so training error is low but test error is high. Regularization fixes this.

Together: polynomial features + regularization = a flexible, well-behaved linear model.

2. Mental model

Polynomial features turn a curve into a hyperplane in a higher-dimensional space:

flowchart LR
    A[Curved relationship<br/>y = f x non-linear] --> B[Add x², x³, x1·x2 features]
    B --> C[Linear regression on enriched X]
    C --> D[Effectively fits a curve in original space]

Regularization adds a penalty so coefficients can't grow unboundedly:

Loss_OLS    = mean((y - ŷ)²)
Loss_Ridge  = mean((y - ŷ)²) + α · Σ βᵢ²              ← L2 penalty
Loss_Lasso  = mean((y - ŷ)²) + α · Σ |βᵢ|             ← L1 penalty
Loss_Elastic= mean((y - ŷ)²) + α·(r·Σ|βᵢ| + (1-r)·Σβᵢ²)

Higher α → stronger penalty → more shrinkage.
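To make the penalty terms concrete, here is a minimal NumPy sketch that evaluates each of them for a single coefficient vector. The vector beta and the values of alpha and r are made up purely for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficient vector and settings, chosen only for illustration
beta = np.array([3.0, -4.0, 0.0])
alpha = 0.1
r = 0.5  # mixing weight between the L1 and L2 terms

l2_penalty = alpha * np.sum(beta ** 2)       # Ridge term: α · Σ βᵢ²
l1_penalty = alpha * np.sum(np.abs(beta))    # Lasso term: α · Σ |βᵢ|
elastic_penalty = alpha * (
    r * np.sum(np.abs(beta)) + (1 - r) * np.sum(beta ** 2)
)

print(f"L2 penalty:      {l2_penalty:.2f}")
print(f"L1 penalty:      {l1_penalty:.2f}")
print(f"Elastic penalty: {elastic_penalty:.2f}")
```

Note how the L2 term is dominated by the largest coefficient (16 of the 25 in Σ βᵢ² comes from -4), while the L1 term grows only linearly with each coefficient.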

3. Polynomial Regression

Use sklearn's PolynomialFeatures to add x², x³, x₁·x₂, ... then fit any linear model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Fake data: y = 2x² - 3x + 5 + noise
np.random.seed(42)
X = np.random.uniform(-3, 3, 100).reshape(-1, 1)
y = 2 * X.ravel()**2 - 3 * X.ravel() + 5 + np.random.normal(0, 2, 100)

# Plain linear underfits a parabola
plain = LinearRegression().fit(X, y)
print("Plain linear R²:", plain.score(X, y))           # mediocre: a straight line cannot capture the x² term

# Polynomial degree 2 matches the true functional form
poly_pipe = Pipeline([
    ("poly",  PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
])
poly_pipe.fit(X, y)
print("Poly degree 2 R²:", poly_pipe.score(X, y))      # much higher, around 0.9 (the noise keeps it below 1)

Key knob: degree. Higher = more flexible = more prone to overfit. In practice degree 2 or 3 is usually enough; treat anything higher as something to justify with cross-validation.

PolynomialFeatures(degree=d, interaction_only=True) keeps only cross-terms (x₁·x₂), skipping pure powers (x₁², x₂²).

4. The overfitting problem (motivation for regularization)

from sklearn.model_selection import train_test_split

# Hold out a test set so overfitting becomes visible (reuses X, y from section 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for d in [1, 2, 5, 10]:
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("scale", StandardScaler()),
        ("lr", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    print(f"degree {d:2d} → train R²={pipe.score(X_train,y_train):.3f}  test R²={pipe.score(X_test,y_test):.3f}")

Typical output: train R² keeps climbing with the degree, while test R² peaks around degree 2-3 and then degrades. Classic overfit.

5. Ridge Regression (L2)

Adds α · Σ βᵢ² to the loss. Shrinks all coefficients toward zero but rarely makes them exactly zero.

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Manual α
ridge = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
]).fit(X_train, y_train)

# Built-in CV across α grid
ridge_cv = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])),
]).fit(X_train, y_train)

print("Best α:", ridge_cv.named_steps["ridge"].alpha_)
print("R² test:", ridge_cv.score(X_test, y_test))

Use Ridge when:

  • Many features, possibly correlated.
  • You don't need feature selection.
  • Default when in doubt.
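The shrinkage can also be checked against the closed-form ridge solution. The sketch below uses synthetic data invented for the purpose and assumes no intercept, so that the formula applies directly. Note that scikit-learn's Ridge minimizes the sum (not the mean) of squared errors plus α · Σ βᵢ², which is why α enters the formula unscaled.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 10.0

# Closed form for the sum-of-squares objective: (XᵀX + αI)⁻¹ Xᵀy
beta_closed = np.linalg.solve(
    X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo
)

# fit_intercept=False so the fitted model matches the formula above
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

# Unpenalized least squares for comparison
beta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("closed form:", beta_closed)
print("sklearn:    ", beta_sklearn)
print("OLS:        ", beta_ols)
```

The αI term is what stabilizes the solve when columns of X are nearly collinear, and it is also what pulls the coefficient vector toward zero relative to OLS.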

6. Lasso Regression (L1)

Adds α · Σ |βᵢ| to the loss. The absolute-value penalty can push coefficients to exactly zero, which gives automatic feature selection.

from sklearn.linear_model import Lasso, LassoCV

lasso_cv = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, max_iter=10_000)),   # α grid is chosen automatically
]).fit(X_train, y_train)

print("Best α:", lasso_cv.named_steps["lasso"].alpha_)
print("# non-zero coefs:", (lasso_cv.named_steps["lasso"].coef_ != 0).sum())

Use Lasso when:

  • You suspect many features are irrelevant.
  • You want a sparse, interpretable model.
  • Combining feature selection + regression in one step.
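To see the sparsity in action, the following sketch fits Lasso and Ridge on synthetic data where only two of ten features matter. The data, the true coefficients (3 and -2), and the choice alpha=0.5 are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features influence y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are not exactly zero
n_lasso = int(np.count_nonzero(lasso.coef_))
n_ridge = int(np.count_nonzero(ridge.coef_))

print("Lasso non-zero coefficients:", n_lasso)
print("Ridge non-zero coefficients:", n_ridge)
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

With a setup this clean, Lasso should keep only the two informative features, while Ridge leaves small non-zero weights on all of them. The surviving Lasso coefficients are also shrunk toward zero, which is the price paid for the sparsity.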

7. ElasticNet

Best of both — combines L1 and L2 penalties:

Loss = MSE + α · ( r·Σ|βᵢ| + (1-r)·Σβᵢ² )
                      ↑              ↑
                      L1 ratio      L2 ratio

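Here r is scikit-learn's l1_ratio: l1_ratio=0 → pure Ridge. l1_ratio=1 → pure Lasso. 0.5 → balanced. scikit-learn's exact objective scales the terms slightly differently (it halves the squared-error and L2 terms), but α and l1_ratio play the same roles.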

from sklearn.linear_model import ElasticNet, ElasticNetCV

en_cv = Pipeline([
    ("scale", StandardScaler()),
    ("en",    ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.99],
        cv=5, max_iter=10_000,      # α grid is chosen automatically per l1_ratio
    )),
]).fit(X_train, y_train)

print("Best α:", en_cv.named_steps["en"].alpha_)
print("Best l1_ratio:", en_cv.named_steps["en"].l1_ratio_)

Use ElasticNet when:

  • Many features, some correlated (Lasso alone can arbitrarily pick one of a correlated pair; ElasticNet is more stable).
  • You want feature selection but not as aggressive as pure Lasso.

8. Visualizing what α does

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Scale once up front so every α sees the same features
X_train_scaled = StandardScaler().fit_transform(X_train)

alphas = np.logspace(-3, 4, 50)
coefs = []
for a in alphas:
    r = Ridge(alpha=a).fit(X_train_scaled, y_train)
    coefs.append(r.coef_)
coefs = np.array(coefs)

plt.semilogx(alphas, coefs)
plt.xlabel("α"); plt.ylabel("coefficient")
plt.title("Ridge: coefficients shrink as α increases")
plt.show()

You'll see all coefficients smoothly approach zero. For Lasso, you'd see some hit zero at specific α values.
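For the Lasso side of that comparison, one option is scikit-learn's lasso_path, which returns the coefficients for a whole grid of α values at once. The sketch below uses synthetic data (five features, two of them irrelevant) and simply counts non-zero coefficients per α instead of plotting; the data and the α grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic data: the last two features have no effect on y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# lasso_path fits no intercept, so standardize X and center y first
X_scaled = StandardScaler().fit_transform(X_demo)
y_centered = y_demo - y_demo.mean()

# The returned alphas are ordered from largest to smallest
path_alphas, path_coefs, _ = lasso_path(
    X_scaled, y_centered, alphas=np.logspace(-3, 1, 9)
)

# path_coefs has shape (n_features, n_alphas)
n_nonzero = (path_coefs != 0).sum(axis=0)
for a, k in zip(path_alphas, n_nonzero):
    print(f"alpha={a:8.3f}  non-zero coefficients: {k}")
```

Reading the printout from top to bottom, features enter the model one by one as α decreases, which is the step-like behavior the Ridge plot does not show.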

9. Choosing α — always with CV

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# RidgeCV uses efficient leave-one-out CV by default
RidgeCV(alphas=np.logspace(-4, 4, 50))

# LassoCV uses k-fold CV over an automatically chosen path of 100 α values
LassoCV(cv=5)

# Or use GridSearchCV / RandomizedSearchCV for custom ranges
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("ridge", Ridge())]),
    param_grid={"ridge__alpha": np.logspace(-3, 3, 50)},
    scoring="neg_root_mean_squared_error",
    cv=5,
).fit(X_train, y_train)
print(grid.best_params_)

10. Common pitfalls

  • Forgetting to scale before regularization. The penalty depends on coefficient magnitudes; without scaling, features with bigger ranges get smaller coefficients and are therefore penalized less. Always pipe StandardScaler → Ridge/Lasso/ElasticNet.
  • Hand-picking α without CV. The default α=1.0 is a reasonable starting point, but the best α can differ by several orders of magnitude from one problem to the next. Always tune it with CV.
  • Using PolynomialFeatures with degree > 3 by default. Combinatorial feature explosion. With 20 features at degree 3 you get 1,770 polynomial features (1,771 with the bias column).
  • Trusting Lasso to keep "the right" features when columns are highly correlated. Lasso arbitrarily picks one of a correlated pair. ElasticNet is more stable.
  • Mixing scaled and unscaled features. If only some features are scaled, the penalty disproportionately hits the unscaled ones.
  • Forgetting include_bias=False on PolynomialFeatures. The default include_bias=True generates a column of 1s that is redundant with the model's intercept: harmless, but it clutters the feature list.
  • Comparing training R² across different α values during tuning. Training fit can only get worse as α grows, so it always favors the smallest α. Compare cross-validated scores instead (neg_root_mean_squared_error, neg_mean_absolute_error).
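The feature-count explosion from the list above is easy to verify. This sketch counts the output columns of PolynomialFeatures for 20 input features and compares them with the combinatorial formula C(n + d, d) - 1 (the minus one drops the bias column); the all-zeros input is only a placeholder to let the transformer see the shape.

```python
import math
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 20
placeholder = np.zeros((1, n_features))  # values are irrelevant, only the shape matters

counts = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(placeholder)
    counts[degree] = poly.n_output_features_
    formula = math.comb(n_features + degree, degree) - 1
    print(f"degree {degree}: {counts[degree]} features (formula: {formula})")
```

Going from degree 3 to degree 4 multiplies the column count by roughly six, which is why higher degrees are rarely practical without strong regularization or feature selection.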

11. When to use what

Model | When
--- | ---
Plain LinearRegression | Baseline. Always start here.
Ridge(alpha=...) | Default with regularization. Many correlated features.
Lasso(alpha=...) | Want sparse feature selection. Hundreds+ features, most irrelevant.
ElasticNet(alpha, l1_ratio) | Want sparsity but stable with correlated features.
PolynomialFeatures(degree=2-3) + regularized linear | Non-linear relationship, want interpretable model.
Switch to trees/GBM | Many interactions, can't enumerate them by hand.

12. Cheatsheet

from sklearn.linear_model import (
    LinearRegression,
    Ridge, RidgeCV,
    Lasso, LassoCV,
    ElasticNet, ElasticNetCV,
)
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

# Polynomial features
PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Regularized pipeline (the canonical pattern)
pipe = Pipeline([
    ("poly",  PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Built-in CV variants — easiest way to tune α
RidgeCV(alphas=np.logspace(-4, 4, 50))         # efficient leave-one-out CV by default (fast)
LassoCV(cv=5, n_alphas=100, max_iter=10_000)
ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)

# Custom grid search
grid = GridSearchCV(
    pipe,
    param_grid={
        "poly__degree":  [1, 2, 3],
        "model__alpha":  np.logspace(-3, 3, 30),
    },
    scoring="neg_root_mean_squared_error", cv=5,
).fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)

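# Inspect coefficients of the best pipeline found by the grid search
# (pipe itself is never fitted; GridSearchCV fits clones and refits the winner)
best = grid.best_estimator_
coefs = best.named_steps["model"].coef_
feature_names = best.named_steps["poly"].get_feature_names_out()
sorted_by_mag = sorted(zip(feature_names, coefs), key=lambda x: abs(x[1]), reverse=True)
for name, coef in sorted_by_mag[:10]:
    print(f"{name:30s} {coef:+.3f}")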

13. Q&A — recall test

  • Q: Difference between Ridge and Lasso? A: Ridge uses L2 penalty (Σβ²) — shrinks all coefficients smoothly but rarely to exactly zero. Lasso uses L1 penalty (Σ|β|) — shrinks AND sets some to exactly zero (feature selection).

  • Q: Why must features be scaled before regularization? A: Penalty is on coefficient magnitudes. Unscaled features have wildly different ranges → coefficients to match → penalty hits some unfairly. Scaling makes the penalty uniform.

  • Q: When does ElasticNet beat Lasso? A: When you have groups of correlated features. Lasso arbitrarily picks one from each group; ElasticNet (with l1_ratio < 1) tends to keep correlated features together with smaller coefficients.

  • Q: How do you tune α? A: Cross-validation. Use RidgeCV / LassoCV / ElasticNetCV for built-in efficient CV, or GridSearchCV over a log-spaced grid (np.logspace(-4, 4, 50)).

  • Q: Is polynomial degree a hyperparameter to tune? A: Yes. CV over degree ∈ {1, 2, 3} jointly with α. Higher degrees explode feature count and are rarely worth it.

  • Q: What happens to predictions as α → ∞? A: All coefficients → 0; because the intercept is not penalized, the model degenerates to predicting the mean of y for any input.
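The last answer can be checked numerically. The sketch below fits Ridge with an extremely large α on synthetic data (the data and the value 1e12 are arbitrary choices for illustration) and compares the predictions with the mean of y.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with a clear linear signal and a non-zero mean
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# A penalty so strong that the slope coefficients are crushed to (almost) zero
huge = Ridge(alpha=1e12).fit(X_demo, y_demo)
preds = huge.predict(X_demo)

print("largest |coefficient|:", np.abs(huge.coef_).max())
print("prediction range:", preds.min(), preds.max())
print("mean of y:", y_demo.mean())
```

Only the unpenalized intercept survives, so every prediction collapses onto the training mean of y.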

Practice

What does this print?

Expected: 2

from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[2]])
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X).shape[1])
# One input feature at degree 2 with include_bias=False yields two columns: x and x²

Use Ridge regression instead of LinearRegression to handle multicollinearity

Expected: True

from sklearn.linear_model import LinearRegression, Ridge
import numpy as np
np.random.seed(0)
X = np.random.randn(20, 3)
X[:, 2] = X[:, 0] + 0.001 * np.random.randn(20)   # near-duplicate of feature 0
y = X[:, 0] + np.random.randn(20) * 0.1
model = LinearRegression().fit(X, y)              # bug: coefficients become huge/unstable
print(abs(model.coef_).max() < 100)

Quiz — Quick check

What you remember

Q1. What does L2 regularization (Ridge) do?

  • Adds α × Σ(βᵢ²) to the loss — shrinks coefficients toward 0 but rarely makes them exactly 0
  • Removes features completely
  • Increases the learning rate
  • Adds noise to the data

Why: L2 keeps all features but with smaller coefficients — stabilizes against multicollinearity. L1 (Lasso) can drive coefficients exactly to 0, effectively doing feature selection.

Q2. When does polynomial regression overfit?

  • When degree is 1
  • As degree increases, the model fits training noise perfectly but generalizes poorly
  • Never
  • Only with small datasets

Why: A degree-20 polynomial through 20 training points fits perfectly — and oscillates wildly between them. The training R² approaches 1; the test R² collapses. Use cross-validation to pick the right degree.

Q3. Why does L1 (Lasso) regularization do feature selection?

  • The L1 penalty creates corners at zero in the loss surface, so the optimum often sits exactly at coefficient=0 for unimportant features
  • It deletes columns
  • It uses k-NN under the hood
  • It's faster

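Why: Geometrically, the L1 "diamond" constraint region has corners on the axes, and the loss contours often first touch it at a corner, where some coefficients are exactly 0. Mathematically, |β| is not differentiable at 0: its subgradient there is the whole interval [-1, 1], so β = 0 satisfies the optimality condition whenever the gradient of the data-fit term is no larger than α in magnitude, and the optimizer parks the coefficient there.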
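The "parking at zero" behavior shows up clearly in a one-feature simplification. Assuming a single standardized feature whose unpenalized coefficient would be z, minimizing ½(z - β)² + α|β| gives the soft-thresholding rule below, while the Ridge analogue ½(z - β)² + (α/2)β² only rescales z. Both function names are hypothetical helpers written for this sketch, not scikit-learn APIs.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Lasso update for one coefficient: values within [-alpha, alpha] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_shrink(z, alpha):
    """Ridge counterpart in the same setup: rescales z but never reaches 0 for z != 0."""
    return z / (1.0 + alpha)

# Illustrative unpenalized coefficients: strong, weak, and moderately negative
z = np.array([3.0, 0.5, -2.0])

lasso_out = soft_threshold(z, alpha=1.0)
ridge_out = ridge_shrink(z, alpha=1.0)

print("unpenalized:", z)
print("lasso:      ", lasso_out)
print("ridge:      ", ridge_out)
```

The weak coefficient (0.5) falls inside the threshold and is set to exactly zero by the Lasso rule, whereas Ridge merely halves it.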

Common doubts

Ridge or Lasso — which should I try first?

Try Ridge first — it's more stable and rarely worse. Try Lasso if you want automatic feature selection or have many features and suspect most are irrelevant. ElasticNet combines both — good default when unsure.

How do I choose α (the regularization strength)?

Cross-validation. Use RidgeCV or LassoCV which try a grid of α values automatically. Common range: 10^-4 to 10^4 on a log scale. Higher α = more regularization = simpler model.

Why does my model do worse with regularization?

Three common causes: (1) features aren't scaled — regularization penalizes large coefficients but doesn't know that "salary" is in a different scale than "age"; (2) α too high — over-regularized, underfit; (3) the model wasn't overfitting to begin with — regularization can only help if there's overfitting to fix.
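Cause (1) can be illustrated with a single feature expressed in two different units. The sketch below uses synthetic data and an arbitrary α=100; the only difference between the two Ridge fits is that the feature is multiplied by 1000, as if it were recorded in meters instead of kilometers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with a true slope of 2
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y_demo = 2.0 * x.ravel() + rng.normal(scale=0.1, size=100)

x_big = x * 1000.0  # same information, expressed in units 1000 times smaller

ols_slope = LinearRegression().fit(x, y_demo).coef_[0]
slope_small_range = Ridge(alpha=100.0).fit(x, y_demo).coef_[0]
# Convert the coefficient back to the original units for a fair comparison
slope_big_range = Ridge(alpha=100.0).fit(x_big, y_demo).coef_[0] * 1000.0

print("OLS slope:                    ", ols_slope)
print("Ridge slope, small-range x:   ", slope_small_range)
print("Ridge slope, large-range x:   ", slope_big_range)
```

The same α shrinks the small-range version heavily and leaves the large-range version almost untouched, so without scaling the penalty strength depends on measurement units rather than on how useful a feature is.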