Gradient Descent¶
Gradient descent, or a close relative of it, is what sklearn (and almost every ML library) uses behind the scenes to find the best coefficients when there's no shortcut formula.
The intuition¶
Picture the loss (how wrong the model is) as a hilly landscape. The goal: find the lowest point.
You start somewhere random. At each step, you look at the slope under your feet and take a small step downhill. Repeat until you can't go any lower.
The recipe¶
1. Initialize β to zeros (or randomly).
2. Compute predictions: ŷ = X · β
3. Compute the gradient (slope of the loss curve at current β).
4. Update: β ← β − η · gradient (η is the "learning rate")
5. Repeat steps 2-4 until convergence.
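Step 3 leaves the gradient abstract. Assuming the loss is mean squared error (the choice the runnable example in this tutorial makes), the gradient works out to (2/n) · Xᵀ(Xβ − y). The sketch below compares that formula against a finite-difference estimate; the helper names and the small synthetic dataset are illustrative assumptions, not part of any library.

```python
import numpy as np

def mse_loss(X, y, beta):
    """Mean squared error of the linear model X @ beta."""
    residual = X @ beta - y
    return np.mean(residual ** 2)

def mse_gradient(X, y, beta):
    """Analytic gradient of the MSE with respect to beta."""
    n = len(y)
    return (2 / n) * X.T @ (X @ beta - y)

def numerical_gradient(X, y, beta, eps=1e-6):
    """Central finite-difference estimate, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (mse_loss(X, y, beta + step) - mse_loss(X, y, beta - step)) / (2 * eps)
    return grad

# Illustrative data: y = 3x + 2 + noise, with a column of 1s for the intercept
rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.uniform(0, 10, 20)]
y = 3 * X[:, 1] + 2 + rng.normal(0, 1, 20)
beta = np.array([0.5, -1.0])  # arbitrary point at which to check the gradient

analytic = mse_gradient(X, y, beta)
numeric = numerical_gradient(X, y, beta)
print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, rtol=1e-4))
```

A check like this is a cheap way to catch sign or scaling mistakes before running the full loop.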
- A small η → tiny steps, slow but stable.
- A big η → giant steps, fast but can overshoot the minimum.
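To make the effect of η concrete, here is a minimal sketch on the one-dimensional loss f(b) = b², whose gradient is 2b. The function name `run_gd` and the three rates are chosen purely for illustration.

```python
def run_gd(eta, steps=20, b=5.0):
    """Minimize f(b) = b**2 with plain gradient descent; return the path of b."""
    path = [b]
    for _ in range(steps):
        grad = 2 * b          # derivative of b**2
        b = b - eta * grad    # step against the gradient
        path.append(b)
    return path

for eta in (0.01, 0.4, 1.1):
    path = run_gd(eta)
    print(f"eta={eta:<4} | b after 20 steps = {path[-1]: .4g}")
```

Each update multiplies b by (1 − 2η). With η = 0.01 that factor is 0.98, so b creeps toward 0; with η = 0.4 it is 0.2, so b collapses quickly; with η = 1.1 it is −1.2, so b flips sign and grows on every step.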
Manual gradient descent — runnable¶
Below is the entire algorithm in ~15 lines of NumPy. Run it and watch the loss drop as the steps go by (it prints every 10th step).
import numpy as np
# Fake data: y = 3x + 2 + noise
np.random.seed(0)
X = np.random.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 50)
# Add a column of 1s so β₀ (intercept) becomes part of β
X1 = np.c_[np.ones(len(X)), X]
beta = np.zeros(2) # [intercept, slope]
lr = 0.01
n = len(X)
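for step in range(50):
    y_pred = X1 @ beta                      # predictions at the current β
    grad = (2 / n) * X1.T @ (y_pred - y)    # gradient of the mean squared error
    beta -= lr * grad                       # step against the gradient
    loss = np.mean((y_pred - y) ** 2)       # loss measured before this update
    if step % 10 == 0:
        print(f"step {step:3d} | loss={loss:6.2f} | β₀={beta[0]:.2f} β₁={beta[1]:.2f}")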
print("\nFinal:", beta)
print("Target was approximately [2, 3]")
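You should see β₁ settle near 3 within the first few dozen steps as the loss shrinks. β₀ moves toward 2 much more slowly, because the unscaled feature stretches the loss surface; after only 50 steps it will still be well short of 2, so raise the step count to watch it get closer.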
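Step 5 of the recipe says to repeat until convergence, while the loop above runs a fixed 50 steps. One possible stopping rule, sketched here as an assumption rather than the only option, is to stop once the gradient norm falls below a tolerance. The function name, `tol`, and `max_steps` are illustrative.

```python
import numpy as np

def gradient_descent(X1, y, lr=0.01, tol=1e-8, max_steps=100_000):
    """Batch gradient descent on the MSE until the gradient is nearly zero.

    Returns the fitted beta and the number of steps taken.
    """
    n = len(y)
    beta = np.zeros(X1.shape[1])
    for step in range(1, max_steps + 1):
        grad = (2 / n) * X1.T @ (X1 @ beta - y)
        if np.linalg.norm(grad) < tol:   # illustrative stopping rule
            return beta, step
        beta -= lr * grad
    return beta, max_steps

# Same synthetic data as the example above
np.random.seed(0)
X = np.random.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 50)
X1 = np.c_[np.ones(len(X)), X]

beta, steps = gradient_descent(X1, y)
print(f"stopped after {steps} steps, beta = {beta}")
```

Expect this to need far more than 50 steps on unscaled data; the `max_steps` cap is there so a badly chosen learning rate cannot loop forever.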
When is gradient descent the right choice?¶
- ✅ Big datasets (millions of rows): the closed-form formula has to build XᵀX (roughly O(n·p²) work for n rows and p features) and then invert it (O(p³)), which gets slow at scale.
- ✅ Online / streaming learning: you can update one sample at a time.
- ✅ Any model that doesn't have a closed-form solution: neural networks, logistic regression, gradient boosting.
- ❌ Small datasets: closed-form (LinearRegression()) is exact and instant.
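The "one sample at a time" case can be sketched in plain NumPy. This is a minimal illustration, assuming a squared-error loss per sample and a feature that already lives in a small range; the simulated stream and `true_beta` are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([2.0, 3.0])   # [intercept, slope], illustrative values

beta = np.zeros(2)
lr = 0.01

# Pretend each loop iteration is a new sample arriving from a stream
for t in range(20_000):
    x = rng.uniform(-1, 1)                        # one feature value
    y = true_beta[0] + true_beta[1] * x + rng.normal(0, 0.1)
    x1 = np.array([1.0, x])                       # prepend 1 for the intercept
    error = x1 @ beta - y                         # prediction error on this sample
    beta -= lr * 2 * error * x1                   # gradient of error**2 for this sample

print("estimate:", beta)
```

No sample is ever stored, so memory use stays constant no matter how long the stream runs. The estimate keeps jittering slightly around the target because each step sees only one noisy sample.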
Sklearn's SGD version¶
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
X = np.random.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 100)
# Always scale before SGD
X_scaled = StandardScaler().fit_transform(X)
sgd = SGDRegressor(
    learning_rate="invscaling",  # auto-decaying learning rate
    eta0=0.01,
    max_iter=200,
    random_state=42,
).fit(X_scaled, y)
print("coef:", sgd.coef_)
print("intercept:", sgd.intercept_)
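Because the model above is fit on standardized features, `coef_` and `intercept_` are expressed in scaled units and will not read as roughly 3 and 2. The sketch below, which assumes the same data and hyperparameters, converts them back using the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = np.random.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 100)

scaler = StandardScaler().fit(X)
sgd = SGDRegressor(
    learning_rate="invscaling",
    eta0=0.01,
    max_iter=200,
    random_state=42,
).fit(scaler.transform(X), y)

# Undo the scaling: x_scaled = (x - mean) / scale
slope = sgd.coef_ / scaler.scale_
intercept = sgd.intercept_ - np.sum(sgd.coef_ * scaler.mean_ / scaler.scale_)

print("slope in original units:", slope)
print("intercept in original units:", intercept)
```

The recovered values should land near the slope and intercept used to generate the data, though SGD is noisy and stops at a tolerance, so they will not match exactly.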
What you learned¶
- Gradient descent iteratively updates β by walking downhill on the loss surface.
- The learning rate η is the most important knob.
- SGDRegressor is sklearn's gradient-descent-based regressor.
- Always scale your features before SGD. Without scaling, the loss landscape is stretched and convergence is poor.
Practice¶
Set a small enough learning rate so the loss decreases (currently it diverges)
Expected: True
Quiz — Quick check¶
What you remember
Q1. What does gradient descent compute at each step?
- A random update
- The gradient (direction of steepest increase) and steps in the OPPOSITE direction
- The exact minimum
- The mean of the data
Why: Negative gradient points "downhill" on the loss surface. We take a step in that direction, scaled by the learning rate.
Q2. What happens with a learning rate that's too LARGE?
- Faster convergence
- The optimizer oscillates or diverges (loss grows instead of shrinks)
- No effect
- Always finds the global minimum
Why: Imagine standing on one side of a valley and taking a huge step: you overshoot to the other side. Repeat, and you get oscillation or divergence. Fix: use a smaller learning rate, or an adaptive optimizer (Adam, AdaGrad) that scales the step size for each parameter.
Q3. Why is gradient descent the standard optimization algorithm in ML?
- Always finds the global minimum
- Works for high-dimensional problems where closed-form solutions don't exist or are too expensive
- Requires no math
- Faster than all alternatives
Why: Deep neural networks have millions of parameters and no closed-form solution. Gradient descent and its variants (SGD, Adam) are the practical optimizers at that scale.
Common doubts¶
How do I choose the learning rate?
Start with 0.01 or 0.001. If the loss explodes, the rate is too high: divide by 10. If the loss decreases very slowly, the rate is too low: multiply by 10. For production, consider adaptive optimizers (Adam, AdamW) that adjust the step size per parameter, which usually makes training less sensitive to the exact learning rate.
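The "move by factors of 10" advice can be tried out with a small sweep. This is a rough sketch on synthetic data, assuming batch gradient descent on the MSE; `final_loss` is a hypothetical helper, and the candidate rates are arbitrary.

```python
import numpy as np

def final_loss(lr, X1, y, steps=100):
    """Run batch gradient descent on the MSE; return the last loss (inf if it blew up)."""
    n = len(y)
    beta = np.zeros(X1.shape[1])
    # Silence overflow warnings: diverging runs are expected in this sweep
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            beta -= lr * (2 / n) * X1.T @ (X1 @ beta - y)
        loss = np.mean((X1 @ beta - y) ** 2)
    return float(loss) if np.isfinite(loss) else float("inf")

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)
X1 = np.c_[np.ones_like(x), x]

for lr in (1.0, 0.1, 0.01, 0.001, 0.0001):
    print(f"lr={lr:<7} final loss={final_loss(lr, X1, y):.4g}")
```

On this unscaled data the largest rates should blow up and the smallest should barely make progress in 100 steps, which is the pattern the rule of thumb is built around.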
What's the difference between gradient descent and least squares?
Least squares solves linear regression with the closed-form formula β = (XᵀX)⁻¹ Xᵀy: an exact answer in one step, but only for linear models, and it gets expensive when there are many rows or features. Gradient descent works iteratively, handles arbitrary differentiable loss functions, and scales to massive models (it's how deep learning trains).
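For comparison, the closed-form route takes only a couple of lines in NumPy. The sketch below assumes the same kind of synthetic data used earlier and solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, then cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

np.random.seed(0)
X = np.random.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 50)
X1 = np.c_[np.ones(len(X)), X]   # column of 1s for the intercept

# Normal equations: solve (X^T X) beta = X^T y without computing the inverse
beta_normal = np.linalg.solve(X1.T @ X1, X1.T @ y)

# NumPy's least-squares solver, which is generally more robust numerically
beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Both give the exact least-squares fit with no learning rate to tune, which is why this route is preferred whenever the problem is linear and small enough.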
Why does sklearn use the closed form for LinearRegression but SGD for SGDRegressor?
LinearRegression is fast and exact for small/medium datasets. SGDRegressor is iterative (stochastic gradient descent): it handles datasets too large for the closed form and supports online learning, where you update the model as new data arrives.
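Online learning with SGDRegressor goes through its `partial_fit` method, and StandardScaler offers a `partial_fit` of its own for running statistics. The sketch below simulates a stream with randomly generated mini-batches; the batch count and batch size are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()
sgd = SGDRegressor(random_state=42)

# Simulate a stream: 200 mini-batches of 50 rows each (sizes are arbitrary)
for _ in range(200):
    X_batch = rng.uniform(0, 10, size=(50, 1))
    y_batch = 3 * X_batch.ravel() + 2 + rng.normal(0, 1, 50)
    scaler.partial_fit(X_batch)                           # update running mean/variance
    sgd.partial_fit(scaler.transform(X_batch), y_batch)   # one pass over this batch

X_new = np.array([[4.0]])
print("prediction at x=4:", sgd.predict(scaler.transform(X_new)))
```

Since the data follow y ≈ 3x + 2, the prediction at x = 4 should come out close to 14. Each batch can be discarded after its `partial_fit` call, so the full dataset never has to fit in memory.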