Gradient Descent

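Gradient descent is the iterative algorithm that sklearn's SGD estimators (and, in some variant, almost every ML library) use behind the scenes to find the best coefficients when there is no shortcut formula, or when that formula is too expensive to evaluate.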

The intuition

Picture the loss (how wrong the model is) as a hilly landscape. The goal: find the lowest point.

You start somewhere random. At each step, you look at the slope under your feet and take a small step downhill. Repeat until you can't go any lower.

loss
  |\
  | \         _
  |  \       / \
  |   \_   _/   \_
  |     \_/        \_ ← best point
  +-------------------> β

The recipe

1. Initialize β to zeros (or randomly).
2. Compute predictions:  ŷ = X · β
3. Compute the gradient (slope of the loss curve at current β).
4. Update:  β ← β − η · gradient    (η is the "learning rate")
5. Repeat steps 2-4 until convergence.
  • A small η → tiny steps, slow but stable.
  • A big η → giant steps, fast but can overshoot the minimum.
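To make the effect of η concrete, here is a minimal sketch on a one-dimensional loss f(w) = (w − 3)², whose gradient is 2(w − 3). The function name `run_gd`, the toy loss, and the three step sizes are illustrative assumptions, not part of the lesson's own code.

```python
def run_gd(lr, steps=20, w=0.0):
    """Minimize f(w) = (w - 3)**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)**2
        w = w - lr * grad       # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    w_final = run_gd(lr)
    print(f"lr={lr:<4} -> w after 20 steps = {w_final:.3f}")
```

With the smallest rate, w has covered only about a third of the distance to 3 after 20 steps. With 0.1 it is close to 3. With 1.1 every step overshoots by more than it corrects, so w moves further from 3 each time.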

Manual gradient descent — runnable

Below is the entire algorithm in ~15 lines of NumPy. Run it and watch the loss drop; it prints a progress line every 10 steps.

import numpy as np

# Fake data: y = 3x + 2 + noise
np.random.seed(0)
X = np.random.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 50)

# Add a column of 1s so β₀ (intercept) becomes part of β
X1 = np.c_[np.ones(len(X)), X]
beta = np.zeros(2)              # [intercept, slope]
lr = 0.01
n = len(X)

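for step in range(50):
    y_pred = X1 @ beta                      # recipe step 2: predictions
    loss = np.mean((y_pred - y) ** 2)       # MSE at the current β, before the update
    grad = (2 / n) * X1.T @ (y_pred - y)    # recipe step 3: gradient of the MSE w.r.t. β
    beta -= lr * grad                       # recipe step 4: move against the gradient
    if step % 10 == 0:
        print(f"step {step:3d} | loss={loss:6.2f} | β₀={beta[0]:.2f}  β₁={beta[1]:.2f}")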

print("\nFinal:", beta)
print("Target was approximately [2, 3]")

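You should see the loss shrink and β₁ settle close to 3 within the first few printouts. β₀ moves toward 2 much more slowly: x ranges over 0 to 10 and is not scaled, so the loss surface is steep along the slope and nearly flat along the intercept. After 50 steps β₀ is still well short of 2; raise the step count (on the order of a thousand) to watch it get there.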
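If you want to convince yourself that the gradient formula used in the loop, (2 / n) · Xᵀ(ŷ − y), really is the slope of the mean squared error, one common sanity check is to compare it against a finite-difference estimate. The sketch below assumes a small synthetic dataset and an arbitrary test point; the helper `mse` and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = np.c_[np.ones(20), rng.uniform(0, 10, 20)]   # column of 1s + one feature
y = 3 * X1[:, 1] + 2 + rng.normal(0, 1, 20)
beta = np.array([0.5, -1.0])                      # arbitrary test point

def mse(b):
    """Mean squared error of the linear model at coefficients b."""
    return np.mean((X1 @ b - y) ** 2)

# Analytic gradient: the same formula the training loop uses
analytic = (2 / len(y)) * X1.T @ (X1 @ beta - y)

# Numerical gradient via central differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(len(beta)):
    step = np.zeros_like(beta)
    step[j] = eps
    numeric[j] = (mse(beta + step) - mse(beta - step)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, rtol=1e-4))
```

The two vectors should agree to several digits. A mismatch usually means a sign error or a missing factor in the hand-written gradient.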

When is gradient descent the right choice?

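  • Big datasets (millions of rows, or thousands of features): the closed-form formula has to build XᵀX, which costs O(n·p²), and then solve a p×p system, which costs O(p³). Both get slow at scale, and the whole dataset has to be processed at once.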
  • Online / streaming learning — you can update one sample at a time.
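  • Any model that doesn't have a closed-form solution: neural networks and logistic regression, for example. (Gradient boosting borrows the same idea, taking gradient steps in function space rather than on a coefficient vector.)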
  • Not for small datasets: there the closed form (LinearRegression()) is exact and instant, so gradient descent buys you nothing.
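The "one sample at a time" point can be sketched in plain NumPy. The example below assumes a simulated stream drawn from y = 3x + 2 + noise with x in [0, 2] (a narrower range than the earlier example, chosen so that a fixed step size converges quickly); `sgd_update` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.zeros(2)          # [intercept, slope]
lr = 0.01

def sgd_update(beta, x, y, lr):
    """One SGD step on a single (x, y) pair for squared error."""
    features = np.array([1.0, x])            # prepend 1 for the intercept
    error = features @ beta - y
    return beta - lr * 2 * error * features  # gradient of error**2 w.r.t. beta

# Simulate a stream: samples arrive one at a time and are never stored
for _ in range(5000):
    x = rng.uniform(0, 2)
    y = 3 * x + 2 + rng.normal(0, 0.1)
    beta = sgd_update(beta, x, y, lr)

print("beta after stream:", beta)
```

Each update touches only the current sample, so memory use stays constant no matter how long the stream runs.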

Sklearn's SGD version

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(0)
X = np.random.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 100)

# Always scale before SGD
X_scaled = StandardScaler().fit_transform(X)

sgd = SGDRegressor(
    learning_rate="invscaling",  # auto-decaying learning rate
    eta0=0.01,
    max_iter=200,
    random_state=42,
).fit(X_scaled, y)

print("coef:", sgd.coef_)
print("intercept:", sgd.intercept_)
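Because the model is fit on standardized X, coef_ and intercept_ are expressed in scaled units and will not read as 3 and 2 directly. A minimal sketch of converting them back to the original units follows; it assumes the scaler is kept in a variable (here called `scaler`) so that its mean_ and scale_ attributes are available.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = np.random.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 100)

scaler = StandardScaler().fit(X)
sgd = SGDRegressor(learning_rate="invscaling", eta0=0.01,
                   max_iter=200, random_state=42).fit(scaler.transform(X), y)

# Undo the scaling: x_scaled = (x - mean) / scale
slope = sgd.coef_ / scaler.scale_
intercept = sgd.intercept_ - np.sum(sgd.coef_ * scaler.mean_ / scaler.scale_)

print("slope in original units:", slope)
print("intercept in original units:", intercept)
```

The converted values describe the same fitted line, just written in terms of the unscaled x, so they can be compared against the data-generating slope and intercept.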

What you learned

  • Gradient descent iteratively updates β by walking downhill on the loss surface.
  • The learning rate η is the most important knob.
  • SGDRegressor is sklearn's gradient-descent-based regressor.
  • Always scale your features before SGD. Without scaling, the loss landscape is stretched and convergence is poor.
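The scaling advice can be checked with plain batch gradient descent. The sketch below assumes synthetic data from y = 3x + 2 + noise and a hypothetical helper `gd_mse`; it runs the same number of steps on raw and standardized x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)

def gd_mse(x, y, lr, steps):
    """Batch gradient descent for y ≈ b0 + b1 * x; returns the final MSE."""
    X1 = np.c_[np.ones(len(x)), x]
    beta = np.zeros(2)
    for _ in range(steps):
        beta -= lr * (2 / len(y)) * X1.T @ (X1 @ beta - y)
    return np.mean((X1 @ beta - y) ** 2)

x_scaled = (x - x.mean()) / x.std()   # what StandardScaler does to one column

mse_raw_small = gd_mse(x, y, lr=0.01, steps=100)
mse_raw_big = gd_mse(x, y, lr=0.1, steps=100)
mse_scaled = gd_mse(x_scaled, y, lr=0.1, steps=100)

print("raw x,    lr=0.01:", mse_raw_small)
print("raw x,    lr=0.1: ", mse_raw_big)
print("scaled x, lr=0.1: ", mse_scaled)
```

On raw x the larger step size blows up, while the smaller one is stable but has not finished converging after 100 steps. On scaled x the larger step size is safe and reaches the noise floor.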

Practice

What does this print?

Expected: True

# Gradient descent updates: w_new = w - learning_rate * gradient
w = 5.0
grad = 2.0
lr = 0.1
w_new = w - lr * grad
print(w_new < w)

Set a small enough learning rate so the loss decreases (currently it diverges)

Expected: True

# We try to minimize f(w) = w², gradient = 2w
w = 1.0
lr = 5.0                            # bug: lr too high — w oscillates and grows
for _ in range(5):
    w = w - lr * (2 * w)
print(abs(w) < 10)                   # well-converged would be near 0

Quiz — Quick check

What you remember

Q1. What does gradient descent compute at each step?

  • A random update
  • The gradient (direction of steepest increase) and steps in the OPPOSITE direction
  • The exact minimum
  • The mean of the data

Why: The negative gradient points "downhill" on the loss surface. We take a step in that direction, scaled by the learning rate.

Q2. What happens with a learning rate that's too LARGE?

  • Faster convergence
  • The optimizer oscillates or diverges (loss grows instead of shrinks)
  • No effect
  • Always finds the global minimum

Why: Imagine standing on one side of a valley, taking a huge step — you overshoot to the other side. Repeat → oscillation or divergence. Fix: smaller lr, or use adaptive optimizers (Adam, AdaGrad) that auto-scale.

Q3. Why is gradient descent the standard optimization algorithm in ML?

  • Always finds the global minimum
  • Works for high-dimensional problems where closed-form solutions don't exist or are too expensive
  • Requires no math
  • Faster than all alternatives

Why: Deep neural networks have millions of parameters, and no closed-form solution exists for them. Gradient descent and its variants (SGD, Adam) are the standard practical optimizers at that scale.

Common doubts

How do I choose the learning rate?

Start with 0.01 or 0.001. If loss explodes → too high, divide by 10. If loss decreases slowly → too low, multiply by 10. When training neural networks, adaptive optimizers (Adam, AdamW) adjust the step size per parameter, which makes training less sensitive to the exact value you pick, although the base learning rate still needs some tuning.

What's the difference between gradient descent and least squares?

Least squares solves linear regression with a closed-form formula β = (XᵀX)⁻¹ Xᵀy: an exact answer in one step, but it only applies to linear models with squared-error loss, and it gets expensive when there are many features or the data doesn't fit in memory. Gradient descent works iteratively, handles any differentiable loss function, and scales to massive models (it's how deep learning trains).
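On a small linear problem the two approaches should land on the same coefficients, which is easy to verify. The sketch below assumes synthetic data from y = 3x + 2 + noise and uses np.linalg.lstsq for the direct solution (it solves the same least-squares problem without forming an explicit inverse); the step count and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)
X1 = np.c_[np.ones(len(x)), x]            # intercept column + feature

# Least squares in one call
beta_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Gradient descent on the same mean squared error
beta_gd = np.zeros(2)
for _ in range(5000):
    beta_gd -= 0.01 * (2 / len(y)) * X1.T @ (X1 @ beta_gd - y)

print("least squares:   ", beta_ls)
print("gradient descent:", beta_gd)
```

The direct solve needs one call; gradient descent needs thousands of cheap steps to reach the same answer. That trade only pays off when the direct solve is unavailable or too costly.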

Why does sklearn use the closed form for LinearRegression but SGD for SGDRegressor?

LinearRegression solves the least-squares problem directly with a linear-algebra solver rather than an iterative loop, which is fast and exact for small/medium datasets. SGDRegressor is iterative (stochastic gradient descent): it handles datasets too large for a direct solve, and supports online learning where you update the model as new data arrives.
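For the online-learning case, SGDRegressor exposes partial_fit, which updates the model from one batch without revisiting earlier data. A minimal sketch, assuming a simulated stream of 32-row batches whose feature is already on a roughly unit scale, and a constant learning rate chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

# Pretend each batch arrives from a stream; old batches are never seen again
for _ in range(200):
    X_batch = rng.uniform(-1, 1, size=(32, 1))
    y_batch = 3 * X_batch.ravel() + 2 + rng.normal(0, 0.1, 32)
    model.partial_fit(X_batch, y_batch)     # one pass over this batch only

print("coef:", model.coef_, "intercept:", model.intercept_)
```

With real data whose scale is unknown in advance, the scaler would also need to be updated incrementally, for example with StandardScaler's own partial_fit.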
