Types of Gradient Descent

The basic gradient descent recipe leaves one choice open:

How many samples do I look at before each β update?

The answer gives three flavors:

Type	Samples per step	Pros	Cons
Batch	All N	Stable, smooth path	Slow on big data
Stochastic (SGD)	1	Fast, can escape saddles	Noisy convergence
Mini-batch	32 – 512	Best of both — production default	Tune batch size
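
In symbols, all three flavors apply the same update and differ only in the batch $B$. The block below writes it for the mean-squared-error loss used in this lesson's code. The symbols $\eta$ (learning rate), $B$ (index set of the current batch), $m$ (mini-batch size) and $x_i$ (feature row including the leading 1) are notation introduced here for illustration.

```latex
\beta \leftarrow \beta - \eta \cdot \frac{2}{|B|} \sum_{i \in B} x_i \left( x_i^{\top} \beta - y_i \right),
\qquad
|B| =
\begin{cases}
N & \text{batch} \\
1 & \text{stochastic (SGD)} \\
m,\ 1 < m < N & \text{mini-batch}
\end{cases}
```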

Visualize the trade-off

Batch GD is steady but each step is expensive. SGD is fast per step but takes a noisier zig-zag path. Mini-batch sits in the middle.
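
To make "expensive per step" concrete, the sketch below counts the updates in one epoch and the samples touched by each update. `updates_per_epoch` is a hypothetical helper, and N = 500 is an assumed dataset size.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

N = 500  # assumed dataset size for illustration
for name, batch_size in [("Batch", N), ("Mini-batch", 32), ("SGD", 1)]:
    steps = updates_per_epoch(N, batch_size)
    print(f"{name:<11} samples/step={batch_size:<4} updates/epoch={steps}")
```

Fewer samples per step means each update is cheaper, but it takes more updates to see the whole dataset once.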

            Batch                 SGD                  Mini-batch
loss        smooth, slow          noisy, fast          mostly smooth + fast
            ↘                     ⤵︎⤴︎⤵︎⤴︎              ↘↗↘↗↘
            ↘                     ⤵︎⤴︎                  ↘
            ↘ minimum             ⤵︎ minimum            ↘ minimum
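
The zig-zag comes from gradient noise. As a rough sketch, the code below draws many random batches at a fixed β and measures how far each batch gradient strays from the full-data gradient. The synthetic data, the evaluation point β = 0, and the helper `gradient_noise` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 3 * x + 2 + rng.normal(0, 1, n)
X1 = np.c_[np.ones(n), x]          # column of ones for the intercept
beta = np.zeros(2)                 # evaluate the noise at an arbitrary fixed point

def mse_grad(Xb, yb, beta):
    """Gradient of the mean squared error on one batch."""
    return (2 / len(Xb)) * Xb.T @ (Xb @ beta - yb)

full_grad = mse_grad(X1, y, beta)

def gradient_noise(batch_size, n_trials=2000):
    """Average distance between a random batch gradient and the full gradient."""
    dists = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        dists.append(np.linalg.norm(mse_grad(X1[idx], y[idx], beta) - full_grad))
    return float(np.mean(dists))

noise = {b: gradient_noise(b) for b in (1, 32, 500)}
for b, value in noise.items():
    print(f"batch_size={b:<4} mean gradient error={value:.3f}")
```

The error should shrink as the batch grows and vanish (up to floating-point rounding) when the batch is the whole dataset.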

Try all three: same data, same model

import numpy as np

np.random.seed(0)
n = 500
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, n)
X1 = np.c_[np.ones(n), X]

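def gradient_descent(batch_size, n_epochs=10, lr=0.01):
    """Fit y ≈ X1 @ beta by gradient descent on the mean squared error."""
    beta = np.zeros(2)  # [intercept, slope]
    losses = []
    for _ in range(n_epochs):
        # Reshuffle every epoch so SGD/mini-batch see the samples in a new order
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch_idx = idx[start:start + batch_size]
            Xb, yb = X1[batch_idx], y[batch_idx]
            y_pred = Xb @ beta
            # Gradient of the batch MSE with respect to beta
            grad = (2 / len(Xb)) * Xb.T @ (y_pred - yb)
            beta -= lr * grad
        # Track the loss on the full dataset once per epoch
        losses.append(np.mean((X1 @ beta - y) ** 2))
    return beta, losses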

print("Batch GD     (size=500):", gradient_descent(500)[0])
print("Mini-batch GD (size=32):", gradient_descent(32)[0])
print("SGD          (size=1)  :", gradient_descent(1, n_epochs=1)[0])
print("Target:                 [2, 3]")

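All three head toward the target, but they do not get equally close. With these settings batch GD makes only 10 updates (one per epoch), mini-batch makes 160 (16 per epoch), and SGD makes 500 in its single epoch. The slope lands near 3 quickly, while the intercept approaches 2 much more slowly, so expect the batch GD intercept to lag the furthest. Each mini-batch or SGD step is far cheaper than a full-batch step, which is why they make more progress per pass over the data.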
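
One way to judge how close a run gets is to compare it with the closed-form least-squares solution. The sketch below regenerates the same kind of synthetic data and runs full-batch gradient descent for a small and a large number of updates. `batch_gd` is a hypothetical helper and the step counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 3 * x + 2 + rng.normal(0, 1, n)
X1 = np.c_[np.ones(n), x]

# Closed-form least-squares fit: the point gradient descent is heading toward
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)

def batch_gd(n_steps, lr=0.01):
    """Full-batch gradient descent on the MSE, starting from beta = 0."""
    beta = np.zeros(2)
    for _ in range(n_steps):
        beta -= lr * (2 / n) * X1.T @ (X1 @ beta - y)
    return beta

for n_steps in (10, 5000):
    beta = batch_gd(n_steps)
    gap = np.linalg.norm(beta - beta_ols)
    print(f"{n_steps:>5} steps: beta={beta.round(3)}, distance to least squares={gap:.4f}")
```

Under these assumptions the 10-step run should sit well away from the least-squares fit, mostly in the intercept, while the long run should match it closely.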

Which to use?

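In deep learning frameworks such as PyTorch and TensorFlow, mini-batch training is the standard, and it is what you should reach for in most cases. (scikit-learn's SGDRegressor is an exception: it updates one sample at a time.)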

Batch size 32-128 — small datasets, smaller batches.
Batch size 256-512 — large datasets / GPUs, larger batches.
Pure SGD (batch_size=1) — when you can only stream data one sample at a time.
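
For the streaming case, a minimal sketch might look like the following. The generator `sample_stream` is a hypothetical stand-in for a real data source, and the learning rate of 0.005 is an assumption chosen to keep single-sample updates stable for features in [0, 10].

```python
import numpy as np

def sample_stream(n_samples, seed=0):
    """Hypothetical data source: yields one (x, y) pair at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x = rng.uniform(0, 10)
        yield x, 3 * x + 2 + rng.normal(0, 1)

def streaming_sgd(stream, lr=0.005):
    """Pure SGD: update beta after every single sample, never storing the data."""
    beta = np.zeros(2)                     # [intercept, slope]
    for x, y in stream:
        features = np.array([1.0, x])
        error = features @ beta - y
        beta -= lr * 2 * error * features  # gradient of (prediction - y)^2
    return beta

beta = streaming_sgd(sample_stream(5000))
print("Streaming SGD estimate [intercept, slope]:", beta)
```

Because the step size is constant, the estimate keeps jittering around the target instead of settling exactly on it.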

In sklearn

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(0)
X = np.random.uniform(0, 10, 1000).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 1000)
X_scaled = StandardScaler().fit_transform(X)

sgd = SGDRegressor(
    loss="squared_error",
    learning_rate="invscaling",
    eta0=0.01,
    max_iter=50,
    random_state=42,
).fit(X_scaled, y)

print("Final β:", sgd.coef_, sgd.intercept_)

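SGDRegressor updates the weights one sample at a time, so it is plain SGD rather than mini-batch, and there is no batch-size parameter to set. It belongs to the same family of algorithms, and if your data arrives in chunks, partial_fit lets you train incrementally.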

What you learned

Three flavors of gradient descent: Batch, Stochastic, Mini-batch.
Mini-batch is the default for most modern training, especially in deep learning.
Batch size is a trade-off between speed-per-step and stability.

Practice

What does this print?

Expected: True

# Mini-batch GD: each step uses a subset, not the whole dataset
# For 1000 samples and batch_size=32, one epoch has 1000 // 32 = 31 full batches
# (the 8 leftover samples form a smaller 32nd batch unless they are dropped)
n_samples = 1000
batch_size = 32
steps_per_epoch = n_samples // batch_size
print(steps_per_epoch == 31)

Fix the bug so that this prints True: stochastic GD uses ONE sample per step (not all samples)

Expected: True

# Batch GD = use full dataset each step. SGD = one sample per step.
n_samples = 100
batch_gd_size = n_samples       # batch GD
sgd_size = n_samples            # bug: SGD should be 1
print(sgd_size == 1)

Quiz: Quick check

What you remember

Q1. What's the difference between Batch GD, Mini-batch GD, and SGD?

Number of samples used per update: full dataset (Batch), small subset (Mini-batch), one sample (SGD)
Different algorithms entirely
Loss functions differ
No real difference

Why: All three compute a gradient and step in the negative direction; they differ only in how much data they use per step. Mini-batch (batch sizes of roughly 32-512) is the sweet spot used in practice.

Q2. Why is SGD's path noisy compared to Batch GD?

SGD uses random initialization
Each step's gradient is computed from a single sample, so it's a noisy estimate of the true gradient
SGD uses a higher learning rate
Bug in implementations

Why: A single sample's gradient varies a lot from one sample to the next. Averaging over a mini-batch (or the full dataset) smooths the noise. That noise is not purely a drawback: it can help the optimizer escape shallow local minima and saddle points.

Q3. What's the typical mini-batch size used in deep learning?

1
The full dataset
32, 64, 128, 256 — powers of 2 for GPU efficiency
Always 1000

Why: Powers of 2 are the convention because they tend to map well onto GPU memory and kernels. The choice is a trade-off: a smaller batch gives noisier gradients (which can help generalization), while a larger batch gives more stable gradients and better hardware utilization (less wall-clock time per epoch).

Common doubts

Why is SGD with momentum better than plain SGD?

Momentum accumulates a decaying sum of past gradients, so the optimizer "remembers" the direction it has been moving. This helps it roll through small bumps in the loss surface, like a ball rolling downhill with inertia. SGD with momentum is the standard variant in deep learning.
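
As a sketch of the idea, the code below compares plain gradient descent with the classical (heavy-ball) momentum update on a small ill-conditioned quadratic. The test function, the learning rate of 0.01 and the momentum coefficient of 0.9 are assumptions for illustration; deep learning libraries implement several closely related variants.

```python
import numpy as np

curvature = np.array([1.0, 100.0])        # one flat direction, one steep direction

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def descend(momentum, lr=0.01, n_steps=100):
    """Gradient descent with a velocity term; momentum=0 gives plain GD."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        velocity = momentum * velocity - lr * grad(w)  # keep part of the previous step
        w = w + velocity
    return loss(w)

plain_loss = descend(momentum=0.0)
momentum_loss = descend(momentum=0.9)
print(f"plain GD loss:  {plain_loss:.6f}")
print(f"momentum loss:  {momentum_loss:.6f}")
```

On this toy problem the flat direction is what slows plain gradient descent down; the velocity term lets the optimizer keep building speed along it.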

Adam, RMSprop, AdaGrad — when to use which?

Adam is the usual default and works well in most cases. RMSprop is similar (roughly Adam without the momentum term). AdaGrad is rarely used now because its effective learning rate shrinks too aggressively. Most deep learning projects start with Adam(lr=1e-3).
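
For reference, here is a minimal NumPy sketch of the Adam update with its usual default coefficients (β1 = 0.9, β2 = 0.999, ε = 1e-8), applied to a one-dimensional quadratic. The helper `adam_minimize`, the toy objective and the learning rate of 0.1 are assumptions chosen so the example converges in a few hundred steps; it is not a substitute for a library optimizer.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    """Minimal Adam loop: momentum (m) plus a per-parameter step-size scale (v)."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # running mean of gradients
    v = np.zeros_like(w)   # running mean of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2 * (w - 3), w0=[0.0])
print("Adam estimate of the minimizer:", w_final)
```

Dropping the `m` term (using `g` directly in the update) gives an RMSprop-like rule, which is the sense in which the two are related.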

Does the choice of optimizer matter as much as the learning rate?

The learning rate usually matters more. A well-tuned SGD often matches Adam. Adam is popular because it needs less tuning: its adaptive per-parameter step sizes tolerate a wider range of lr choices. For competitive results, tune both.
