Types of Gradient Descent¶
The basic gradient descent recipe has a question to answer at every step:
How many samples do I look at before each
βupdate?
The answer gives three flavors:
| Type | Samples per step | Pros | Cons |
|---|---|---|---|
| Batch | All N | Stable, smooth path | Slow on big data |
| Stochastic (SGD) | 1 | Fast, can escape saddles | Noisy convergence |
| Mini-batch | 32 – 512 | Best of both — production default | Tune batch size |
Visualize the trade-off¶
Batch GD is steady but each step is expensive. SGD is fast per step but takes a noisier zig-zag path. Mini-batch sits in the middle.
Batch SGD Mini-batch
loss smooth, slow noisy, fast mostly smooth + fast
↘ ⤵︎⤴︎⤵︎⤴︎ ↘↗↘↗↘
↘ ⤵︎⤴︎ ↘
↘ minimum ⤵︎ minimum ↘ minimum
Try all three — same data, same model¶
import numpy as np
np.random.seed(0)
n = 500
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, n)
X1 = np.c_[np.ones(n), X]
def gradient_descent(batch_size, n_epochs=10, lr=0.01):
beta = np.zeros(2)
losses = []
for _ in range(n_epochs):
# Shuffle for SGD/mini-batch
idx = np.random.permutation(n)
for start in range(0, n, batch_size):
batch_idx = idx[start:start + batch_size]
Xb, yb = X1[batch_idx], y[batch_idx]
y_pred = Xb @ beta
grad = (2 / len(Xb)) * Xb.T @ (y_pred - yb)
beta -= lr * grad
losses.append(np.mean((X1 @ beta - y) ** 2))
return beta, losses
print("Batch GD (size=500):", gradient_descent(500)[0])
print("Mini-batch GD (size=32):", gradient_descent(32)[0])
print("SGD (size=1) :", gradient_descent(1, n_epochs=1)[0])
print("Target: [2, 3]")
All three converge — but mini-batch and SGD get there with cheaper per-step compute.
Which to use?¶
In modern frameworks (sklearn, PyTorch, TensorFlow), mini-batch is the default and what you should use 95% of the time.
- Batch size 32-128 — small datasets, smaller batches.
- Batch size 256-512 — large datasets / GPUs, larger batches.
- Pure SGD (batch_size=1) — when you can only stream data one sample at a time.
In sklearn¶
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
X = np.random.uniform(0, 10, 1000).reshape(-1, 1)
y = 3 * X.ravel() + 2 + np.random.normal(0, 1, 1000)
X_scaled = StandardScaler().fit_transform(X)
sgd = SGDRegressor(
loss="squared_error",
learning_rate="invscaling",
eta0=0.01,
max_iter=50,
random_state=42,
).fit(X_scaled, y)
print("Final β:", sgd.coef_, sgd.intercept_)
SGDRegressor does mini-batch internally — you don't choose batch size directly, but the algorithm is the same family.
What you learned¶
- Three flavors of gradient descent: Batch, Stochastic, Mini-batch.
- Mini-batch is the default for nearly everything modern.
- Batch size is a trade-off between speed-per-step and stability.
Practice¶
What does this print?
Expected: True
Stochastic GD uses ONE sample per step (not all samples)
Expected: True
Quiz — Quick check¶
What you remember
Q1. What's the difference between Batch GD, Mini-batch GD, and SGD?
- Number of samples used per update: full dataset (Batch), small subset (Mini-batch), one sample (SGD)
- Different algorithms entirely
- Loss functions differ
- No real difference
Why: They all compute gradient and step in the negative direction — they differ only in how much data they use per step. Mini-batch (batch_size 32-256) is the sweet spot used in practice.
Q2. Why is SGD's path noisy compared to Batch GD?
- SGD uses random initialization
- Each step's gradient is computed from a single sample, so it's a noisy estimate of the true gradient
- SGD uses a higher learning rate
- Bug in implementations
Why: A single sample's gradient varies a lot. Averaging over a mini-batch (or the full dataset) smooths the noise. The noise in SGD actually helps escape shallow local minima.
Q3. What's the typical mini-batch size used in deep learning?
- 1
- The full dataset
- 32, 64, 128, 256 — powers of 2 for GPU efficiency
- Always 1000
Why: GPUs process powers-of-2 sizes most efficiently. The choice trades off: smaller batch = more noise (helps generalization), larger batch = more stable gradients (faster wall-clock convergence per epoch).
Common doubts¶
Why is SGD with momentum better than plain SGD?
Momentum accumulates past gradients, so the optimizer "remembers" the direction it's been moving. Helps roll through small ups and downs of the loss surface — similar to a ball rolling down a hill with inertia. The standard variant in deep learning.
Adam, RMSprop, AdaGrad — when to use which?
Adam is the default — works well in most cases. RMSprop is similar (Adam without momentum). AdaGrad is rarely used now — its learning rate shrinks too aggressively. Most deep learning starts with Adam(lr=1e-3).
Does the choice of optimizer matter as much as the learning rate?
The learning rate matters more. A well-tuned SGD often matches Adam. The reason Adam is popular is less tuning — its adaptive learning rate forgives a wider range of lr choices. For competitive results, tune both.