Real-World Examples
You've learned the pieces. Here are five small projects, plus a bonus, that combine them.
1. Feature scaling for machine learning
In ML, you often need each feature on a comparable scale.
import numpy as np
rng = np.random.default_rng(0)
# Imagine 10 samples × 4 features
data = np.column_stack([
    rng.integers(20, 60, size=10),          # age
    rng.normal(170, 8, size=10),            # height_cm
    rng.normal(70, 15, size=10),            # weight_kg
    rng.integers(20000, 200000, size=10),   # income
])
print("raw data:")
print(data.round(1))
# Standardize each column to mean=0, std=1
mean = data.mean(axis=0)
std = data.std(axis=0)
scaled = (data - mean) / std
print("\nscaled:")
print(scaled.round(2))
print("\nmean(scaled):", scaled.mean(axis=0).round(3))
print("std (scaled):", scaled.std(axis=0).round(3))
That is what sklearn.preprocessing.StandardScaler does by default, here in three lines of NumPy. Both use the population standard deviation (ddof=0).
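One case the three-line version does not cover is a constant column, where the standard deviation is 0 and the division produces NaN. The sketch below is one possible way to guard against that; the helper name standardize and the choice to divide by 1 for constant columns are assumptions made for illustration, not part of the example above.

```python
import numpy as np

def standardize(X):
    """Scale each column of X to mean 0 and std 1 (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: for a constant column (std == 0) divide by 1 instead,
    # so the column becomes all zeros rather than NaN.
    safe_std = np.where(std == 0, 1.0, std)
    return (X - mean) / safe_std

X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])   # third column is constant
Z = standardize(X)
print(Z)
```

The first two columns come out with mean 0 and std 1; the constant column is mapped to zeros.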
2. K-means clustering from scratch
A famous unsupervised algorithm in ~20 lines:
import numpy as np
rng = np.random.default_rng(42)
# Generate 3 blobs of 2D points
centers_true = np.array([[0, 0], [5, 5], [-5, 4]])
points = np.vstack([
    centers_true[i] + rng.normal(0, 1, size=(30, 2))
    for i in range(3)
])
# Initialize 3 centroids randomly
k = 3
centroids = points[rng.choice(len(points), k, replace=False)]
# Iterate
for step in range(15):
    # Assign each point to the nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(centroids, new_centroids):
        break
    centroids = new_centroids
print(f"Converged in {step+1} iterations")
print("Found centroids:")
print(centroids.round(2))
print("True centroids:")
print(centers_true)
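That's the entire algorithm. NumPy broadcasting and vectorization keep it concise and fast. One caveat: if a centroid ends up with no assigned points, its mean is NaN, and this version does not handle that case.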
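The distance line is the densest part of the loop, so it may help to see the shapes one step at a time. The following is a small sketch on made-up data (four points, two centroids) rather than the blobs above; the intermediate names a, b and diff are introduced only for illustration.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])  # shape (4, 2)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])                        # shape (2, 2)

a = points[:, None, :]      # shape (4, 1, 2): one row per point
b = centroids[None, :, :]   # shape (1, 2, 2): one column per centroid
diff = a - b                # broadcasts to (4, 2, 2): every point minus every centroid

# Euclidean norm over the coordinate axis leaves a (points x centroids) table
distances = np.linalg.norm(diff, axis=2)   # shape (4, 2)
labels = distances.argmin(axis=1)          # index of the nearest centroid per point

print(diff.shape, distances.shape)  # (4, 2, 2) (4, 2)
print(labels)                       # [0 0 1 1]
```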
3. Image processing: invert a grayscale image
import numpy as np
# Fake an 8x8 grayscale image (values 0-255)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
print("Original:")
print(img)
# Invert: each pixel x → 255 - x
inverted = 255 - img
print("\nInverted:")
print(inverted)
# Brighten: add 50, but clip at 255
bright = np.clip(img.astype(int) + 50, 0, 255).astype(np.uint8)
print("\nBrightened:")
print(bright)
# Threshold: white if > 128 else black
binary = np.where(img > 128, 255, 0).astype(np.uint8)
print("\nBinary:")
print(binary)
Real 8-bit grayscale images have the same form: OpenCV returns them as 2D NumPy arrays of uint8, and a PIL image becomes one after np.asarray(image).
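The brightening step above converts to int before adding 50. A short sketch of why: uint8 arithmetic wraps around at 256, so adding directly can turn bright pixels dark. The three pixel values here are made up for illustration.

```python
import numpy as np

pixels = np.array([10, 200, 250], dtype=np.uint8)

# Adding in uint8 wraps modulo 256: 250 + 50 = 300 -> 300 - 256 = 44
wrapped = pixels + np.uint8(50)

# Widen first, add, clip to the valid range, then convert back
safe = np.clip(pixels.astype(int) + 50, 0, 255).astype(np.uint8)

print(wrapped)  # [ 60 250  44]
print(safe)     # [ 60 250 255]
```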
4. Time-series rolling average
A "smoothing" trick:
import numpy as np
# Fake daily temperature for 30 days
rng = np.random.default_rng(0)
days = np.arange(30)
temps = 25 + 5 * np.sin(2 * np.pi * days / 30) + rng.normal(0, 2, 30)
# 7-day rolling average
window = 7
rolling = np.array([temps[i:i+window].mean() for i in range(len(temps) - window + 1)])
print("Original temps (last 10):")
print(temps[-10:].round(1))
print(f"\nRolling avg ({window}-day, last 10):")
print(rolling[-10:].round(1))
For huge arrays, np.convolve with a uniform kernel is faster, but this clear version is enough for most cases.
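As a rough sketch of the np.convolve alternative: convolving with a kernel of window equal weights (each 1/window) in mode="valid" should give the same values as the list-comprehension version. The data below is simplified random noise rather than the temperature series above.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = 25 + rng.normal(0, 2, 30)
window = 7

# Loop version: mean of each 7-element slice
loop_avg = np.array([temps[i:i + window].mean()
                     for i in range(len(temps) - window + 1)])

# Convolution version: a uniform kernel computes the same averages;
# mode="valid" keeps only positions where the window fits entirely
kernel = np.ones(window) / window
conv_avg = np.convolve(temps, kernel, mode="valid")

print(conv_avg.shape)                   # (24,)
print(np.allclose(loop_avg, conv_avg))  # True
```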
5. Monte Carlo: stock price simulation
A classic finance problem — simulating possible future prices:
import numpy as np
rng = np.random.default_rng(42)
S0 = 100 # starting price
mu = 0.0005 # daily expected return (~12% annualized)
sigma = 0.02 # daily volatility (~32% annualized)
days = 252 # one trading year
n_sims = 1000 # number of random paths
# Each row is one simulated path
returns = rng.normal(mu, sigma, size=(n_sims, days))
paths = S0 * np.exp(np.cumsum(returns, axis=1))
final_prices = paths[:, -1]
print(f"After {days} days:")
print(f" median end price : ${np.median(final_prices):.2f}")
print(f" 5th percentile : ${np.percentile(final_prices, 5):.2f}")
print(f" 95th percentile : ${np.percentile(final_prices, 95):.2f}")
print(f" chance < $90 : {(final_prices < 90).mean() * 100:.1f}%")
print(f" chance > $130 : {(final_prices > 130).mean() * 100:.1f}%")
Vectorized — 1000 paths × 252 days in milliseconds.
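A quick sanity check is possible if the draws are read as daily log returns, which is what the exp(cumsum(...)) form implies. Under that reading the total log return is normal with mean mu * days, so the median end price should be close to S0 * exp(mu * days). The sketch below compares the two; the variable names theoretical_median and sample_median are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
S0, mu, sigma, days, n_sims = 100, 0.0005, 0.02, 252, 1000

# Assumption: each draw is a daily log return, as in the simulation above
log_returns = rng.normal(mu, sigma, size=(n_sims, days))
final_prices = S0 * np.exp(log_returns.sum(axis=1))

# Median of a lognormal variable is exp(mean of the underlying normal)
theoretical_median = S0 * np.exp(mu * days)
sample_median = np.median(final_prices)

print(f"theoretical median: ${theoretical_median:.2f}")
print(f"simulated median  : ${sample_median:.2f}")
```

With 1000 paths the two numbers should agree to within a few percent; a large gap would suggest a bug in the simulation.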
6. Bonus: sliding window for pattern detection
import numpy as np
rng = np.random.default_rng(0)
signal = rng.integers(0, 5, size=20)
print("signal:", signal)
# Find all 3-element runs that sum to >= 10
window = 3
threshold = 10
found = []
for i in range(len(signal) - window + 1):
    s = signal[i:i+window].sum()
    if s >= threshold:
        found.append((i, signal[i:i+window].tolist(), s))
for idx, win, total in found:
    print(f" at index {idx}: {win} → sum = {total}")
Wrap-up: when to reach for NumPy
| Task | NumPy | Pandas |
|---|---|---|
| Math on a list of numbers | always | not needed |
| 2D / 3D / 4D numeric data | always | sometimes (DataFrames are 2D) |
| Mixed column types (numeric + string) | hard | natural |
| Named columns | awkward | natural |
| Time series with date index | awkward | natural |
| Image / audio / scientific signals | natural | rarely |
| Deep learning tensors | yes (then PyTorch / TF) | not needed |
For most data work in the wild, you'll combine the two — Pandas on top, NumPy underneath.
What you've learned
- Why NumPy (speed, conciseness).
- Creating arrays from lists, zeros/ones/arange/linspace, random.
- Inspecting with shape/dtype/ndim/size.
- Indexing — basic, fancy, boolean.
- Reshaping — reshape, transpose, flatten.
- Math — element-wise ops, ufuncs.
- Broadcasting — operating on differently-shaped arrays.
- Aggregations — sum/mean/std along axes.
- Sorting & searching — sort, argsort, where.
- Linear algebra — dot, matmul, inv, solve, eig, svd.
- Random — generators, distributions, sampling.
- Stacking & splitting — combining and breaking apart.
- Masks — filter, count, conditional modify.
- Real-world — feature scaling, k-means, image processing, Monte Carlo.
You're ready to dive into Pandas, Machine Learning, or any scientific Python library — they're all built on this foundation.
Practice
What does this print?
Expected: [105 155 100]
Standardize columns so each has mean=0 and std=1
Expected: [0. 0. 0.]
Quiz: Quick check
What you remember
Q1. To standardize columns of a (N, F) matrix, you should compute the mean with…
- axis=0 (collapses rows → per-column mean)
- axis=1 (collapses cols → per-row mean)
- No axis (global mean)
- axis=-1
Why: Standardization is per feature, so you want one mean and std per column. axis=0 gives shape (F,), which broadcasts back to (N, F) for subtraction.
Q2. When simulating 1000 random walks of 252 days each, why use a (1000, 252) array instead of a loop?
- Loops are deprecated in NumPy
- Vectorized operations on the whole array are 10–100× faster than Python loops
- To save memory
- Random numbers can't be generated in a loop
Why: A single rng.normal(mu, sigma, size=(1000, 252)) call generates all the random numbers in compiled C. Drawing them one at a time in a Python loop would take 252,000 interpreter-level calls.
Q3. In image processing, np.clip(arr + 50, 0, 255) is used to…
- Resize the image
- Convert to grayscale
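- Brighten the image while keeping pixel values in the valid [0, 255] range
- Sharpen edges
Why: Adding 50 to every pixel brightens, but pixel values must stay in [0, 255]. np.clip enforces the bounds without manual conditional code. This only works if arr has a wider integer type than uint8 (for example arr.astype(int)); with uint8, arr + 50 wraps around before np.clip sees it.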
Common doubts
When should I leave NumPy and use Pandas?
Use Pandas when you have mixed-type columns (numbers + strings + dates), named columns, or time-indexed data. NumPy is for homogeneous numeric arrays. In practice, you'll use both — Pandas on top for the data layer, NumPy underneath for math.
Is implementing k-means from scratch realistic in production?
The implementation here is great for understanding, but for production use sklearn.cluster.KMeans. It has better initialization (k-means++), convergence checks, multiple restarts, and handling of edge cases. Rolling your own is excellent practice; using sklearn is excellent engineering.
How do I move from NumPy to PyTorch / TensorFlow tensors?
The mental model is the same: multi-dimensional arrays with broadcasting. The APIs are intentionally NumPy-flavored. In PyTorch, for example: torch.zeros((2, 3)), tensor.reshape(...), tensor.sum(dim=0) (dim instead of axis). Also in PyTorch, .cuda() moves a tensor to the GPU and .requires_grad_() turns on gradient tracking.