Real-World Examples

You've learned the pieces. Here are five small projects, plus a bonus, that combine them.

1. Feature scaling for machine learning

In ML, you often need each feature on a comparable scale.

import numpy as np

rng = np.random.default_rng(0)

# Imagine 10 samples × 4 features
data = np.column_stack([
    rng.integers(20, 60, size=10),         # age
    rng.normal(170, 8, size=10),           # height_cm
    rng.normal(70, 15, size=10),           # weight_kg
    rng.integers(20000, 200000, size=10),  # income
])
print("raw data:")
print(data.round(1))

# Standardize each column to mean=0, std=1
mean = data.mean(axis=0)
std  = data.std(axis=0)
scaled = (data - mean) / std

print("\nscaled:")
print(scaled.round(2))
print("\nmean(scaled):", scaled.mean(axis=0).round(3))
print("std (scaled):", scaled.std(axis=0).round(3))

This is the same computation that sklearn.preprocessing.StandardScaler performs with its default settings, written in two lines of NumPy.
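One edge case the two-line version does not cover is a constant column: its standard deviation is 0, so the division produces NaN. Below is a minimal sketch of one way to guard against that. The helper name standardize is hypothetical (it is not a NumPy or scikit-learn function), and leaving a constant column at 0 is an assumption made for illustration.

```python
import numpy as np

def standardize(data):
    """Scale each column to mean 0 and std 1 (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Assumption: a constant column (std == 0) is divided by 1 instead,
    # so it ends up as all zeros rather than NaN
    safe_std = np.where(std == 0, 1.0, std)
    return (data - mean) / safe_std

demo = np.array([[1.0, 5.0],
                 [2.0, 5.0],
                 [3.0, 5.0]])   # second column is constant
scaled_demo = standardize(demo)
print(scaled_demo)
```

The first column is standardized as usual; the second comes out as zeros instead of NaN.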

2. K-means clustering — from scratch

A famous unsupervised algorithm in ~20 lines:

import numpy as np

rng = np.random.default_rng(42)

# Generate 3 blobs of 2D points
centers_true = np.array([[0, 0], [5, 5], [-5, 4]])
points = np.vstack([
    centers_true[i] + rng.normal(0, 1, size=(30, 2))
    for i in range(3)
])

# Initialize 3 centroids randomly
k = 3
centroids = points[rng.choice(len(points), k, replace=False)]

# Iterate
for step in range(15):
    # Assign each point to the nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

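    # Move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster ends up empty, to avoid a NaN mean)
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])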

    if np.allclose(centroids, new_centroids):
        break
    centroids = new_centroids

print(f"Converged in {step+1} iterations")
print("Found centroids:")
print(centroids.round(2))
print("True centroids:")
print(centers_true)

That's the entire algorithm. NumPy broadcasting + vectorization made it concise and fast.
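The broadcasting in the distance step is easier to follow by looking at shapes alone. The sketch below uses placeholder arrays of the same sizes as the example (90 points, 3 centroids) and prints the shape at each stage.

```python
import numpy as np

points = np.zeros((90, 2))      # N points in 2D (placeholder values)
centroids = np.ones((3, 2))     # k centroids in 2D (placeholder values)

# (90, 1, 2) - (1, 3, 2) broadcasts to (90, 3, 2):
# every point minus every centroid
diff = points[:, None, :] - centroids[None, :, :]
print(diff.shape)

# Collapsing the coordinate axis leaves one distance per (point, centroid) pair
distances = np.linalg.norm(diff, axis=2)
print(distances.shape)

# Collapsing the centroid axis leaves the nearest centroid index per point
labels = distances.argmin(axis=1)
print(labels.shape)
```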

3. Image processing — invert a grayscale image

import numpy as np

# Fake an 8x8 grayscale image (values 0-255)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)

print("Original:")
print(img)

# Invert: each pixel x → 255 - x
inverted = 255 - img
print("\nInverted:")
print(inverted)

# Brighten: add 50, but clip at 255
bright = np.clip(img.astype(int) + 50, 0, 255).astype(np.uint8)
print("\nBrightened:")
print(bright)

# Threshold: white if > 128 else black
binary = np.where(img > 128, 255, 0).astype(np.uint8)
print("\nBinary:")
print(binary)

Real 8-bit grayscale images have the same form once loaded: OpenCV returns them as 2D NumPy arrays of uint8, and a PIL image becomes one after conversion with np.asarray.
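The brighten step above casts to int before adding 50. A small sketch of why that matters, using a made-up four-pixel array: uint8 arithmetic wraps around at 256, so clipping after the addition would come too late.

```python
import numpy as np

pixels = np.array([5, 55, 200, 250], dtype=np.uint8)

# Adding directly keeps the uint8 dtype, so 250 + 50 = 300 wraps to 300 - 256 = 44
wrapped = pixels + 50

# Widening first lets np.clip see the true sum before converting back
safe = np.clip(pixels.astype(int) + 50, 0, 255).astype(np.uint8)

print(wrapped)
print(safe)
```

Only the last pixel differs: it wraps to a dark value in the first version and saturates at 255 in the second.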

4. Time-series rolling average

A "smoothing" trick:

import numpy as np

# Fake daily temperature for 30 days
rng = np.random.default_rng(0)
days = np.arange(30)
temps = 25 + 5 * np.sin(2 * np.pi * days / 30) + rng.normal(0, 2, 30)

# 7-day rolling average
window = 7
rolling = np.array([temps[i:i+window].mean() for i in range(len(temps) - window + 1)])

print("Original temps (last 10):")
print(temps[-10:].round(1))
print(f"\nRolling avg ({window}-day, last 10):")
print(rolling[-10:].round(1))

For huge arrays, np.convolve is faster — but this clear version is enough for most cases.
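As a sketch of the np.convolve alternative mentioned above: a rolling mean is a convolution with a kernel of equal weights that sum to 1. The names kernel, rolling_conv, and rolling_loop are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(30)
temps = 25 + 5 * np.sin(2 * np.pi * days / 30) + rng.normal(0, 2, 30)

window = 7
kernel = np.ones(window) / window   # equal weights that sum to 1

# mode="valid" keeps only positions where the window fits entirely
rolling_conv = np.convolve(temps, kernel, mode="valid")

# The explicit loop version, for comparison
rolling_loop = np.array([temps[i:i+window].mean()
                         for i in range(len(temps) - window + 1)])

print(np.allclose(rolling_conv, rolling_loop))
print(rolling_conv.shape)
```

Both versions produce len(temps) - window + 1 values, one per full window.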

5. Monte Carlo — stock price simulation

A classic finance problem — simulating possible future prices:

import numpy as np

rng = np.random.default_rng(42)

S0 = 100             # starting price
mu = 0.0005          # daily expected return (~12% annualized)
sigma = 0.02         # daily volatility (~32% annualized)
days = 252           # one trading year
n_sims = 1000        # number of random paths

# Each row is one simulated path
returns = rng.normal(mu, sigma, size=(n_sims, days))
paths   = S0 * np.exp(np.cumsum(returns, axis=1))

final_prices = paths[:, -1]

print(f"After {days} days:")
print(f"  median end price : ${np.median(final_prices):.2f}")
print(f"  5th percentile   : ${np.percentile(final_prices,  5):.2f}")
print(f"  95th percentile  : ${np.percentile(final_prices, 95):.2f}")
print(f"  chance < $90     : {(final_prices < 90).mean() * 100:.1f}%")
print(f"  chance > $130    : {(final_prices > 130).mean() * 100:.1f}%")

Vectorized — 1000 paths × 252 days in milliseconds.
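The cumsum line works because the simulated returns are used as log returns: multiplying by exp(r) each day is the same as taking exp of the running sum, since exp(a) * exp(b) = exp(a + b). A small sketch that checks this on a single five-day path (the path length is chosen only to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(42)
S0 = 100
returns = rng.normal(0.0005, 0.02, size=5)   # five daily log returns

# Step by step: multiply the price by exp(r) each day
price = S0
stepwise = []
for r in returns:
    price = price * np.exp(r)
    stepwise.append(price)

# Vectorized: exp of the running sum of log returns
vectorized = S0 * np.exp(np.cumsum(returns))

print(np.allclose(stepwise, vectorized))
```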

6. Bonus — sliding window for pattern detection

import numpy as np

rng = np.random.default_rng(0)
signal = rng.integers(0, 5, size=20)
print("signal:", signal)

# Find all 3-element runs that sum to >= 10
window = 3
threshold = 10

found = []
for i in range(len(signal) - window + 1):
    s = signal[i:i+window].sum()
    if s >= threshold:
        found.append((i, signal[i:i+window].tolist(), s))

for idx, win, total in found:
    print(f"  at index {idx}: {win} → sum = {total}")
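The same search can be written without the Python loop. This sketch assumes NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view is available; the names windows, sums, and hits are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
signal = rng.integers(0, 5, size=20)

window = 3
threshold = 10

# Each row is one window; this is a view onto signal, not a copy
windows = sliding_window_view(signal, window)
sums = windows.sum(axis=1)

# Start indices of the windows whose sum reaches the threshold
hits = np.flatnonzero(sums >= threshold)

for i in hits:
    print(f"  at index {i}: {windows[i].tolist()} → sum = {sums[i]}")
```

For a 20-element signal the loop version is just as good; the view-based version pays off on long signals.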

Wrap-up — when to reach for NumPy

Task                                  | NumPy                   | Pandas
--------------------------------------|-------------------------|---------------------------------------
Math on a list of numbers             | always                  |
2D / 3D / 4D numeric data             | always                  | sometimes (DataFrames are usually 2D)
Mixed column types (numeric + string) | hard                    | natural
Named columns                         | awkward                 | natural
Time series with date index           | awkward                 | natural
Image / audio / scientific signals    | yes                     | rarely
Deep learning tensors                 | yes (then PyTorch / TF) |

For most data work in the wild, you'll combine the two — Pandas on top, NumPy underneath.

What you've learned

  • Why NumPy (speed, conciseness).
  • Creating arrays from lists, zeros/ones/range/linspace, random.
  • Inspecting with shape/dtype/ndim/size.
  • Indexing — basic, fancy, boolean.
  • Reshaping — reshape, transpose, flatten.
  • Math — element-wise ops, ufuncs.
  • Broadcasting — operating on differently-shaped arrays.
  • Aggregations — sum/mean/std along axes.
  • Sorting & searching — sort, argsort, where.
  • Linear algebra — dot, matmul, inv, solve, eig, svd.
  • Random — generators, distributions, sampling.
  • Stacking & splitting — combining and breaking apart.
  • Masks — filter, count, conditional modify.
  • Real-world — feature scaling, k-means, image processing, Monte Carlo.

You're ready to dive into Pandas, Machine Learning, or any scientific Python library — they're all built on this foundation.

Practice

What does this print?

Expected: [105 155 255]

import numpy as np
img = np.array([5, 55, 200])
print(np.clip(img + 100, 0, 255))

Fix the code so that each column is standardized to mean=0 and std=1

Expected: [0. 0. 0.]

import numpy as np
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0]])
scaled = data - data.mean()             # bug: subtracts global mean, not per-column
print(scaled.mean(axis=0).round(1))

Quiz — Quick check

What you remember

Q1. To standardize columns of a (N, F) matrix, you should compute the mean with…

  • axis=0 (collapses rows → per-column mean)
  • axis=1 (collapses cols → per-row mean)
  • No axis (global mean)
  • axis=-1

Why: Standardization is per feature, so you want one mean and std per column. axis=0 gives (F,) which broadcasts back to (N, F) for subtraction.
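A quick shape check of the point above, using a small made-up (4, 3) matrix. It also shows the per-row case for contrast, which needs an extra axis before it can broadcast.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # N=4 rows, F=3 features

col_mean = X.mean(axis=0)   # shape (3,): one mean per column
row_mean = X.mean(axis=1)   # shape (4,): one mean per row

centered_cols = X - col_mean            # (4, 3) - (3,) broadcasts directly
centered_rows = X - row_mean[:, None]   # per-row needs shape (4, 1) to broadcast

print(col_mean.shape, row_mean.shape)
print(centered_cols.mean(axis=0))
print(centered_rows.mean(axis=1))
```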

Q2. When simulating 1000 random walks of 252 days each, why use a (1000, 252) array instead of a loop?

  • Loops are deprecated in NumPy
  • Vectorized operations on the whole array are 10–100× faster than Python loops
  • To save memory
  • Random numbers can't be generated in a loop

Why: A single rng.normal(mu, sigma, size=(1000, 252)) call generates all 252,000 random numbers in compiled C. Looping in Python adds hundreds of thousands of interpreter-level calls.

Q3. In image processing, np.clip(arr + 50, 0, 255) is used to…

  • Resize the image
  • Convert to grayscale
  • Brighten the image while keeping pixel values in valid [0, 255] range
  • Sharpen edges

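Why: Adding 50 to every pixel brightens, but pixel values must stay in [0, 255]. np.clip enforces the bounds without manual conditional code. If arr is uint8, cast it to a wider integer type first, as the brighten example does with astype(int); otherwise the addition wraps around before np.clip runs.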

Common doubts

When should I leave NumPy and use Pandas?

Use Pandas when you have mixed-type columns (numbers + strings + dates), named columns, or time-indexed data. NumPy is for homogeneous numeric arrays. In practice, you'll use both — Pandas on top for the data layer, NumPy underneath for math.

Is implementing k-means from scratch realistic in production?

The implementation here is great for understanding, but for production use sklearn.cluster.KMeans: it has better initialization (k-means++), convergence checks, multiple restarts, and handling of edge cases. Rolling your own is excellent practice; using sklearn is excellent engineering.
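To give a feel for what "better initialization" means, here is a rough sketch of k-means++ style seeding in plain NumPy. It is not scikit-learn's implementation. The helper name kmeans_pp_init is hypothetical, and the data reuses the three blobs from the k-means example. The idea is that each new starting centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread the starting points out.

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids with k-means++ style seeding (sketch)."""
    n = len(points)
    # First centroid: one point chosen uniformly at random
    chosen = [rng.integers(n)]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        diff = points[:, None, :] - points[chosen][None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        chosen.append(rng.choice(n, p=d2 / d2.sum()))
    return points[chosen]

rng = np.random.default_rng(42)
centers_true = np.array([[0, 0], [5, 5], [-5, 4]])
points = np.vstack([c + rng.normal(0, 1, size=(30, 2)) for c in centers_true])

init = kmeans_pp_init(points, 3, rng)
print(init.round(2))
```

The result could replace the random rng.choice initialization in the from-scratch loop; the rest of the algorithm stays the same.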

How do I move from NumPy to PyTorch / TensorFlow tensors?

The mental model is the same: multi-dimensional arrays with broadcasting. The APIs are intentionally NumPy-flavored. In PyTorch, for example: torch.zeros((2, 3)), tensor.reshape(...), tensor.sum(dim=0) (dim instead of axis). Call .cuda() to move a tensor to the GPU and .requires_grad_() to track gradients.
