Skip to content

Random Sampling

NumPy's random module powers simulations, sampling, ML data shuffling, and statistical experiments.

The modern way — default_rng

Always use np.random.default_rng() — it's the modern, properly-engineered API.

import numpy as np

rng = np.random.default_rng()
print(rng.random(5))            # 5 floats in [0, 1)
print(rng.integers(0, 10, 8))    # 8 ints in [0, 10)
print(rng.normal(0, 1, 5))       # 5 samples from N(0, 1)

The older np.random.rand(), np.random.randn() still work but are deprecated for new code. Use default_rng().

Reproducibility — set a seed

For reproducible results (testing, debugging, papers), pass a seed:

import numpy as np

# Same seed = same sequence every time
rng1 = np.random.default_rng(seed=42)
print(rng1.random(3))

rng2 = np.random.default_rng(seed=42)
print(rng2.random(3))           # identical

Uniform distribution

.random(size=...) → floats in [0, 1). Scale and shift for any range:

import numpy as np

rng = np.random.default_rng(0)

print(rng.random(5))                       # [0, 1)
print(rng.random((2, 3)))                   # 2D array

# Scale to [10, 20)
print(rng.uniform(10, 20, size=5))

# Just integers
print(rng.integers(0, 100, size=10))        # [0, 100)
print(rng.integers(0, 100, size=10, endpoint=True))   # [0, 100]

Normal (Gaussian) distribution

mean ± std — the bell curve.

import numpy as np

rng = np.random.default_rng(0)

# Standard normal — mean=0, std=1
samples = rng.standard_normal(10000)
print(f"mean = {samples.mean():.4f} (≈ 0)")
print(f"std  = {samples.std():.4f}  (≈ 1)")

# Custom mean and std
heights = rng.normal(loc=170, scale=10, size=5)
print(f"heights: {heights.round(1)}")

Other distributions

import numpy as np

rng = np.random.default_rng(0)

print("Exponential :", rng.exponential(scale=1.0, size=5))
print("Poisson     :", rng.poisson(lam=3, size=10))
print("Binomial    :", rng.binomial(n=10, p=0.5, size=10))
print("Beta        :", rng.beta(a=2, b=5, size=5))
print("Gamma       :", rng.gamma(shape=2, scale=2, size=5))

choice — pick from a set

import numpy as np

rng = np.random.default_rng(0)

cards = ["A", "K", "Q", "J", "10", "9", "8", "7"]

# Single
print(rng.choice(cards))

# 5 with replacement (default)
print(rng.choice(cards, size=5))

# 3 without replacement
print(rng.choice(cards, size=3, replace=False))

# Weighted — "K" is 5× more likely
weights = [0.5, 2.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
weights = np.array(weights) / sum(weights)        # must sum to 1
print(rng.choice(cards, size=10, p=weights))

permutation and shuffle

import numpy as np

rng = np.random.default_rng(0)

# Returns a shuffled COPY
print(rng.permutation(np.arange(10)))

# Modifies in place
a = np.arange(10)
rng.shuffle(a)
print(a)

# For 2D — shuffles ROWS by default
matrix = np.arange(12).reshape(4, 3)
rng.shuffle(matrix)
print(matrix)

This is how you shuffle training data in ML.

Train/test split — DIY

import numpy as np

rng = np.random.default_rng(42)

# Imagine 100 samples
X = np.arange(100)

# Shuffle indices, take 80% for train
indices = rng.permutation(100)
train_idx = indices[:80]
test_idx  = indices[80:]

X_train, X_test = X[train_idx], X[test_idx]
print(f"train ({len(X_train)}): {sorted(X_train.tolist())[:10]}...")
print(f"test  ({len(X_test)}):  {sorted(X_test.tolist())[:10]}...")

Monte Carlo simulation — estimate π

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Random points in the unit square
x = rng.random(N)
y = rng.random(N)

# Distance from origin
inside_circle = (x**2 + y**2) < 1
pi_estimate = 4 * inside_circle.sum() / N

print(f"Estimated π: {pi_estimate:.5f}")
print(f"True       π: {np.pi:.5f}")
print(f"Error      : {abs(pi_estimate - np.pi):.5f}")

We threw 1M random points into a square, counted how many landed inside a quarter circle. Beautiful.

Sampling with custom probabilities

import numpy as np

rng = np.random.default_rng(0)

# Simulate a loaded die — 6 is twice as likely as anything else
outcomes = np.arange(1, 7)
prob = np.array([1, 1, 1, 1, 1, 2]) / 7

rolls = rng.choice(outcomes, size=10_000, p=prob)

# Count the outcomes
vals, counts = np.unique(rolls, return_counts=True)
for v, c in zip(vals, counts):
    print(f"  {v}: {c:5}  ({c/10000*100:.1f}%)")

Bootstrap — estimate uncertainty

Resampling your data to estimate how stable a statistic is:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=50)

# Original mean
print(f"Original mean: {data.mean():.2f}")

# Bootstrap — sample with replacement many times, compute mean each time
n_bootstraps = 1000
means = []
for _ in range(n_bootstraps):
    sample = rng.choice(data, size=len(data), replace=True)
    means.append(sample.mean())

means = np.array(means)
print(f"Bootstrap mean of means: {means.mean():.2f}")
print(f"95% CI: ({np.percentile(means, 2.5):.2f}, {np.percentile(means, 97.5):.2f})")

A simple but powerful technique for ML evaluation.

Random arrays of any shape

import numpy as np

rng = np.random.default_rng(0)

img_batch = rng.random((4, 8, 8))             # 4 grayscale 8x8 "images"
print("batch shape:", img_batch.shape)

# Same for any distribution
weights = rng.normal(0, 0.01, size=(64, 10))   # NN weights
print("weights shape:", weights.shape, "stats:", weights.mean(), weights.std())

Cheatsheet

Distribution Function
Uniform [0, 1) rng.random(size)
Uniform [a, b) rng.uniform(a, b, size)
Integers [low, high) rng.integers(low, high, size)
Standard normal rng.standard_normal(size)
Normal(μ, σ) rng.normal(mu, sigma, size)
Exponential rng.exponential(scale, size)
Poisson rng.poisson(lam, size)
Binomial rng.binomial(n, p, size)
Pick from list rng.choice(items, size)
Permutation rng.permutation(arr)
Shuffle in place rng.shuffle(arr)

Always start with: rng = np.random.default_rng(seed=42).

Common pitfalls

  • Using np.random.seed() globally — the old API. Use default_rng(seed=...) per rng for cleaner code.
  • rng.random() vs rng.uniform(0, 1) — same thing, random is shorthand.
  • rng.integers(0, 10) is exclusive of 10 — pass endpoint=True to include it.
  • Probabilities don't sum to 1rng.choice(..., p=weights) raises if weights doesn't sum to 1.
  • Shuffling 2D arrays — only the first axisrng.shuffle(matrix) shuffles rows, not cells. Use rng.permutation indexing if you want more control.

Practice

What does this print?

Expected: [ 9 77 65 44 9]

import numpy as np
rng = np.random.default_rng(seed=42)
print(rng.integers(0, 100, size=5))

Make this produce the same output every run (it currently changes)

Expected: [0.77395605 0.43887844 0.85859792]

import numpy as np
rng = np.random.default_rng()      # bug: no seed → different output every run
print(rng.random(3))

Quiz — Quick check

What you remember

Q1. Why prefer np.random.default_rng() over np.random.rand()?

  • It's required in NumPy 2.0
  • It's the modern, properly-engineered API with better statistical properties and explicit seeds
  • It's faster
  • np.random.rand is removed

Why: default_rng() returns a Generator object you can seed and pass around. The legacy np.random.rand() and friends share global state, which is bug-prone in larger codebases.

Q2. Does rng.integers(0, 10, size=5) include 10?

  • Yes — both ends inclusive
  • No — high is exclusive by default (pass endpoint=True to include)
  • Only sometimes
  • Depends on the seed

Why: Default is half-open [low, high) — same as Python's range(). To include the endpoint, use rng.integers(0, 10, size=5, endpoint=True).

Q3. What does seeding do?

  • Makes results faster
  • Makes random sequences reproducible — same seed → same numbers every run
  • Removes bias from samples
  • Required for the random module to work

Why: Random number generators are deterministic — they produce a sequence based on a seed. Same seed = same sequence. Essential for testing, debugging, and reproducible papers.

Common doubts

Should I use the same rng across the whole program?

Yes — that way you can seed once at startup and get reproducible results everywhere. Pass rng as an argument to functions that need randomness. Avoid creating new default_rng() calls scattered through code.

rng.choice with replace=False for a large array is slow — why?

replace=False requires NumPy to track which items have been picked. For huge arrays where you need a small unique sample, prefer rng.permutation(N)[:k] (shuffle indices, take the first k). Same result, often faster.

What's the difference between rng.shuffle and rng.permutation?

shuffle modifies the array in place and returns None. permutation returns a shuffled copy. Use permutation when you want the original preserved; use shuffle when you don't care about the original.

What's next

Stacking & Splitting Arrays