Ensembles: Random Forest, AdaBoost, Gradient Boosting, Stacking

1. Why this matters

A single decision tree overfits badly. But:

Average 100 randomized trees → Random Forest (variance reduction).
Train trees sequentially, each fixing the previous one's errors → Boosting (bias reduction).
Train several different models, learn how to weight them → Stacking.

On most tabular ML benchmarks, tree ensembles beat single models. They're robust, accept raw features without scaling, handle missing values natively in some implementations (e.g. XGBoost, LightGBM), and are a fairly safe production default.

2. Mental model

flowchart TB
    subgraph Bagging [Bagging — parallel, average]
      D1[Bootstrap 1] --> T1[Tree 1]
      D2[Bootstrap 2] --> T2[Tree 2]
      D3[Bootstrap N] --> TN[Tree N]
      T1 --> AVG[Average / Majority Vote]
      T2 --> AVG
      TN --> AVG
    end
    subgraph Boosting [Boosting — sequential, residuals]
      M1[Model 1] -->|errors| M2[Model 2 fixes M1] -->|errors| M3[Model 3 fixes M2] --> SUM[Weighted sum]
    end
    subgraph Stacking [Stacking — meta-learner]
      A[Model A] --> META[Meta-model<br/>learns weights]
      B[Model B] --> META
      C[Model C] --> META
      META --> P[Final prediction]
    end

3. Random Forest (Bagging)

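Train n_estimators decision trees, each on a bootstrap sample of the rows, considering only a random subset of features at each split. Predictions are averaged (regression) or combined by majority vote (classification; scikit-learn averages the trees' predicted class probabilities rather than counting hard votes).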

Why it works: individual trees are high-variance, low-bias. Averaging many uncorrelated trees cuts variance. Random feature subsets at each split keep the trees decorrelated.
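The variance argument can be made concrete with the standard formula for the variance of an average of n identically distributed estimators, each with variance σ² and pairwise correlation ρ: ρσ² + (1 − ρ)σ²/n. The sketch below only evaluates that formula; the function name `ensemble_variance` and the example numbers are illustrative assumptions, not part of scikit-learn.

```python
def ensemble_variance(sigma2: float, rho: float, n: int) -> float:
    """Variance of the mean of n estimators, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n


# Compare a single tree with a 300-tree average at different correlation levels
for rho in (0.0, 0.3, 0.9):
    single = ensemble_variance(1.0, rho, 1)
    forest = ensemble_variance(1.0, rho, 300)
    print(f"rho={rho}: 1 tree -> {single:.3f}, 300 trees -> {forest:.3f}")
```

With ρ = 0 the variance shrinks as 1/n. With strongly correlated trees it floors near ρσ² no matter how many trees are added, which is why decorrelating the trees matters as much as adding more of them.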

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,             # full depth — RF rarely overfits in depth
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",        # √p, the classifier default (regressors default to 1.0; p/3 is a common heuristic)
    n_jobs=-1,
    random_state=42,
    class_weight="balanced",    # for imbalanced
).fit(X_tr, y_tr)

print("Test accuracy:", rf.score(X_te, y_te))

# Feature importance — built in
import pandas as pd
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp.head(10))

Out-of-bag score — free CV using samples not in each tree's bootstrap:

rf = RandomForestClassifier(n_estimators=300, oob_score=True, n_jobs=-1).fit(X, y)
print("OOB score:", rf.oob_score_)
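Why out-of-bag samples exist at all: a bootstrap draws n rows with replacement, so some rows are picked several times and others not at all. A minimal simulation, assuming nothing beyond NumPy (the sample size and seed are arbitrary choices), shows that the in-bag share lands close to 1 − 1/e:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# One bootstrap sample: draw n_rows row indices with replacement
sample = rng.integers(0, n_rows, size=n_rows)
in_bag_fraction = np.unique(sample).size / n_rows

print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {1 - in_bag_fraction:.3f}")
print(f"theoretical in-bag share: {1 - np.exp(-1):.3f}")
```

Each tree therefore has roughly a third of the rows available as unseen data, and `oob_score_` aggregates predictions made only by trees that did not train on a given row.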

Key hyperparameters:

Param	Effect
`n_estimators`	More = better, slower. 100-500 typical. Diminishing returns past ~300.
`max_depth`	Deeper trees = lower bias, higher variance. Often leave as `None`.
`min_samples_leaf`	Increase (e.g. 5) to regularize / smooth predictions.
`max_features`	`"sqrt"` is the classifier default; regressors default to `1.0` (all features), and roughly `0.33` (p/3) is a common alternative. Smaller = more diversity between trees.
`class_weight="balanced"`	For imbalanced data.

4. AdaBoost (Adaptive Boosting)

Train a weak learner (scikit-learn's default is a depth-1 tree), find the samples it got wrong, upweight them, train the next learner on the reweighted data, and repeat. The final prediction is a weighted vote in which more accurate learners get a larger say.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),    # "decision stump" — classic
    n_estimators=100,
    learning_rate=1.0,
    random_state=42,
).fit(X_tr, y_tr)

print("Test:", ada.score(X_te, y_te))
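The reweighting step can be sketched by hand. This is the textbook discrete AdaBoost update for labels in {-1, +1}, not scikit-learn's internal code; the helper name `adaboost_round` and the toy arrays are made up for illustration.

```python
import numpy as np


def adaboost_round(y_true, y_pred, weights):
    """One discrete-AdaBoost round for labels in {-1, +1}.

    Returns the learner's vote weight (alpha) and the renormalized sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()      # weighted error rate
    alpha = 0.5 * np.log((1.0 - err) / err)        # more accurate learner -> larger alpha
    # y_true * y_pred is +1 for correct rows and -1 for mistakes,
    # so correct rows shrink and mistakes grow
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / new_weights.sum()


y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])    # only the last sample is misclassified
w0 = np.full(5, 0.2)                     # start from uniform weights

alpha, w1 = adaboost_round(y_true, y_pred, w0)
print("alpha:", alpha)
print("new weights:", w1)
```

After one round the single misclassified sample carries half of the total weight, so the next learner is pushed hard toward getting it right. The same mechanism is what makes AdaBoost fragile when that sample is simply mislabeled.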

Pros: Simple, often effective on clean data. Cons: Very sensitive to noisy labels / outliers (it keeps upweighting them).

AdaBoost is now mostly of historical interest: gradient boosting matches or outperforms it on most modern benchmarks.

5. Gradient Boosting (the modern workhorse)

Instead of upweighting wrong samples, fit each new tree to the pseudo-residuals (the negative gradient of the loss) of the current ensemble. For squared error these are the ordinary residuals; the same recipe generalizes to any differentiable loss.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    subsample=1.0,
    random_state=42,
).fit(X_tr, y_tr)

print("Test:", gb.score(X_te, y_te))
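To see what the library does under the hood, here is a hand-rolled sketch of gradient boosting for squared-error regression, where the negative gradient is simply the residual. It is a teaching toy under stated assumptions (synthetic sine data, depth-2 trees, 100 rounds), not how `GradientBoostingClassifier` is implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y_demo, y_demo.mean())      # round 0: predict the mean
mse_start = np.mean((y_demo - pred) ** 2)

trees = []
for _ in range(100):
    residual = y_demo - pred                    # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    pred += learning_rate * tree.predict(X_demo)   # shrunken step toward the residuals
    trees.append(tree)

mse_end = np.mean((y_demo - pred) ** 2)
print(f"train MSE: {mse_start:.4f} -> {mse_end:.4f} after {len(trees)} trees")
```

Each tree is weak on its own; the ensemble is the initial mean plus the sum of 100 small corrections. Swapping the residual line for the negative gradient of another loss is what turns this into boosting for classification or ranking.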

The two hyperparameters that matter most:

learning_rate (shrinkage) — smaller = more conservative steps. Smaller + more trees usually wins. Typical: 0.01–0.1.
n_estimators — more is better up to a point. Pair with early stopping if available.
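scikit-learn's own gradient boosting supports early stopping through `n_iter_no_change` and `validation_fraction`, which hold out part of the training data internally. A small sketch on the same breast-cancer dataset (the specific values below are illustrative, not tuned):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_es, y_es = load_breast_cancer(return_X_y=True)
X_es_tr, X_es_te, y_es_tr, y_es_te = train_test_split(
    X_es, y_es, stratify=y_es, random_state=42
)

gb_es = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound; early stopping decides the real count
    learning_rate=0.05,
    validation_fraction=0.2,    # carved out of the training data, never the test set
    n_iter_no_change=20,        # stop after 20 rounds without validation improvement
    random_state=42,
).fit(X_es_tr, y_es_tr)

print("Trees actually fitted:", gb_es.n_estimators_)
print("Test accuracy:", gb_es.score(X_es_te, y_es_te))
```

The fitted attribute `n_estimators_` reports how many rounds were kept, which is usually far below the upper bound.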

For modern production, use XGBoost or LightGBM — they're vastly faster, handle NaN natively, and routinely top competitive benchmarks:

# pip install xgboost
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=1,
    reg_alpha=0,
    reg_lambda=1,
    objective="binary:logistic",
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42,
    n_jobs=-1,
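)
# Early-stop on a validation split carved out of the training data;
# reusing the test set here would leak it into model selection.
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, stratify=y_tr, random_state=42)
xgb.fit(X_fit, y_fit, eval_set=[(X_val, y_val)], verbose=False)
print("Best iter:", xgb.best_iteration)
print("Test acc :", xgb.score(X_te, y_te))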

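# pip install lightgbm (faster, often as accurate)
from lightgbm import LGBMClassifier, early_stopping

# Hold out a validation split for early stopping (not the test set)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, stratify=y_tr, random_state=42)
lgbm = LGBMClassifier(
    n_estimators=500, learning_rate=0.05, num_leaves=31,
    subsample=0.8, subsample_freq=1,     # row subsampling is ignored unless subsample_freq > 0
    colsample_bytree=0.8, random_state=42,
).fit(X_fit, y_fit, eval_set=[(X_val, y_val)], callbacks=[early_stopping(20)])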

A pragmatic tuning recipe for XGBoost / LightGBM:

Start with learning_rate=0.05, n_estimators=1000, early_stopping_rounds=50.
Tune max_depth ∈ {3, 5, 6, 8, 10}.
Tune subsample, colsample_bytree ∈ {0.6, 0.8, 1.0}.
Tune min_child_weight / reg_lambda for regularization.
Final pass: reduce learning_rate to 0.01 and let early-stopping pick n_estimators.

6. Stacking

Train several base models, use their predictions as features for a "meta" model that learns the best combination.

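from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("rf",  RandomForestClassifier(n_estimators=200, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=200, learning_rate=0.05, random_state=42)),
        # Unlike the trees, the SVM is scale-sensitive, so standardize inside its own pipeline
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),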
    ],
    final_estimator=LogisticRegression(),
    cv=5,                          # CV to generate out-of-fold base predictions
    n_jobs=-1,
)
stack.fit(X_tr, y_tr)
print("Stacked test:", stack.score(X_te, y_te))

Blending is the simpler cousin: train base models, predict on a holdout set, train the meta-model on those predictions (no CV). Faster, slightly weaker.
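A minimal blending sketch, assuming two base models (a random forest and a scaled logistic regression) and a 70/30 split of the training data; these choices and all variable names are illustrative, not a recommended configuration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Xb, yb = load_breast_cancer(return_X_y=True)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(Xb, yb, stratify=yb, random_state=42)

# Split the training data again: base models fit on one part,
# the meta-model fits on base-model predictions for the other part
Xb_base, Xb_hold, yb_base, yb_hold = train_test_split(
    Xb_tr, yb_tr, test_size=0.3, stratify=yb_tr, random_state=42
)

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=42),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
]
for m in base_models:
    m.fit(Xb_base, yb_base)


def meta_features(X):
    """One column per base model: its predicted positive-class probability."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])


meta = LogisticRegression().fit(meta_features(Xb_hold), yb_hold)
blend_acc = meta.score(meta_features(Xb_te), yb_te)
print("Blended test accuracy:", blend_acc)
```

The cost of this simplicity is data: the base models never see the holdout rows and the meta-model sees only those rows, whereas `StackingClassifier(cv=...)` lets every training row contribute to both stages.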

Use stacking when:

Base models are diverse (different families: linear + tree + SVM).
You can afford the compute (each base model + meta).
You've hit a plateau with single-model tuning.

In practice, a well-tuned XGBoost or LightGBM rarely needs stacking for tabular data.

7. Feature importance: interpret what the ensemble learned

import pandas as pd
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
imp.head(15).plot.barh()

# For XGBoost / LightGBM — more options
xgb.feature_importances_                              # sklearn wrapper: normalized "gain" by default for tree boosters
xgb.get_booster().get_score(importance_type="gain")   # raw per-feature gain (get_score itself defaults to "weight")

# Model-agnostic permutation importance — most reliable
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)

For deeper interpretability: use SHAP values (pip install shap).

8. Common pitfalls

❗ Treating Random Forest like a black box. Inspect feature importances. A single feature with implausibly dominant importance often reveals data leakage (a target-correlated feature that won't exist in production).
❗ Using learning_rate=0.1 and n_estimators=10000 on XGBoost without early stopping. Wastes hours overfitting. Use early_stopping_rounds.
❗ Scaling features for tree models. Pointless. Trees are scale-invariant.
❗ One-hot encoding high-cardinality features for boosted trees. Bloats memory. Both libraries can split on categorical features directly (LightGBM via categorical_feature=, XGBoost via enable_categorical=True with pandas category dtype).
❗ AdaBoost on noisy labels. It amplifies misclassified samples — if some are mislabeled, it focuses there. Bad fit for noisy datasets.
❗ Stacking without out-of-fold predictions. Using base-model train predictions to fit the meta-model leaks; the meta sees data the base models memorized. StackingClassifier(cv=...) handles this; manual blending must use a real holdout.
❗ Comparing a tree ensemble to a linear baseline on identical preprocessing. Ensembles tolerate raw, unscaled, unengineered features. Linear models need scaling + feature engineering. Compare both at their best, not their worst.
❗ Trusting feature_importances_ (built-in) too much. Biased toward high-cardinality features. Use permutation importance or SHAP for serious interpretation.
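The scaling pitfall above is easy to check empirically. The sketch below fits the same forest (same seed) on raw and standardized versions of a synthetic dataset and compares predictions; the dataset, the deliberately mismatched feature scales, and the variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X_s, y_s = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=0
)
X_s = X_s * np.array([1, 10, 100, 1_000, 10_000])   # deliberately mismatched feature scales
X_s_tr, X_s_te, y_s_tr = X_s[:300], X_s[300:], y_s[:300]

scaler = StandardScaler().fit(X_s_tr)

# Identical hyperparameters and seed; only the feature scaling differs
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_s_tr, y_s_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_s_tr), y_s_tr
)

agreement = np.mean(
    rf_raw.predict(X_s_te) == rf_scaled.predict(scaler.transform(X_s_te))
)
print("Share of identical test predictions:", agreement)
```

Standardization is a monotonic per-feature transform, so every candidate split partitions the rows the same way before and after scaling. Apart from floating-point edge cases, the two forests should make the same decisions.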

9. When to use what

Task	First-try model	If accuracy matters
Tabular classification	`RandomForestClassifier`	`LGBMClassifier` / `XGBClassifier`
Tabular regression	`RandomForestRegressor`	`LGBMRegressor` / `XGBRegressor`
Need probabilities	RF or LR	Calibrate XGB with `CalibratedClassifierCV`
Very few examples (<200)	Logistic / Random Forest	Stick with simple
100k+ examples, wide features	LightGBM (fastest, good defaults)	Tune with `early_stopping_rounds`
Need explainability	Random Forest + permutation importance / SHAP	Logistic regression instead
Highly imbalanced	`class_weight="balanced"` + adjust threshold	+ `scale_pos_weight` in XGBoost
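For the "need probabilities" row, a hedged sketch of calibration with `CalibratedClassifierCV`. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in so the example runs without xgboost installed; wrapping an `XGBClassifier` follows the same pattern.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

Xc, yc = load_breast_cancer(return_X_y=True)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, stratify=yc, random_state=42)

calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=42),   # stand-in for XGBClassifier / LGBMClassifier
    method="sigmoid",     # Platt scaling; "isotonic" generally wants more data
    cv=5,                 # calibrator is fit on out-of-fold predictions
).fit(Xc_tr, yc_tr)

proba = calibrated.predict_proba(Xc_te)[:, 1]
print("Brier score (lower is better):", brier_score_loss(yc_te, proba))
```

The Brier score is one way to judge probability quality; comparing it against the uncalibrated model on the same split shows whether calibration actually helped on a given dataset.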

10. Cheatsheet

# Bagging
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    BaggingClassifier, BaggingRegressor,
    ExtraTreesClassifier, ExtraTreesRegressor,    # "extreme" RF — even more random
)

# Boosting
from sklearn.ensemble import (
    AdaBoostClassifier, AdaBoostRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    HistGradientBoostingClassifier, HistGradientBoostingRegressor,  # XGBoost-like, sklearn-native
)
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from catboost import CatBoostClassifier, CatBoostRegressor

# Stacking
from sklearn.ensemble import StackingClassifier, StackingRegressor

# Random Forest defaults that usually work
RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=1,
    max_features="sqrt",
    n_jobs=-1, random_state=42,
    class_weight="balanced",
    oob_score=True,
)

# XGBoost with early stopping — production default
XGBClassifier(
    n_estimators=2000, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    early_stopping_rounds=50,
    eval_metric="logloss", n_jobs=-1, random_state=42,
).fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)   # X_val / y_val: a validation split, not the test set

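# LightGBM: often as good, much faster
import lightgbm as lgb
LGBMClassifier(
    n_estimators=2000, learning_rate=0.05, num_leaves=31,
    subsample=0.8, subsample_freq=1,     # row subsampling is ignored unless subsample_freq > 0
    colsample_bytree=0.8, random_state=42,
).fit(X_tr, y_tr, eval_set=[(X_val, y_val)],   # validation split, not the test set
      callbacks=[lgb.early_stopping(50)])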

# Permutation importance (model-agnostic, more reliable than .feature_importances_)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

# SHAP — gold standard for interpretation
# pip install shap
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)
shap.summary_plot(shap_values, X_sample)

11. Q&A: recall test

Q: Bagging vs Boosting in one sentence? A: Bagging trains many models in parallel on bootstraps and averages (reduces variance). Boosting trains sequentially, each fixing the previous one's errors (reduces bias).
Q: Two most important XGBoost hyperparameters? A: learning_rate (shrinkage) and n_estimators (number of boosting rounds). Pair small learning_rate with more estimators and early stopping.
Q: Why don't tree-based ensembles need feature scaling? A: Decision trees split on per-feature thresholds. Rescaling a feature moves the threshold but leaves the resulting partition of the samples unchanged, so scaling has no effect on the model.
Q: What's oob_score in Random Forest? A: Out-of-bag score. Each tree's bootstrap sample contains only ~63% of the distinct training rows; the remaining ~37% form a "free" validation set for that tree. Aggregated across trees, this gives an honest performance estimate without explicit CV.
Q: When does stacking help most? A: When base models are diverse and individually strong but make different mistakes. The meta-model learns when to trust each. Often diminishing returns vs a well-tuned single GBM.
Q: Why is feature_importances_ biased? A: It favors features with many possible split values (high cardinality). Use permutation importance or SHAP for a fairer view.
Q: RandomForest or XGBoost for a new project? A: Try RandomForest first — almost no tuning needed, robust defaults. Then XGBoost or LightGBM for the final 1-3% accuracy improvement.
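The caution about feature_importances_ in the answers above can be illustrated by appending a column of pure noise and comparing the two importance measures. The `noise` column is a hypothetical addition for this demonstration; exact numbers will vary with the seed.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

Xn, yn = load_breast_cancer(return_X_y=True, as_frame=True)
# Hypothetical column with no relationship to the target
Xn = Xn.assign(noise=np.random.default_rng(0).normal(size=len(Xn)))
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(Xn, yn, stratify=yn, random_state=42)

rf_n = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1).fit(Xn_tr, yn_tr)

# Impurity-based importance is computed from the training-time splits
impurity = pd.Series(rf_n.feature_importances_, index=Xn.columns)

# Permutation importance is measured on held-out data
perm = permutation_importance(rf_n, Xn_te, yn_te, n_repeats=10, random_state=42)
permuted = pd.Series(perm.importances_mean, index=Xn.columns)

print("impurity importance of noise   :", impurity["noise"])
print("permutation importance of noise:", permuted["noise"])
```

Fully grown trees will split on the noise column somewhere, so it picks up a small positive impurity importance even though it carries no signal. Permutation importance on held-out rows should hover around zero for the same column.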

Practice

What does this print?

Expected: 100

from sklearn.ensemble import RandomForestClassifier
print(RandomForestClassifier().n_estimators)    # default: 100 trees

Set random_state for reproducibility in a Random Forest

Expected: True

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(random_state=0)
rf1 = RandomForestClassifier().fit(X, y)              # bug: no random_state — different result every run
rf2 = RandomForestClassifier().fit(X, y)
print((rf1.predict(X) == rf2.predict(X)).all())

Quiz: Quick check

What you remember

Q1. How does a Random Forest reduce overfitting compared to a single tree?

By training many trees on bootstrapped samples and averaging their predictions
By using deeper trees
By regularizing the loss function
By using fewer features

Why: Each tree overfits in a slightly different way. Averaging cancels out the random noise of individual trees while preserving the signal. More trees = more averaging = less overfitting (with diminishing returns after ~100-500).

Q2. What's the difference between bagging (Random Forest) and boosting (XGBoost)?

Bagging trains trees independently in parallel; boosting trains them sequentially, each correcting the previous one's errors
No difference
Bagging is for classification, boosting for regression
Boosting uses fewer trees

Why: Bagging averages independent estimates. Boosting builds an ensemble where each new learner focuses on the residuals of the previous ones. Boosting usually achieves slightly higher accuracy but is more prone to overfitting and harder to tune.

Q3. Why is Gradient Boosting (XGBoost/LightGBM) so popular for tabular data?

Often state-of-the-art accuracy on tabular data, with built-in handling for missing values and mixed dtypes
It's the fastest model
It doesn't need preprocessing
It's interpretable

Why: Kaggle winners use it for a reason — gradient boosting consistently produces top results on structured data. Native NaN handling, feature importance reporting, and gradient-based training make it the go-to.

Common doubts

Random Forest vs XGBoost — when does each win?

Random Forest wins when you want zero tuning, fast prototyping, and robustness — it just works. XGBoost/LightGBM wins when you need the last 1-3% accuracy and have time to tune. For most production work, either is fine; the difference is often less than the noise in your data.

How many trees should I use?

Start with 100. More trees = better accuracy with diminishing returns. The cost: slower predictions. For Random Forest, beyond ~500 trees usually doesn't help. For boosting, use early stopping on a validation set rather than picking a fixed number.
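One way to find the point of diminishing returns for a Random Forest is to grow it incrementally with `warm_start=True` and watch the out-of-bag score. The tree counts below are arbitrary checkpoints chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

Xw, yw = load_breast_cancer(return_X_y=True)
rf_w = RandomForestClassifier(
    warm_start=True,     # keep existing trees and only add new ones on each fit
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)

oob_by_size = {}
for n_trees in (50, 100, 200, 400):
    rf_w.set_params(n_estimators=n_trees)
    rf_w.fit(Xw, yw)                       # trains only the additional trees
    oob_by_size[n_trees] = rf_w.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"{n_trees:>4} trees -> OOB accuracy {score:.4f}")
```

Once the OOB score stops moving between checkpoints, extra trees mostly add prediction latency.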

Why is my Random Forest slow to predict?

Because each prediction must traverse hundreds of trees. Speed up by: (1) reducing n_estimators after tuning, (2) using n_jobs=-1 for parallel prediction, (3) for production, consider serving with treelite or ONNX which compile trees to efficient code.