Ensembles: Random Forest, AdaBoost, Gradient Boosting, Stacking
1. Why this matters
A single decision tree overfits badly. But:
- Average 100 randomized trees → Random Forest (variance reduction).
- Train trees sequentially, each fixing the previous one's errors → Boosting (bias reduction).
- Train several different models, learn how to weight them → Stacking.
On most tabular ML benchmarks, ensembles beat single models. Tree-based ensembles are robust, accept raw features without scaling, handle missing values natively in some implementations, and are a fairly safe production default.
2. Mental model
flowchart TB
subgraph Bagging [Bagging — parallel, average]
D1[Bootstrap 1] --> T1[Tree 1]
D2[Bootstrap 2] --> T2[Tree 2]
D3[Bootstrap N] --> TN[Tree N]
T1 --> AVG[Average / Majority Vote]
T2 --> AVG
TN --> AVG
end
subgraph Boosting [Boosting — sequential, residuals]
M1[Model 1] -->|errors| M2[Model 2 fixes M1] -->|errors| M3[Model 3 fixes M2] --> SUM[Weighted sum]
end
subgraph Stacking [Stacking — meta-learner]
A[Model A] --> META[Meta-model<br/>learns weights]
B[Model B] --> META
C[Model C] --> META
META --> P[Final prediction]
end
3. Random Forest (Bagging)
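Train n_estimators decision trees, each on a bootstrap sample of the rows, with every split choosing from a random subset of the features. Predictions are averaged (regression) or combined by majority vote (classification); scikit-learn implements the vote by averaging the trees' predicted class probabilities.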
Why it works: individual trees are high-variance, low-bias. Averaging many uncorrelated trees cuts variance. Random feature subsets at each split keep the trees decorrelated.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
rf = RandomForestClassifier(
n_estimators=300,
max_depth=None, # full depth — RF rarely overfits in depth
min_samples_split=2,
min_samples_leaf=1,
max_features="sqrt", # √p for classification, p/3 for regression
n_jobs=-1,
random_state=42,
class_weight="balanced", # for imbalanced
).fit(X_tr, y_tr)
print("Test accuracy:", rf.score(X_te, y_te))
# Feature importance — built in
import pandas as pd
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp.head(10))
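The variance-reduction argument can be checked empirically. The sketch below assumes the same breast-cancer dataset and an arbitrary 5-fold split, and compares one fully grown tree with a forest of randomized trees. Exact scores depend on the seed and the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# One fully grown tree: low bias, high variance
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

# 300 randomized trees whose predictions are averaged
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1), X, y, cv=5
)

print(f"Single tree  : {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

On this dataset the forest should score noticeably higher than the single tree, which is consistent with averaging away variance.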
Out-of-bag score: a free validation estimate computed from the samples left out of each tree's bootstrap:
rf = RandomForestClassifier(n_estimators=300, oob_score=True, n_jobs=-1).fit(X, y)
print("OOB score:", rf.oob_score_)
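The out-of-bag set exists because sampling n rows with replacement leaves some rows out. A small simulation, assuming nothing beyond NumPy and an arbitrary n of 10,000, shows that roughly 1 - 1/e (about 63%) of distinct rows land in a bootstrap sample and the rest are out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# A bootstrap sample draws n_rows indices with replacement
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)
in_bag_fraction = np.unique(bootstrap_idx).size / n_rows

print(f"in-bag fraction    : {in_bag_fraction:.3f}")
print(f"out-of-bag fraction: {1 - in_bag_fraction:.3f}")
print(f"theory (1 - 1/e)   : {1 - np.exp(-1):.3f}")
```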
Key hyperparameters:
| Param | Effect |
|---|---|
| n_estimators | More = better, slower. 100-500 typical. Diminishing returns past ~300. |
| max_depth | Deeper trees = lower bias, higher variance. Often leave as None. |
| min_samples_leaf | Increase (e.g. 5) to regularize / smooth predictions. |
| max_features | "sqrt" for classification, "log2" or 1.0 for regression. Smaller = more diversity. |
| class_weight="balanced" | For imbalanced data. |
4. AdaBoost (Adaptive Boosting)
Train a weak learner (default: a shallow tree), find the samples it got wrong, upweight them, train the next learner on the reweighted data, and repeat. The final prediction is a weighted vote of all learners.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1), # "decision stump" — classic
n_estimators=100,
learning_rate=1.0,
random_state=42,
).fit(X_tr, y_tr)
print("Test:", ada.score(X_te, y_te))
Pros: Simple, often effective on clean data. Cons: Very sensitive to noisy labels / outliers (it keeps upweighting them).
Today AdaBoost is mostly of historical interest: gradient boosting outperforms it on almost every modern benchmark.
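The reweighting loop can be written out by hand. This is a minimal sketch of classic discrete AdaBoost with decision stumps, assuming labels recoded to -1/+1 and an arbitrary 20 rounds; scikit-learn's AdaBoostClassifier differs in details, so treat it as an illustration of the idea rather than a reimplementation of the library.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = np.where(y01 == 1, 1, -1)           # recode labels to -1 / +1

n_samples = len(y)
w = np.full(n_samples, 1.0 / n_samples)  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this stump (weights sum to 1)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength: larger when error is lower

    # Upweight misclassified samples, downweight correct ones, renormalize
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(votes) == y)
print("Training accuracy of the hand-rolled ensemble:", round(train_acc, 3))
```

The weight update is also where the sensitivity to noisy labels comes from: a mislabeled sample keeps getting misclassified, so its weight keeps growing.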
5. Gradient Boosting (the modern workhorse)
Instead of upweighting wrong samples, fit each new tree to the negative gradient of the loss with respect to the current ensemble's predictions; for squared error this is exactly the residuals. This generalizes to any differentiable loss.
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(
n_estimators=200,
learning_rate=0.1,
max_depth=3,
subsample=1.0,
random_state=42,
).fit(X_tr, y_tr)
print("Test:", gb.score(X_te, y_te))
The two hyperparameters that matter most:
- learning_rate (shrinkage): smaller = more conservative steps. Smaller + more trees usually wins. Typical: 0.01–0.1.
- n_estimators: more is better up to a point. Pair with early stopping if available.
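To make the roles of the residuals, learning_rate, and the number of rounds concrete, here is a hand-rolled sketch of gradient boosting for squared-error regression. The synthetic sine data, the tree depth of 3, and the 100 rounds are illustrative assumptions, not settings from any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean())   # round 0: predict the mean
trees, mse_history = [], []

for _ in range(n_rounds):
    residuals = y - pred           # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # shrunken step toward the residuals
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"Train MSE after round 1  : {mse_history[0]:.4f}")
print(f"Train MSE after round {n_rounds}: {mse_history[-1]:.4f}")
```

A smaller learning_rate makes each step more cautious, which is why it usually needs to be paired with more rounds.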
For modern production, use XGBoost or LightGBM — they're vastly faster, handle NaN natively, and routinely top competitive benchmarks:
# pip install xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
min_child_weight=1,
reg_alpha=0,
reg_lambda=1,
objective="binary:logistic",
eval_metric="logloss",
early_stopping_rounds=20,
random_state=42,
n_jobs=-1,
).fit(
X_tr, y_tr,
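eval_set=[(X_te, y_te)], # demo shortcut: early stopping on the test split makes the reported test score optimistic; prefer a separate validation split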
verbose=False,
)
print("Best iter:", xgb.best_iteration)
print("Test acc :", xgb.score(X_te, y_te))
# pip install lightgbm — faster, often as accurate
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(
n_estimators=500, learning_rate=0.05, num_leaves=31,
subsample=0.8, subsample_freq=1, colsample_bytree=0.8, random_state=42, # LightGBM only applies row subsampling when subsample_freq > 0
).fit(X_tr, y_tr, eval_set=[(X_te, y_te)])
A pragmatic tuning recipe for XGBoost / LightGBM:
- Start with learning_rate=0.05, n_estimators=1000, early_stopping_rounds=50.
- Tune max_depth ∈ {3, 5, 6, 8, 10}.
- Tune subsample, colsample_bytree ∈ {0.6, 0.8, 1.0}.
- Tune min_child_weight / reg_lambda for regularization.
- Final pass: reduce learning_rate to 0.01 and let early stopping pick n_estimators.
6. Stacking
Train several base models, use their predictions as features for a "meta" model that learns the best combination.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
stack = StackingClassifier(
estimators=[
("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
("xgb", XGBClassifier(n_estimators=200, learning_rate=0.05, random_state=42)),
("svc", SVC(probability=True, random_state=42)),
],
final_estimator=LogisticRegression(),
cv=5, # CV to generate out-of-fold base predictions
n_jobs=-1,
)
stack.fit(X_tr, y_tr)
print("Stacked test:", stack.score(X_te, y_te))
Blending is the simpler cousin: train base models, predict on a holdout set, train the meta-model on those predictions (no CV). Faster, slightly weaker.
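A minimal blending sketch, assuming two base models (a random forest and a scaled logistic regression) and a 30% holdout carved from the training data. The helper name meta_features is hypothetical and only used for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Split the training data again: base models see X_base, the meta-model sees X_hold
X_base, X_hold, y_base, y_hold = train_test_split(
    X_tr, y_tr, test_size=0.3, stratify=y_tr, random_state=42
)

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=42),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
]
for model in base_models:
    model.fit(X_base, y_base)

def meta_features(X_part):
    """Stack each base model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in base_models])

# The meta-model is trained only on holdout predictions, never on X_base
meta = LogisticRegression().fit(meta_features(X_hold), y_hold)
blend_acc = meta.score(meta_features(X_te), y_te)
print("Blended test accuracy:", round(blend_acc, 3))
```

The cost of this simplicity is that the base models never see the holdout rows, which is one reason blending tends to be slightly weaker than cross-validated stacking.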
Use stacking when:
- Base models are diverse (different families: linear + tree + SVM).
- You can afford the compute (each base model + meta).
- You've hit a plateau with single-model tuning.
In practice, a well-tuned XGBoost or LightGBM rarely needs stacking for tabular data.
7. Feature importance: interpret what the ensemble learned
import pandas as pd
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
imp.head(15).plot.barh()
# For XGBoost / LightGBM: more options
xgb.feature_importances_ # sklearn wrapper: normalized, uses "gain" by default for tree boosters
xgb.get_booster().get_score(importance_type="gain") # raw per-feature gain; get_score itself defaults to "weight"
# Model-agnostic permutation importance — most reliable
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
For deeper interpretability: use SHAP values (pip install shap).
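One way to see why permutation importance is the safer default is to append a column of pure noise and compare what each method says about it. The sketch below assumes the breast-cancer data and a Gaussian noise column; the exact numbers will vary with the seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.column_stack([X, rng.normal(size=len(X))])  # last column is pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1).fit(X_tr, y_tr)

# Impurity-based importance is computed on training data, so noise still gets a share
impurity_noise = rf.feature_importances_[-1]

# Permutation importance is measured on held-out data
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_noise = perm.importances_mean[-1]

print(f"Impurity importance of the noise column   : {impurity_noise:.4f}")
print(f"Permutation importance of the noise column: {perm_noise:.4f}")
```

The impurity score for the noise column is small but strictly positive, while its permutation score should hover around zero.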
8. Common pitfalls
- ❗ Treating Random Forest like a black box. Inspect feature importances. Outliers in importance often reveal data leakage (target-correlated features that won't exist in production).
- ❗ Using learning_rate=0.1 and n_estimators=10000 on XGBoost without early stopping. Wastes hours overfitting. Use early_stopping_rounds.
- ❗ Scaling features for tree models. Pointless. Trees are scale-invariant.
- ❗ One-hot encoding high-cardinality features for boosted trees. Bloats memory. XGBoost / LightGBM accept ordinal / categorical features natively (LightGBM has categorical_feature=; XGBoost has enable_categorical=True).
- ❗ AdaBoost on noisy labels. It keeps upweighting misclassified samples, so if some are mislabeled, it focuses there. Bad fit for noisy datasets.
- ❗ Stacking without out-of-fold predictions. Using base-model train predictions to fit the meta-model leaks; the meta sees data the base models memorized. StackingClassifier(cv=...) handles this; manual blending must use a real holdout.
- ❗ Comparing tree-ensemble accuracy to a linear baseline using the same raw, unprocessed features. Ensembles tolerate raw, unscaled, unengineered features. Linear models need scaling + feature engineering. Compare both at their best, not their worst.
- ❗ Trusting feature_importances_ (built-in) too much. Biased toward high-cardinality features. Use permutation importance or SHAP for serious interpretation.
9. When to use what
| Task | First-try model | If accuracy matters |
|---|---|---|
| Tabular classification | RandomForestClassifier | LGBMClassifier / XGBClassifier |
| Tabular regression | RandomForestRegressor | LGBMRegressor / XGBRegressor |
| Need probabilities | RF or LR | Calibrate XGB with CalibratedClassifierCV |
| Very few examples (<200) | Logistic / Random Forest | Stick with simple |
| 100k+ examples, wide features | LightGBM (fastest, good defaults) | Tune with early_stopping_rounds |
| Need explainability | Random Forest + permutation importance / SHAP | Logistic regression instead |
| Highly imbalanced | class_weight="balanced" + adjust threshold | + scale_pos_weight in XGBoost |
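For the "need probabilities" row, calibration wraps a boosted model so that its predicted probabilities better match observed frequencies. The sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and assumes sigmoid (Platt) calibration with 5 folds. Whether calibration actually helps depends on the dataset, so compare the Brier scores rather than assuming an improvement.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Uncalibrated boosted model
raw = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# Same model family, wrapped so each fold's held-out predictions fit the calibrator
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=42), method="sigmoid", cv=5
).fit(X_tr, y_tr)

# Brier score: mean squared error of predicted probabilities (lower is better)
brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"Brier score, raw       : {brier_raw:.4f}")
print(f"Brier score, calibrated: {brier_cal:.4f}")
```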
10. Cheatsheet
# Bagging
from sklearn.ensemble import (
RandomForestClassifier, RandomForestRegressor,
BaggingClassifier, BaggingRegressor,
ExtraTreesClassifier, ExtraTreesRegressor, # "extreme" RF — even more random
)
# Boosting
from sklearn.ensemble import (
AdaBoostClassifier, AdaBoostRegressor,
GradientBoostingClassifier, GradientBoostingRegressor,
HistGradientBoostingClassifier, HistGradientBoostingRegressor, # XGBoost-like, sklearn-native
)
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
from catboost import CatBoostClassifier, CatBoostRegressor
# Stacking
from sklearn.ensemble import StackingClassifier, StackingRegressor
# Random Forest defaults that usually work
RandomForestClassifier(
n_estimators=300,
max_depth=None,
min_samples_leaf=1,
max_features="sqrt",
n_jobs=-1, random_state=42,
class_weight="balanced",
oob_score=True,
)
# XGBoost with early stopping — production default
XGBClassifier(
n_estimators=2000, learning_rate=0.05, max_depth=6,
subsample=0.8, colsample_bytree=0.8,
early_stopping_rounds=50,
eval_metric="logloss", n_jobs=-1, random_state=42,
).fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
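# LightGBM: often as good, much faster
import lightgbm as lgb # needed for the early_stopping callback below
LGBMClassifier(
n_estimators=2000, learning_rate=0.05, num_leaves=31,
subsample=0.8, subsample_freq=1, colsample_bytree=0.8, random_state=42, # subsample only takes effect when subsample_freq > 0
).fit(X_tr, y_tr, eval_set=[(X_te, y_te)], callbacks=[lgb.early_stopping(50)])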
# Permutation importance (model-agnostic, more reliable than .feature_importances_)
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
# SHAP — gold standard for interpretation
# pip install shap
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)
shap.summary_plot(shap_values, X_sample)
11. Q&A: recall test
- Q: Bagging vs Boosting in one sentence? A: Bagging trains many models in parallel on bootstraps and averages (reduces variance). Boosting trains sequentially, each fixing the previous one's errors (reduces bias).
- Q: Two most important XGBoost hyperparameters? A: learning_rate (shrinkage) and n_estimators (number of boosting rounds). Pair a small learning_rate with more estimators and early stopping.
- Q: Why don't tree-based ensembles need feature scaling? A: Decision trees split on per-feature thresholds. Rescaling a feature moves the split point but leaves the resulting partition of the samples unchanged, so scaling is a no-op.
- Q: What's oob_score in Random Forest? A: Out-of-bag score. Each tree trains on a bootstrap sample that contains only ~63% of the distinct rows. The other ~37% form a "free" validation set per tree. Aggregated across trees, it gives an honest performance estimate without explicit CV.
- Q: When does stacking help most? A: When base models are diverse and individually strong but make different mistakes. The meta-model learns when to trust each. Often diminishing returns vs a well-tuned single GBM.
- Q: Why is feature_importances_ biased? A: It favors features with many possible split values (high cardinality). Use permutation importance or SHAP for a fairer view.
- Q: RandomForest or XGBoost for a new project? A: Try RandomForest first: almost no tuning needed, robust defaults. Then XGBoost or LightGBM for the final 1-3% accuracy improvement.
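The scale-invariance answer can be verified directly. This sketch assumes a synthetic dataset and an arbitrary positive rescaling factor; a power of two is used so the rescaling is exact in floating point. The same tree is fit on raw and rescaled features and the predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scale = 1024.0  # arbitrary positive factor, chosen as a power of two

# Same seed, same data ordering: only the feature scale differs
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_tr * scale, y_tr)

same = np.array_equal(tree_raw.predict(X_te), tree_scaled.predict(X_te * scale))
print("Identical predictions on the test split:", same)
```

The thresholds stored in the two trees differ by the scaling factor, but the rows that fall on each side of every split are the same.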
Practice
What does this print?
Expected: 100
Set random_state for reproducibility in a Random Forest
Expected: True
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(random_state=0)
rf1 = RandomForestClassifier().fit(X, y) # bug: no random_state — different result every run
rf2 = RandomForestClassifier().fit(X, y)
print((rf1.predict(X) == rf2.predict(X)).all())
Quiz: Quick check
What you remember
Q1. How does a Random Forest reduce overfitting compared to a single tree?
- By training many trees on bootstrapped samples and averaging their predictions
- By using deeper trees
- By regularizing the loss function
- By using fewer features
Why: Each tree overfits in a slightly different way. Averaging cancels out the random noise of individual trees while preserving the signal. More trees = more averaging = less overfitting (with diminishing returns after ~100-500).
Q2. What's the difference between bagging (Random Forest) and boosting (XGBoost)?
- Bagging trains trees independently in parallel; boosting trains them sequentially, each correcting the previous one's errors
- No difference
- Bagging is for classification, boosting for regression
- Boosting uses fewer trees
Why: Bagging averages independent estimates. Boosting builds an ensemble where each new learner focuses on the residuals of the previous ones. Boosting usually achieves slightly higher accuracy but is more prone to overfitting and harder to tune.
Q3. Why is Gradient Boosting (XGBoost/LightGBM) so popular for tabular data?
- Often state-of-the-art accuracy on tabular data, with built-in handling for missing values and mixed dtypes
- It's the fastest model
- It doesn't need preprocessing
- It's interpretable
Why: Kaggle winners use it for a reason — gradient boosting consistently produces top results on structured data. Native NaN handling, feature importance reporting, and gradient-based training make it the go-to.
Common doubts
Random Forest vs XGBoost — when does each win?
Random Forest wins when you want zero tuning, fast prototyping, and robustness — it just works. XGBoost/LightGBM wins when you need the last 1-3% accuracy and have time to tune. For most production work, either is fine; the difference is often less than the noise in your data.
How many trees should I use?
Start with 100. More trees = better accuracy with diminishing returns. The cost: slower predictions. For Random Forest, going beyond ~500 trees usually doesn't help. For boosting, use early stopping on a validation set rather than picking a fixed number.
Why is my Random Forest slow to predict?
Because each prediction must traverse hundreds of trees. Speed up by: (1) reducing n_estimators after tuning, (2) using n_jobs=-1 for parallel prediction, (3) for production, consider serving with treelite or ONNX which compile trees to efficient code.