Logistic Regression & Classification Metrics
1. Why this matters
Logistic regression is:
- The default first model for most classification problems: fast, interpretable, and usually reasonably well calibrated.
- The building block behind many other classifiers: a neural network classification head with a sigmoid or softmax output and gradient boosting trained on log-loss both optimize the same objective.
And metrics: getting these wrong silently destroys real-world ML. A spam classifier that labels every email "not spam" scores 99% accuracy when only 1% of mail is spam, yet it never catches a single spam message.
2. Mental model
linear part: z = β₀ + β₁x₁ + ... + βₚxₚ
sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
output: P(y=1 | x) = σ(z)
decision: ŷ = 1 if P > 0.5 else 0 (threshold is tunable!)
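The sigmoid squashes any real z into (0, 1): large negative z gives a probability near 0, z = 0 gives exactly 0.5, and large positive z gives a probability near 1.
Training picks β to minimize log-loss (binary cross-entropy): L(β) = −(1/n) Σᵢ [yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)], where pᵢ = σ(zᵢ).
No closed form, so it is solved iteratively: gradient descent or a variant (sklearn's default solver, lbfgs, is a quasi-Newton method).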
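To make the training step concrete, here is a minimal NumPy sketch of plain batch gradient descent on the mean log-loss, whose gradient is Xᵀ(p − y)/n. The function names, the toy data, the learning rate, and the iteration count are all illustrative assumptions; sklearn's solvers are more sophisticated and add regularization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_value(y, p, eps=1e-12):
    # Clip to avoid log(0) when a probability saturates.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Unregularized logistic regression via batch gradient descent (illustrative)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend a column of ones for the intercept β₀
    beta = np.zeros(d + 1)
    losses = []
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)            # current P(y=1 | x)
        losses.append(log_loss_value(y, p))
        grad = Xb.T @ (p - y) / n         # gradient of the mean log-loss
        beta -= lr * grad
    return beta, losses

# Toy data drawn from a known model (assumed values, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_beta = np.array([0.5, 2.0, -1.0])
p_true = sigmoid(true_beta[0] + X_toy @ true_beta[1:])
y_toy = (rng.random(200) < p_true).astype(float)

beta_hat, losses = fit_logistic_gd(X_toy, y_toy)
print("estimated beta:", np.round(beta_hat, 2))
print("loss at start / end:", round(losses[0], 4), round(losses[-1], 4))
```

With β initialized to zero every prediction starts at 0.5, so the first loss equals log 2; the loss then falls as β moves toward the generating coefficients.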
3. Architecture / Flow
flowchart LR
X[X features] --> S[Scale]
S --> L[β₀ + β·X]
L --> SIG[σ z]
SIG --> P[Probability 0-1]
P --> T{> threshold?}
T -->|yes| C1[class 1]
T -->|no| C0[class 0]
4. Binary classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(
        C=1.0,              # inverse of α: smaller = more reg
        penalty="l2",       # "l1", "l2", "elasticnet", or None (recent sklearn versions no longer accept the string "none")
        solver="lbfgs",     # "lbfgs", "liblinear", "saga"; of these only "saga" handles every penalty
        max_iter=1000,
        class_weight=None,  # "balanced" for imbalanced data
    )),
]).fit(X_tr, y_tr)
print("Accuracy:", pipe.score(X_te, y_te))
print(classification_report(y_te, pipe.predict(X_te)))
print("ROC-AUC :", roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))
C is the inverse regularization strength (sklearn quirk): smaller C = stronger penalty. Default C=1.0.
predict_proba(X) returns probabilities per class — essential for thresholding and ROC.
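The link between the linear score, the probability, and the predicted label can be checked directly. This is a small sketch that refits a default pipeline on the same dataset; the variable names are illustrative, and the sigmoid identity shown is for the binary case with default settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

proba = model.predict_proba(X)        # one column per class, ordered as model.classes_
z = model.decision_function(X)        # the linear part β₀ + β·x for each row
manual_p1 = 1.0 / (1.0 + np.exp(-z))  # sigmoid applied by hand

print("shape:", proba.shape)
print("classes:", model.classes_)
print("column 1 equals sigmoid(z):", np.allclose(proba[:, 1], manual_p1))

# predict() is equivalent to thresholding the positive-class probability at 0.5.
agreement = np.mean(model.predict(X) == (proba[:, 1] > 0.5).astype(int))
print("agreement with 0.5 threshold:", agreement)
```

Each row of the probability array sums to 1, so in the binary case the first column is simply one minus the second.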
5. Multi-class
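Two strategies, both available in sklearn:
# Multinomial (softmax): one joint model, usually better.
# This is what LogisticRegression fits for 3+ classes with the default lbfgs solver.
LogisticRegression(solver="lbfgs")
# One-vs-Rest (OvR): K binary classifiers, one per class
from sklearn.multiclass import OneVsRestClassifier
OneVsRestClassifier(LogisticRegression())
# Older code selects these with LogisticRegression(multi_class="ovr") or multi_class="multinomial";
# that parameter is deprecated since sklearn 1.5.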
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)
clf.predict_proba(X[:1]) # → [[p0, p1, p2]]
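For the multinomial case, the three probabilities come from a softmax over one linear score per class. The sketch below checks that by hand on iris, assuming the default lbfgs solver; the `softmax` array is computed manually and is not an sklearn API.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)

scores = clf.decision_function(X)                      # shape (150, 3): one linear score per class
shifted = scores - scores.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

proba = clf.predict_proba(X)
print("matches manual softmax:", np.allclose(proba, softmax))
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))

# predict() returns the class with the highest probability.
print("predict is argmax:", np.array_equal(clf.predict(X), clf.classes_[proba.argmax(axis=1)]))
```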
6. Classification metrics: the full table
For binary problems with the confusion matrix:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | TP | FN |
| Actual 0 | FP | TN |
| Metric | Formula | What it answers |
|---|---|---|
| Accuracy | (TP+TN) / total | "What fraction were correct?" — misleading on imbalanced data |
| Precision | TP / (TP+FP) | "When the model says positive, how often is it right?" |
| Recall (Sensitivity, TPR) | TP / (TP+FN) | "Of all real positives, how many did we catch?" |
| Specificity (TNR) | TN / (TN+FP) | "Of all real negatives, how many did we correctly identify?" |
| F1 | 2·P·R / (P+R) | Harmonic mean of precision and recall |
| ROC-AUC | area under TPR vs FPR curve | "How well does the model RANK positives above negatives?" — threshold-independent |
| Log-loss | −(1/n) Σ [y·log(p) + (1−y)·log(1−p)] | Penalizes confident-wrong probabilities |
| PR-AUC | area under precision-recall | Better than ROC-AUC for extreme imbalance |
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    log_loss,
)
y_pred = pipe.predict(X_te)
y_proba = pipe.predict_proba(X_te)[:, 1]
print("Acc :", accuracy_score(y_te, y_pred))
print("Prec:", precision_score(y_te, y_pred))
print("Rec :", recall_score(y_te, y_pred))
print("F1 :", f1_score(y_te, y_pred))
print("AUC :", roc_auc_score(y_te, y_proba))
print("LL :", log_loss(y_te, y_proba))
print(classification_report(y_te, y_pred))
print("Confusion:\n", confusion_matrix(y_te, y_pred))
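To connect the formulas in the table to the sklearn calls, the following sketch computes the metrics by hand from the four confusion-matrix counts. The ten labels are made up for illustration. Note that sklearn orders the matrix by label value, so for 0/1 labels it is laid out as [[TN, FP], [FN, TP]], the reverse of the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] in row order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f}")

# The hand-computed values agree with sklearn's helpers.
print(np.isclose(precision, precision_score(y_true, y_pred)),
      np.isclose(recall, recall_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)))
```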
7. Which metric should you pick?
Decision tree:
flowchart TD
A{Class balance?} -->|roughly 50/50| AC[Accuracy or F1 OK]
A -->|imbalanced| B{Cost asymmetry?}
B -->|FP and FN equally bad| F1[F1]
B -->|FP much worse e.g., spam filter| PR[Precision]
B -->|FN much worse e.g., cancer screening| RC[Recall]
B -->|need ranking quality| AUC[ROC-AUC<br/>or PR-AUC if heavy imbalance]
Concrete examples:
| Problem | Class balance | Right metric | Why |
|---|---|---|---|
| Cancer screening | imbalanced (rare) | Recall + PR-AUC | False negatives kill people |
| Spam filter | imbalanced | Precision | False positives lose legitimate mail |
| Search ranking | doesn't matter | ROC-AUC / NDCG | Ranking quality, not classification |
| Credit fraud | extreme imbalance | PR-AUC + Recall@k | Catch as many frauds as possible per investigation |
| General classifier | ~balanced | F1 + Accuracy | Cheap defaults |
8. The threshold matters
Default predict() uses 0.5. You should tune this based on the cost of FP vs FN:
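from sklearn.metrics import precision_recall_curve
import numpy as np
# NOTE: tune the threshold on a validation split; the test set is reused here only to keep the example short
prec, rec, thr = precision_recall_curve(y_te, y_proba)
# prec and rec have one more entry than thr (a final point at recall = 0), so drop it before indexing thr
f1s = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
# Pick threshold that maximizes F1
best_idx = np.argmax(f1s)
best_thr = thr[best_idx]
print(f"Best threshold: {best_thr:.3f}, F1: {f1s[best_idx]:.3f}")
# Apply
y_pred_new = (y_proba >= best_thr).astype(int)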
Visualize the trade-off:
import matplotlib.pyplot as plt
plt.plot(rec, prec)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.title("Precision-Recall curve")
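Maximizing F1 treats false positives and false negatives symmetrically. When their costs differ, one option is to pick the threshold that minimizes total cost directly. The sketch below assumes a hypothetical helper `pick_threshold_by_cost` and made-up validation scores; the cost values are placeholders to be replaced by real business costs.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_proba, cost_fp=1.0, cost_fn=1.0):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN.

    Candidate thresholds are the distinct predicted probabilities;
    a row is predicted positive when its probability is >= threshold.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    candidates = np.unique(y_proba)
    costs = []
    for t in candidates:
        pred = y_proba >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return float(candidates[best]), float(costs[best])

# Hypothetical validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

# Missing a positive is 5x worse than a false alarm: the threshold drops.
thr_fn_costly, cost_fn_costly = pick_threshold_by_cost(y_val, p_val, cost_fp=1, cost_fn=5)
# A false alarm is 5x worse than a miss: the threshold rises.
thr_fp_costly, cost_fp_costly = pick_threshold_by_cost(y_val, p_val, cost_fp=5, cost_fn=1)

print("FN costly ->", thr_fn_costly, cost_fn_costly)
print("FP costly ->", thr_fp_costly, cost_fp_costly)
```

On this toy data the FN-costly setting selects 0.4 and the FP-costly setting selects 0.7, which matches the intuition that expensive misses push the threshold down and expensive false alarms push it up.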
9. Imbalanced data: the cheats
from sklearn.model_selection import StratifiedKFold, train_test_split
# 1. class_weight="balanced": weights inversely proportional to class frequencies
LogisticRegression(class_weight="balanced")
# or custom:
LogisticRegression(class_weight={0: 1, 1: 5})
# 2. Stratified split (always: it keeps class proportions)
train_test_split(X, y, stratify=y, random_state=42)
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# 3. Resampling with the imbalanced-learn package
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
pipe = ImbPipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),  # synthesize minority samples (resampling runs during fit only)
    ("clf", LogisticRegression(max_iter=500)),
])
# 4. Right metrics: avoid plain accuracy. Use F1, PR-AUC, balanced accuracy.
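What class_weight="balanced" actually computes is n_samples / (n_classes * count_per_class). A short sketch on a made-up 95/5 label vector, checked against sklearn's compute_class_weight utility:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 95/5 label vector.
y = np.array([0] * 95 + [1] * 5)
classes = np.unique(y)

# The "balanced" heuristic, written out by hand.
manual_weights = len(y) / (len(classes) * np.bincount(y))

sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual :", manual_weights)
print("sklearn:", sk_weights)
```

Here the minority class gets weight 10 and the majority class about 0.53, so each class contributes the same total weight to the loss.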
10. Common pitfalls
- ❗ Reporting accuracy on imbalanced data. 95% accuracy on a 95/5 problem is the trivial "predict majority class" baseline.
- ❗ Confusing C with α. sklearn's C is the inverse of regularization strength. Smaller C = more reg.
- ❗ Using predict() when you should use predict_proba(). Probabilities give you threshold-tuning, ranking, and richer metrics.
- ❗ Forgetting stratify=y on train_test_split for imbalanced data. Random splits can leave a rare class under-represented in the test set, or missing from it entirely.
- ❗ No baseline. Always compare to "predict majority class"; it's frustratingly hard to beat sometimes.
- ❗ Calibration mismatch. Logistic regression is usually reasonably well calibrated. Random forests and SVMs often are not: their predict_proba outputs don't correspond to real probabilities. Wrap with CalibratedClassifierCV if you need calibrated probabilities.
- ❗ Ignoring class order in roc_auc_score. Pass predict_proba()[:, 1]: columns follow clf.classes_, so for 0/1 labels the second column is the probability of the positive class.
- ❗ Calling roc_auc_score on multi-class targets without the multi_class= argument. The default, multi_class="raise", raises an error; pass "ovr" or "ovo" explicitly together with the full predict_proba matrix.
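The first and fifth pitfalls can be seen in a few lines with sklearn's DummyClassifier, which ignores the features and always predicts the majority class. The 95/5 labels and random features below are synthetic and exist only to illustrate the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)   # synthetic 95/5 labels
X = rng.normal(size=(1000, 3))       # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_hat = baseline.predict(X)

acc = accuracy_score(y, y_hat)
bal_acc = balanced_accuracy_score(y, y_hat)
f1 = f1_score(y, y_hat, zero_division=0)  # no positives predicted, so F1 is defined as 0 here

print(f"accuracy={acc:.2f} balanced_accuracy={bal_acc:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.95 while balanced accuracy is 0.5 and F1 is 0, which is why any real model on this data should be judged against those last two numbers rather than against accuracy.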
11. When to use vs not use
| Use Logistic Regression when | Use something else when |
|---|---|
| Need calibrated probabilities | Only ranking quality (AUC) matters and raw scores are fine: try GBM |
| Need interpretability (coefficients) | Black-box accuracy is OK — try GBM |
| Linear separability or close to it | Strong feature interactions you can't enumerate |
| Want a fast, robust baseline | High-stakes accuracy on rich tabular data → XGBoost / LightGBM |
| Multi-class with 3–10 classes | 1000+ classes → softmax neural net or hierarchical models |
| Data fits in memory and batch training is fine | Streaming / online learning → SGDClassifier(loss="log_loss") with partial_fit |
12. Cheatsheet
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    confusion_matrix, classification_report,
    precision_recall_curve, roc_curve, ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
# Canonical pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
# Hyperparam tuning
grid = GridSearchCV(pipe, param_grid={
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l2"],
    # for L1/elasticnet: "clf__solver": ["saga"], "clf__l1_ratio": [0.5]
}, scoring="f1", cv=5).fit(X_tr, y_tr)
print(grid.best_params_)
# Stratified CV (essential for imbalanced data)
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Probabilities + custom threshold
# (GridSearchCV fits clones, so use the refitted best_estimator_; pipe itself is still unfitted)
best = grid.best_estimator_
proba = best.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)  # custom threshold
# Inspect confusion + report for the thresholded predictions
ConfusionMatrixDisplay.from_predictions(y_te, y_pred)
print(classification_report(y_te, y_pred))
13. Q&A: recall test
- Q: Why is accuracy a bad metric for imbalanced data? A: A classifier that always predicts the majority class achieves accuracy = majority proportion (e.g., 95% on a 95/5 problem) while doing nothing useful.
- Q: Precision vs Recall: which captures false alarms? A: Precision, TP / (TP+FP). Low precision = many false alarms. Use when FPs are costly (spam, ads).
- Q: When does ROC-AUC mislead? A: On extreme class imbalance (1:1000). With so many negatives, FPR stays small even when false positives far outnumber true positives, so the curve still looks good. Prefer PR-AUC.
- Q: What does C=0.01 vs C=100 do in sklearn LogisticRegression? A: C is the INVERSE of α. C=0.01 → strong regularization, simpler model. C=100 → weak regularization, potentially overfits.
- Q: Probabilities from predict_proba() for RandomForestClassifier: can you trust them? A: Not directly, because they are often poorly calibrated. Wrap with CalibratedClassifierCV(method="isotonic") if calibrated probabilities matter.
- Q: How do you change the classification threshold from 0.5? A: Skip predict(). Use proba = clf.predict_proba(X)[:, 1]; y_pred = (proba >= 0.3).astype(int).
Practice
Use F1 score (not accuracy) for a 95/5 imbalanced classification
Expected: True
Quiz: Quick check
What you remember
Q1. Which metric is misleading on an imbalanced dataset (e.g., 99% negative)?
- Accuracy
- Precision
- Recall
- F1
Why: A model that always predicts "negative" gets 99% accuracy with zero useful predictions. Use precision/recall/F1 or ROC-AUC for imbalanced problems.
Q2. What does predict_proba return?
- The predicted class
- An (n_samples, n_classes) array of probabilities per row
- The model's coefficients
- An error score
Why: Useful for picking a custom threshold instead of the default 0.5. clf.predict() returns the argmax of probabilities; predict_proba() returns the raw probabilities.
Q3. What's the difference between precision and recall?
- Precision = "of the rows I predicted positive, how many really are?"; Recall = "of the truly positive rows, how many did I catch?"
- They're identical
- Precision is for regression
- Recall ignores false positives
Why: Precision penalizes false positives. Recall penalizes false negatives. Trade-off — they usually move in opposite directions as you change the threshold. F1 balances both.
Common doubts
When should I use ROC-AUC vs PR-AUC?
ROC-AUC for balanced datasets. PR-AUC (precision-recall AUC) for imbalanced datasets where the positive class is rare. ROC-AUC can look misleadingly good when negatives massively dominate. PR-AUC focuses on the positive class.
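The gap between the two metrics is easy to reproduce on synthetic scores. The sketch below assumes a 1:1000 class ratio and Gaussian score distributions chosen purely for illustration; average_precision_score is used as the PR-AUC summary.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 50_000, 50  # roughly 1:1000 imbalance (assumed for illustration)

# Positives score higher on average, but the two distributions overlap.
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)
scores_pos = rng.normal(loc=2.0, scale=1.0, size=n_pos)

y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores = np.concatenate([scores_neg, scores_pos])

roc_auc = roc_auc_score(y, scores)
pr_auc = average_precision_score(y, scores)
prevalence = n_pos / (n_pos + n_neg)  # what a random ranker scores on PR-AUC

print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  prevalence={prevalence:.4f}")
```

The same ranking looks strong on ROC-AUC and much weaker on PR-AUC, because even a small false positive rate over 50,000 negatives produces far more false positives than there are true positives.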
How do I handle class imbalance?
Several options: (1) class_weight="balanced" in the model, (2) resampling with imblearn (SMOTE for upsampling, RandomUnderSampler for downsampling), (3) threshold tuning — pick a probability threshold optimized for your business metric, not the default 0.5.
What's the threshold I should use for classification?
Depends on the cost of false positives vs false negatives. Bank fraud → low threshold (catch as many frauds as possible). Spam filter → high threshold (don't flag legitimate emails). Pick by computing precision/recall curves on a validation set.