Logistic Regression & Classification Metrics

1. Why this matters

Logistic regression is:

  • The default first model for most classification problems: fast, interpretable, and usually well calibrated.
  • The foundation under many other classifiers, such as neural-network classification heads and gradient boosting trained on log-loss.

And metrics: getting these wrong silently destroys real-world ML. If 1% of mail is spam, a classifier that labels every email "not spam" scores 99% accuracy and is completely useless.

2. Mental model

linear part:   z = β₀ + β₁x₁ + ... + βₚxₚ
sigmoid:       σ(z) = 1 / (1 + e⁻ᶻ)
output:        P(y=1 | x) = σ(z)
decision:      ŷ = 1 if P > 0.5 else 0    (threshold is tunable!)

The sigmoid squashes any real z into (0, 1):

        1.0  |          ____
             |       __/
        0.5  |    __/
             | __/
        0.0  |/______________
                  z=0        z→∞
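As a quick numeric check of the curve sketched above, here is a minimal NumPy sketch. The helper name sigmoid is ours for illustration, not a library function.

```python
import numpy as np

def sigmoid(z):
    """Map any real z into (0, 1). Illustrative helper, not a library function."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)   # rises monotonically with z; sigmoid(0) is exactly 0.5
```

Large negative z lands near 0, large positive z lands near 1, and z = 0 sits on the default 0.5 decision boundary.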

Training picks β to minimize log-loss (binary cross-entropy):

L = − mean(y log(σ(z)) + (1 - y) log(1 - σ(z)))

No closed form exists, so β is found by iterative optimization: gradient descent or a variant (scikit-learn's default solver is L-BFGS).
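To make the training step concrete, here is a minimal sketch of plain batch gradient descent on the log-loss above. The toy data, the step size lr, and the iteration count are assumptions chosen for illustration; this is not how scikit-learn fits the model internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs in 2-D (synthetic, for illustration only)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(+1.0, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])
Xb = np.hstack([np.ones((2 * n, 1)), X])       # prepend a column of 1s for the intercept β₀

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta):
    p = np.clip(sigmoid(Xb @ beta), 1e-12, 1 - 1e-12)   # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(3)
lr = 0.1                                        # hypothetical step size
losses = [log_loss(beta)]
for _ in range(500):
    grad = Xb.T @ (sigmoid(Xb @ beta) - y) / len(y)     # gradient of the mean log-loss
    beta -= lr * grad
    losses.append(log_loss(beta))

acc = np.mean((sigmoid(Xb @ beta) > 0.5) == y)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy: {acc:.3f}")
```

With β = 0 every prediction is 0.5, so the starting loss is log 2 ≈ 0.693; each step then moves β against the gradient and the loss falls.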

3. Architecture / Flow

flowchart LR
    X[X features] --> S[Scale]
    S --> L[β₀ + β·X]
    L --> SIG[σ z]
    SIG --> P[Probability 0-1]
    P --> T{> threshold?}
    T -->|yes| C1[class 1]
    T -->|no| C0[class 0]

4. Binary classification

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf",   LogisticRegression(
        C=1.0,                          # inverse of α — smaller = more reg
        penalty="l2",                   # "l1", "l2", "elasticnet", or None ("none" in sklearn < 1.2)
        solver="lbfgs",                 # "lbfgs", "liblinear", "saga"
        max_iter=1000,
        class_weight=None,              # "balanced" for imbalanced data
    )),
]).fit(X_tr, y_tr)

print("Accuracy:", pipe.score(X_te, y_te))
print(classification_report(y_te, pipe.predict(X_te)))
print("ROC-AUC :", roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))

C is the inverse regularization strength (sklearn quirk): smaller C = stronger penalty. Default C=1.0.

predict_proba(X) returns probabilities per class — essential for thresholding and ROC.
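To see the effect of C directly, the sketch below refits the same scaled model at three values of C and compares the size of the coefficient vector. The chosen C values and max_iter=5000 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    model.fit(X, y)
    coef = model[-1].coef_                 # shape (1, n_features) for a binary problem
    norms[C] = float(np.linalg.norm(coef))

for C, norm in norms.items():
    print(f"C={C:<6} ||β|| = {norm:.2f}")
```

Smaller C means a stronger L2 penalty, so the coefficients shrink toward zero; larger C lets them grow.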

5. Multi-class

Two strategies, both available in scikit-learn:

# One-vs-Rest (OvR): K binary classifiers, one per class
# (the multi_class= argument is deprecated since scikit-learn 1.5; wrap the estimator instead)
from sklearn.multiclass import OneVsRestClassifier
OneVsRestClassifier(LogisticRegression())

# Multinomial (softmax): one joint model, usually better
LogisticRegression(solver="lbfgs")   # multinomial is the default for 3+ classes in modern sklearn

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)
clf.predict_proba(X[:1])   # → shape (1, 3): [[p0, p1, p2]]
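The multinomial model is the sigmoid generalized to K classes: one linear score per class, passed through softmax. A minimal sketch that checks this against predict_proba, assuming a recent scikit-learn where the default lbfgs solver fits the multinomial model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=500).fit(X, y)

z = clf.decision_function(X[:3])                  # one linear score per class, shape (3, 3)
z = z - z.max(axis=1, keepdims=True)              # subtract the row max for numerical stability
softmax = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

print(np.allclose(softmax, clf.predict_proba(X[:3])))
print(softmax.sum(axis=1))                        # each row sums to 1
```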

6. Classification metrics — the full table

For binary problems with the confusion matrix:

|          | Predicted 1 | Predicted 0 |
|----------|-------------|-------------|
| Actual 1 | TP          | FN          |
| Actual 0 | FP          | TN          |

| Metric | Formula | What it answers |
|--------|---------|-----------------|
| Accuracy | (TP+TN) / total | "What fraction were correct?" (misleading on imbalanced data) |
| Precision | TP / (TP+FP) | "When the model says positive, how often is it right?" |
| Recall (Sensitivity, TPR) | TP / (TP+FN) | "Of all real positives, how many did we catch?" |
| Specificity (TNR) | TN / (TN+FP) | "Of all real negatives, how many did we correctly identify?" |
| F1 | 2·P·R / (P+R) | Harmonic mean of precision and recall |
| ROC-AUC | area under the TPR vs FPR curve | "How well does the model RANK positives above negatives?" (threshold-independent) |
| Log-loss | − mean(y·log(p) + (1−y)·log(1−p)) | Penalizes confident-wrong probabilities |
| PR-AUC | area under the precision-recall curve | More informative than ROC-AUC under extreme imbalance |

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score, roc_curve, precision_recall_curve, average_precision_score,
    log_loss,
)

y_pred  = pipe.predict(X_te)
y_proba = pipe.predict_proba(X_te)[:, 1]

print("Acc :", accuracy_score(y_te, y_pred))
print("Prec:", precision_score(y_te, y_pred))
print("Rec :", recall_score(y_te, y_pred))
print("F1  :", f1_score(y_te, y_pred))
print("AUC :", roc_auc_score(y_te, y_proba))
print("LL  :", log_loss(y_te, y_proba))
print(classification_report(y_te, y_pred))
print("Confusion:\n", confusion_matrix(y_te, y_pred))
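The formulas in the table can be checked by hand. The sketch below uses ten made-up labels, derives each metric from the four confusion-matrix counts, and compares against the sklearn functions; y_hat is our name for the toy predictions.

```python
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,
)

# Toy labels, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)                  # 3 1 1 5
print(accuracy, precision, recall)     # 0.8 0.75 0.75
```

sklearn has no dedicated specificity function; it equals recall_score(y_true, y_hat, pos_label=0).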

7. Which metric should you pick?

Decision tree:

flowchart TD
    A{Class balance?} -->|roughly 50/50| AC[Accuracy or F1 OK]
    A -->|imbalanced| B{Cost asymmetry?}
    B -->|FP and FN equally bad| F1[F1]
    B -->|FP much worse e.g., spam filter| PR[Precision]
    B -->|FN much worse e.g., cancer screening| RC[Recall]
    B -->|need ranking quality| AUC[ROC-AUC<br/>or PR-AUC if heavy imbalance]

Concrete examples:

| Problem | Class balance | Right metric | Why |
|---------|---------------|--------------|-----|
| Cancer screening | imbalanced (rare) | Recall + PR-AUC | False negatives kill people |
| Spam filter | imbalanced | Precision | False positives lose legitimate mail |
| Search ranking | doesn't matter | ROC-AUC / NDCG | Ranking quality, not classification |
| Credit fraud | extreme imbalance | PR-AUC + Recall@k | Catch as many frauds as possible per investigation |
| General classifier | ~balanced | F1 + Accuracy | Cheap defaults |

8. The threshold matters

Default predict() uses 0.5. You should tune this based on the cost of FP vs FN:

from sklearn.metrics import precision_recall_curve
import numpy as np

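# NOTE: tune on a validation split in practice; reusing the test set here leaks information
prec, rec, thr = precision_recall_curve(y_te, y_proba)

# Pick threshold that maximizes F1.
# prec and rec have one more entry than thr (the final point has no threshold), so drop it.
f1s = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_idx = np.argmax(f1s)
best_thr = thr[best_idx]
print(f"Best threshold: {best_thr:.3f}, F1: {f1s[best_idx]:.3f}")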

# Apply
y_pred_new = (y_proba >= best_thr).astype(int)

Visualize the trade-off:

import matplotlib.pyplot as plt
plt.plot(rec, prec)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()   # needed outside notebooks to display the figure
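Maximizing F1 treats both error types alike. When the costs differ, one option is to pick the threshold that minimizes total cost on a validation split. The sketch below is one way to do that; the synthetic data and the costs COST_FP and COST_FN are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (illustrative only)
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
p_val = clf.predict_proba(X_val)[:, 1]

COST_FP, COST_FN = 1.0, 10.0     # hypothetical: a miss costs 10x a false alarm

def total_cost(threshold):
    pred = p_val >= threshold
    fp = np.sum(pred & (y_val == 0))       # false alarms at this threshold
    fn = np.sum(~pred & (y_val == 1))      # misses at this threshold
    return COST_FP * fp + COST_FN * fn

thresholds = np.round(np.linspace(0.01, 0.99, 99), 2)   # 0.01, 0.02, ..., 0.99
costs = np.array([total_cost(t) for t in thresholds])
best_thr = thresholds[np.argmin(costs)]

print(f"cost at 0.50  : {total_cost(0.5):.0f}")
print(f"best threshold: {best_thr:.2f}, cost: {costs.min():.0f}")
```

Because 0.5 is one of the candidates, the chosen threshold can never cost more than the default on this split.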

9. Imbalanced data — the cheats

# 1. class_weight="balanced" — weights inversely proportional to class frequencies
LogisticRegression(class_weight="balanced")
# or custom:
LogisticRegression(class_weight={0: 1, 1: 5})

# 2. Stratified split (always: keeps class proportions)
from sklearn.model_selection import StratifiedKFold
train_test_split(X, y, stratify=y)   # plus test_size, random_state, etc. as needed
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3. Resampling — imbalanced-learn package
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

pipe = ImbPipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),       # synthesize minority samples
    ("clf",   LogisticRegression(max_iter=500)),
])

# 4. Right metrics — never accuracy. Use F1, PR-AUC, balanced accuracy.
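What class_weight="balanced" actually computes is n_samples / (n_classes * count per class). A minimal sketch on toy 95/5 labels, using sklearn's compute_class_weight helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)               # toy 95/5 labels
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))   # n_samples / (n_classes * count per class)

print(dict(zip(classes.tolist(), weights.round(3).tolist())))   # {0: 0.526, 1: 10.0}
```

Each minority example counts about 19 times as much as a majority example in the loss, so both classes contribute equally in total.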

10. Common pitfalls

  • Reporting accuracy on imbalanced data. 95% accuracy on a 95/5 problem is the trivial "predict majority class" baseline.
  • Confusing C with α. sklearn's C is the inverse of regularization strength. Smaller C = more reg.
  • Using predict() when you should use predict_proba(). Probabilities give you threshold-tuning, ranking, and richer metrics.
  • Forgetting stratify=y on train_test_split for imbalanced data. Random splits can leave a class missing from test.
  • No baseline. Always compare to "predict majority class" — it's frustratingly hard to beat sometimes.
  • Calibration mismatch. Logistic regression is usually well calibrated because it optimizes log-loss directly. Random forests and SVMs often are not: their predict_proba outputs may not match observed frequencies. Wrap them with CalibratedClassifierCV if you need calibrated probabilities.
  • Ignoring class order in roc_auc_score. Columns of predict_proba() follow clf.classes_, so for 0/1 labels pass predict_proba()[:, 1], the probability of the positive class.
  • Calling roc_auc_score on multi-class targets without the multi_class= argument. The default, multi_class="raise", throws an error; pass multi_class="ovr" or "ovo" together with the full predict_proba matrix.
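The "predict majority class" baseline from the pitfalls above takes a few lines with DummyClassifier. A minimal sketch on toy 95/5 labels (the all-zero feature matrix is a placeholder, since the dummy model ignores features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y = np.array([0] * 95 + [1] * 5)               # toy 95/5 labels
X = np.zeros((len(y), 1))                      # placeholder features, ignored by the dummy

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_base = baseline.predict(X)

acc = accuracy_score(y, y_base)
bal = balanced_accuracy_score(y, y_base)
f1 = f1_score(y, y_base, zero_division=0)      # no positive predictions; report F1 as 0

print(acc, bal, f1)                            # 0.95 0.5 0.0
```

Accuracy looks strong while balanced accuracy and F1 expose that the model never finds a positive. Any real model should beat these numbers.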

11. When to use vs not use

| Use Logistic Regression when | Use something else when |
|------------------------------|-------------------------|
| Need calibrated probabilities | Only ranking quality (AUC) matters: try GBM |
| Need interpretability (coefficients) | Black-box accuracy is OK: try GBM |
| Linear separability or close to it | Strong feature interactions you can't enumerate |
| Want a fast, robust baseline | High-stakes accuracy on rich tabular data → XGBoost / LightGBM |
| Multi-class with 3–10 classes | 1000+ classes → softmax neural net or hierarchical models |
| Streaming / online learning | SGDClassifier(loss="log_loss") |

12. Cheatsheet

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss,
    confusion_matrix, classification_report,
    precision_recall_curve, roc_curve, ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

# Canonical pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf",   LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Hyperparam tuning
grid = GridSearchCV(pipe, param_grid={
    "clf__C":       [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l2"],
    # for L1/elasticnet: "clf__solver": ["saga"], "clf__l1_ratio": [0.5]
}, scoring="f1", cv=5).fit(X_tr, y_tr)
print(grid.best_params_)

# Stratified CV (essential for imbalanced data)
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Probabilities + custom threshold (use the fitted search; `pipe` itself was never fitted)
proba  = grid.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)        # custom threshold

# Inspect confusion + report for the thresholded predictions
ConfusionMatrixDisplay.from_predictions(y_te, y_pred)
print(classification_report(y_te, y_pred))

13. Q&A — recall test

  • Q: Why is accuracy a bad metric for imbalanced data? A: A classifier that always predicts the majority class achieves accuracy = majority proportion (e.g., 95% on a 95/5 problem) while doing nothing useful.

  • Q: Precision vs Recall — which captures false alarms? A: Precision — TP / (TP+FP). Low precision = many false alarms. Use when FPs are costly (spam, ads).

  • Q: When does ROC-AUC mislead? A: Under extreme class imbalance (e.g., 1:1000). FPR is computed over the huge negative class, so even many false positives barely move the curve. Prefer PR-AUC.

  • Q: What does C=0.01 vs C=100 do in sklearn LogisticRegression? A: C is the INVERSE of α. C=0.01 → strong regularization, simpler model. C=100 → weak regularization, potentially overfits.

  • Q: Probabilities from predict_proba() for RandomForestClassifier — can you trust them? A: Not directly — they're not well-calibrated. Wrap with CalibratedClassifierCV(method="isotonic") if calibrated probabilities matter.

  • Q: How do you change the classification threshold from 0.5? A: Skip predict(). Use proba = clf.predict_proba(X)[:, 1]; y_pred = (proba >= 0.3).astype(int).
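The calibration answer above can be sketched as follows. The synthetic data and forest settings are assumptions for illustration, and the Brier score is used only as one possible check; whether calibration helps on a given dataset has to be measured.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)     # synthetic data
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, stratify=y, random_state=0)

raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=5,                      # calibration maps are fit on held-out folds
).fit(X_fit, y_fit)

p_raw = raw.predict_proba(X_hold)[:, 1]
p_cal = calibrated.predict_proba(X_hold)[:, 1]

# Brier score = mean squared error of the predicted probabilities (lower is better)
print("raw       :", brier_score_loss(y_hold, p_raw))
print("calibrated:", brier_score_loss(y_hold, p_cal))
```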

Practice

What does this print?

Expected: True

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:1])
print(round(proba.sum(), 2) == 1.0)    # probabilities sum to 1 per row

Fix the check: use F1 score (not accuracy) for a 95/5 imbalanced classification

Expected: True

from sklearn.metrics import accuracy_score
y_true = [0]*95 + [1]*5
y_pred = [0]*100                      # always predicts majority — useless model
print(accuracy_score(y_true, y_pred) < 0.9)   # bug: accuracy is 0.95, looks great but model is broken

Quiz — Quick check

What you remember

Q1. Which metric is misleading on an imbalanced dataset (e.g., 99% negative)?

  • Accuracy
  • Precision
  • Recall
  • F1

Why: A model that always predicts "negative" gets 99% accuracy with zero useful predictions. Use precision, recall, F1, or PR-AUC for imbalanced problems.

Q2. What does predict_proba return?

  • The predicted class
  • An (n_samples, n_classes) array of probabilities per row
  • The model's coefficients
  • An error score

Why: Useful for picking a custom threshold instead of the default 0.5. clf.predict() returns the argmax of probabilities; predict_proba() returns the raw probabilities.

Q3. What's the difference between precision and recall?

  • Precision = "of the rows I predicted positive, how many really are?"; Recall = "of the truly positive rows, how many did I catch?"
  • They're identical
  • Precision is for regression
  • Recall ignores false positives

Why: Precision penalizes false positives. Recall penalizes false negatives. Trade-off — they usually move in opposite directions as you change the threshold. F1 balances both.

Common doubts

When should I use ROC-AUC vs PR-AUC?

ROC-AUC for balanced datasets. PR-AUC (precision-recall AUC) for imbalanced datasets where the positive class is rare. ROC-AUC can look misleadingly good when negatives massively dominate. PR-AUC focuses on the positive class.
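One way to see the difference is to compare the chance level of each metric. The sketch below scores synthetic 1%-positive labels with pure noise; average_precision_score is used as a common estimate of PR-AUC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

n = 20_000
y = (rng.random(n) < 0.01).astype(int)         # roughly 1% positives (synthetic)
scores = rng.random(n)                         # uninformative scores: pure noise

roc = roc_auc_score(y, scores)
ap = average_precision_score(y, scores)        # a common estimate of PR-AUC

print(f"prevalence: {y.mean():.3f}")
print(f"ROC-AUC   : {roc:.3f}")                # chance level is about 0.5 at any class balance
print(f"PR-AUC    : {ap:.3f}")                 # chance level is roughly the positive prevalence
```

Because the PR-AUC baseline tracks prevalence, a PR-AUC of 0.30 on a 1% problem is a large lift, while a ROC-AUC that looks high can still hide poor precision.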

How do I handle class imbalance?

Several options: (1) class_weight="balanced" in the model; (2) resampling with imblearn (SMOTE for oversampling, RandomUnderSampler for undersampling); (3) threshold tuning: pick a probability threshold optimized for your business metric instead of the default 0.5.

What's the threshold I should use for classification?

Depends on the cost of false positives vs false negatives. Bank fraud → low threshold (catch as many frauds as possible). Spam filter → high threshold (don't flag legitimate emails). Pick by computing precision/recall curves on a validation set.