Multiple Linear Regression

When you have more than one input feature, the line becomes a flat plane (or, in higher dimensions, a "hyperplane"):

y = β₀ + β₁·x₁ + β₂·x₂ + … + βₚ·xₚ

Each xᵢ is a column in your dataset. Each βᵢ is the change in the predicted y for a one-unit increase in xᵢ, holding all other columns fixed. β₀ is the intercept: the predicted y when every xᵢ is 0.

Example — predicting a student's exam score

Inputs: hours studied, hours slept. Output: exam score.

from sklearn.linear_model import LinearRegression
import numpy as np

# 6 students. Columns: [hours_studied, hours_slept]
X = np.array([
    [2, 6],
    [3, 7],
    [5, 8],
    [7, 6],
    [9, 7],
    [10, 8],
])
y = np.array([55, 65, 80, 78, 92, 100])

model = LinearRegression().fit(X, y)

print("β₁ (study hours):", round(model.coef_[0], 2))
print("β₂ (sleep hours):", round(model.coef_[1], 2))
print("β₀ (intercept) :", round(model.intercept_, 2))

# Predict a new student: 6 hours studied, 7 hours slept
pred = model.predict([[6, 7]])[0]
print(f"Predicted score: {pred:.1f}")

The coefficients answer real questions:

  • "Every extra hour of study adds ~β₁ points."
  • "Every extra hour of sleep adds ~β₂ points."
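One way to check this reading is to predict for two students who differ by exactly one hour of study and compare the results. The sketch below reuses the six students from the example; the two comparison rows are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same six students as above. Columns: [hours_studied, hours_slept]
X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])
model = LinearRegression().fit(X, y)

# Two hypothetical students: same sleep, one extra hour of study
base = model.predict([[6, 7]])[0]
plus_one_study = model.predict([[7, 7]])[0]

# The gap between the two predictions should match the study coefficient
gap = plus_one_study - base
print("Gap between predictions:", round(gap, 2))
print("β₁ from the model      :", round(model.coef_[0], 2))

# The same prediction written out by hand: β₀ + β₁·x₁ + β₂·x₂
manual = model.intercept_ + model.coef_ @ np.array([6, 7])
print("Manual equals predict  :", np.isclose(manual, base))
```

The "holding all other columns fixed" part matters: the gap equals β₁ only because hours slept is the same for both students.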

Why scaling matters

If one column has a wildly different range (e.g. income in dollars vs. age in years), its coefficient ends up on a very different scale too: a feature measured in large numbers gets a small coefficient, and vice versa. That makes raw coefficients hard to compare. Best practice: scale features before fitting so coefficients are comparable. (We cover this in Feature Scaling.)

The predictions are identical with or without scaling for plain LinearRegression; scaling just makes the coefficients readable.
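You can confirm this yourself with a quick comparison. This is a minimal sketch using the exam-score data from above; the rows in X_new are hypothetical students, and make_pipeline is just a shorthand for building a Pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

# One model on raw features, one on standardized features
raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Hypothetical new students to predict for
X_new = np.array([[6, 7], [4, 5], [8, 9]])

same = np.allclose(raw.predict(X_new), scaled.predict(X_new))
print("Same predictions:", same)
```

This holds for ordinary least squares. Regularized models such as Ridge or Lasso are sensitive to feature scale, so there the predictions can differ.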

Try it with scaled features

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

X = np.array([
    [2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8],
])
y = np.array([55, 65, 80, 78, 92, 100])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr",    LinearRegression()),
]).fit(X, y)

print("Coefficients (on scaled inputs):")
print("  study:", round(pipe.named_steps["lr"].coef_[0], 2))
print("  sleep:", round(pipe.named_steps["lr"].coef_[1], 2))
print("R² on training data:", round(pipe.score(X, y), 3))

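After scaling, each coefficient is the change in predicted score per one standard deviation of that feature, so the coefficient with the larger absolute value belongs to the more influential feature. Here you'll see that study hours dominate.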
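If you still want "points per hour" after fitting on scaled inputs, the two views can be converted into each other. The sketch below assumes the same pipeline as above and divides each scaled coefficient by the standard deviation that StandardScaler stored in its scale_ attribute.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
]).fit(X, y)

scaler = pipe.named_steps["scale"]
lr = pipe.named_steps["lr"]

per_sd = lr.coef_                     # points per one standard deviation
per_hour = lr.coef_ / scaler.scale_   # points per one hour (original units)

# Compare against a model fitted directly on the raw features
raw = LinearRegression().fit(X, y)
print("Per standard deviation:", np.round(per_sd, 2))
print("Per hour              :", np.round(per_hour, 2))
print("Matches raw fit       :", np.allclose(per_hour, raw.coef_))
```

The per-hour and per-standard-deviation numbers can rank the features differently, because sleep hours vary much less across these students than study hours do.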

What you learned

  • Multiple linear regression handles many features at once.
  • .coef_ is a NumPy array with one value per feature.
  • Pipeline([("scale", scaler), ("lr", model)]), a list of (name, step) pairs, is the production pattern.
  • Use .score(X, y) for R² on the training data (we'll cover proper evaluation soon).

Practice

What does this print?

Expected: 2

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [2, 1], [3, 1], [1, 2], [2, 2]])
y = np.array([3, 5, 7, 4, 6])
model = LinearRegression().fit(X, y)
print(len(model.coef_))     # one coefficient per feature

Add the bias/intercept correctly when interpreting predictions

Expected: True

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([5, 7, 9])     # y = 2x + 3
model = LinearRegression().fit(X, y)
predicted = model.coef_[0] * 5 + model.intercept_   # slope * x plus the intercept
print(round(predicted) == 13)

Quiz — Quick check

What you remember

Q1. For 3 features, model.coef_ has how many values?

  • 1
  • 3
  • 9
  • Depends on the data

Why: There is one coefficient per input feature, plus a single model.intercept_ for the bias term.

Q2. What does model.intercept_ represent?

  • The predicted value of y when all features are 0
  • The largest coefficient
  • The mean of y
  • The R² score

Why: Geometrically, it is where the regression hyperplane crosses the y-axis. If the features are centered (as StandardScaler does), the intercept equals the mean of the training y.
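The link between centering and the mean of y is easy to check. A minimal sketch, reusing the exam-score data from earlier in this lesson and centering the columns by hand with NumPy:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

# Subtract each column's mean so every feature averages to 0
X_centered = X - X.mean(axis=0)

model = LinearRegression().fit(X_centered, y)
print("Intercept:", round(model.intercept_, 2))
print("Mean of y:", round(y.mean(), 2))
print("Equal    :", np.isclose(model.intercept_, y.mean()))
```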

Q3. Why might two correlated features cause unstable coefficients?

  • sklearn bug
  • Multicollinearity — the regression can't tell which of the correlated features deserves the credit, so coefficients vary wildly with small data changes
  • Memory issues
  • Float precision

Why: If x1 ≈ x2, the model can shift weight between them arbitrarily. The total prediction stays similar, but individual coefficients become unstable. Fix: drop one of the pair or use Ridge regularization. Checking the variance inflation factor (VIF) helps you detect the problem in the first place.
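To see what this looks like in numbers, here is a rough sketch on synthetic data (an assumption for illustration, not part of the lesson's dataset). The vif helper is a hypothetical name; it computes VIF as 1 / (1 − R²), where R² comes from regressing one feature on the others.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # only x1 truly drives y

def vif(X, j):
    """VIF of column j: 1 / (1 - R²), with R² from regressing column j on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print("VIF per feature:", np.round(vifs, 1))

# Individual coefficients on x1 and x2 are poorly determined,
# but their sum (the combined effect) stays close to the true value of 3
model = LinearRegression().fit(X, y)
print("Coefficients on x1 and x2:", np.round(model.coef_[:2], 2))
print("Their sum:", round(model.coef_[0] + model.coef_[1], 2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign; the exact cutoff is a convention, not a hard rule.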

Common doubts

How do I tell which feature is most important?

With scaled features, the absolute value of each coefficient gives a rough measure of relative importance (rough, because correlated features share the credit between them). Without scaling, large-scale features get small coefficients (and vice versa), so magnitudes aren't comparable. Always scale before interpreting coefficient importance.

Can I use linear regression with hundreds of features?

Yes, but watch for: (1) multicollinearity (use Ridge), (2) overfitting (use cross-validation and L1/L2 regularization), (3) the curse of dimensionality (features may need pre-selection). Lasso (L1) can shrink unimportant coefficients to zero — automatic feature selection.
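As an illustration of Lasso's feature selection, the sketch below builds synthetic data in which only the first two of ten features affect y. Both the data and alpha=0.5 are assumptions chosen for the demo; in practice you would tune alpha, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only columns 0 and 1 drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty (0.5 is an arbitrary demo value)
lasso = Lasso(alpha=0.5).fit(X, y)

kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print("Coefficients:", np.round(lasso.coef_, 2))
print("Features kept:", kept)
```

Note that the surviving coefficients are shrunk toward zero as well, so Lasso trades some bias for a simpler model.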

Why do my coefficients have such different scales?

Because the features themselves have different scales. A feature in millions gets a tiny coefficient (large numbers × small coef = reasonable contribution). A feature in [0, 1] needs a larger coefficient. Scale before comparing coefficients; for plain LinearRegression this changes the coefficients but not the predictions.
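A unit change makes this concrete. In the sketch below, the exam-score data from this lesson is refitted with study time expressed in minutes instead of hours (a hypothetical variation on the original example). The feature gets 60 times larger, so its coefficient should get 60 times smaller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: [hours_studied, hours_slept]
X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]], dtype=float)
y = np.array([55, 65, 80, 78, 92, 100])

# Same data, but study time is now measured in minutes
X_minutes = X.copy()
X_minutes[:, 0] *= 60

hours_model = LinearRegression().fit(X, y)
minutes_model = LinearRegression().fit(X_minutes, y)

ratio = hours_model.coef_[0] / minutes_model.coef_[0]
print("Study coefficient per hour  :", round(hours_model.coef_[0], 4))
print("Study coefficient per minute:", round(minutes_model.coef_[0], 4))
print("Ratio                       :", round(ratio, 2))
```

The sleep coefficient and the predictions are unaffected; only the number attached to the rescaled column changes.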
