Multiple Linear Regression
When you have more than one input feature, the line becomes a flat plane (or, in higher dimensions, a "hyperplane"):
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Each xᵢ is a column in your dataset. Each βᵢ is the change in the predicted y for a one-unit increase in xᵢ, holding all other columns fixed. β₀ is the intercept.
Example: predicting a student's exam score
Inputs: hours studied, hours slept. Output: exam score.
from sklearn.linear_model import LinearRegression
import numpy as np
# 6 students. Columns: [hours_studied, hours_slept]
X = np.array([
[2, 6],
[3, 7],
[5, 8],
[7, 6],
[9, 7],
[10, 8],
])
y = np.array([55, 65, 80, 78, 92, 100])
model = LinearRegression().fit(X, y)
print("β₁ (study hours):", round(model.coef_[0], 2))
print("β₂ (sleep hours):", round(model.coef_[1], 2))
print("β₀ (intercept) :", round(model.intercept_, 2))
# Predict a new student: 6 hours studied, 7 hours slept
pred = model.predict([[6, 7]])[0]
print(f"Predicted score: {pred:.1f}")
The coefficients answer real questions:
- "Every extra hour of study adds ~β₁ points."
- "Every extra hour of sleep adds ~β₂ points."
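To make the link between the coefficients and the prediction concrete, here is a minimal sketch that applies the plane equation by hand and compares it with model.predict. It reuses the toy data from the example; the names b0, b1, b2, manual and from_model are illustrative, not part of sklearn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as the example. Columns: [hours_studied, hours_slept]
X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])
model = LinearRegression().fit(X, y)

b0 = model.intercept_
b1, b2 = model.coef_

# Apply y = b0 + b1*x1 + b2*x2 by hand for 6 hours studied, 7 hours slept
manual = b0 + b1 * 6 + b2 * 7
from_model = model.predict([[6, 7]])[0]

print(f"manual: {manual:.3f}  predict(): {from_model:.3f}")
print("match:", np.isclose(manual, from_model))
```

If the two numbers agree, predict() is doing nothing more than the weighted sum plus the intercept.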
Why scaling matters
If one column has a wildly different range from another (e.g. income in dollars vs. age in years), its coefficient ends up on a very different scale too, so raw coefficients are hard to compare. Best practice: scale features before fitting so the coefficients are comparable. (We cover this in Feature Scaling.)
The predictions are identical with or without scaling for plain LinearRegression; scaling just makes the coefficients readable.
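A quick way to convince yourself of this is to fit the same data twice, once raw and once through a scaler, and compare the predictions. This is a sketch on the toy data from the example; make_pipeline is used only for brevity, and the variable names raw, scaled and same are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Predictions should agree up to floating-point error
same = np.allclose(raw.predict(X), scaled.predict(X))
print("Predictions identical:", same)

# The coefficients differ, because they are expressed in different units
print("Raw coefficients:   ", raw.coef_.round(2))
print("Scaled coefficients:", scaled.named_steps["linearregression"].coef_.round(2))
```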
Try it with scaled features
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
X = np.array([
[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8],
])
y = np.array([55, 65, 80, 78, 92, 100])
pipe = Pipeline([
("scale", StandardScaler()),
("lr", LinearRegression()),
]).fit(X, y)
print("Coefficients (on scaled inputs):")
print(" study:", round(pipe.named_steps["lr"].coef_[0], 2))
print(" sleep:", round(pipe.named_steps["lr"].coef_[1], 2))
print("R² on training data:", round(pipe.score(X, y), 3))
After scaling, the coefficient with the larger absolute value belongs to the more influential feature. Here you'll see that study hours dominate.
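The scaled coefficients are not a different model, just the raw coefficients re-expressed per standard deviation instead of per unit. The sketch below checks that relationship on the toy data, assuming StandardScaler's scale_ attribute (the per-column standard deviation) as the conversion factor; raw and std_model are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

scaler = StandardScaler().fit(X)
raw = LinearRegression().fit(X, y)
std_model = LinearRegression().fit(scaler.transform(X), y)

# Coefficient per standard deviation = coefficient per unit * standard deviation
converted = raw.coef_ * scaler.scale_
print("Raw (per hour):        ", raw.coef_.round(2))
print("Scaled (per std. dev.):", std_model.coef_.round(2))
print("Relationship holds:", np.allclose(std_model.coef_, converted))
```

Study hours vary much more across these six students than sleep hours do, which is why the study coefficient grows the most once both are put on the same footing.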
What you learned
- Multiple linear regression handles many features at once.
- .coef_ is an array with one value per feature.
- Pipeline([scaler, model]) is the production pattern.
- Use .score(X, y) for R² on the training data (we'll cover proper evaluation soon).
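For reference, the R² that .score reports for a regressor can be reproduced by hand as 1 − SS_res / SS_tot. The sketch below does that on the toy data; ss_res, ss_tot and r2_manual are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])
model = LinearRegression().fit(X, y)

pred = model.predict(X)
ss_res = np.sum((y - pred) ** 2)        # error left over after the fit
ss_tot = np.sum((y - y.mean()) ** 2)    # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print("manual R²:", round(r2_manual, 3))
print(".score() :", round(model.score(X, y), 3))
```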
Practice
What does this print?
Expected: 2
Add the bias/intercept correctly when interpreting predictions
Expected: True
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([5, 7, 9]) # y = 2x + 3
model = LinearRegression().fit(X, y)
predicted = model.coef_[0] * 5 + model.intercept_  # slope * 5 + intercept = 2*5 + 3
print(round(predicted) == 13)
Quiz: Quick check
What you remember
Q1. For 3 features, model.coef_ has how many values?
- 1
- 3
- 9
- Depends on the data
Why: One coefficient per input feature, plus a single model.intercept_ for the bias term.
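You can check this directly by inspecting the fitted attributes. The sketch below uses made-up, noise-free data with three features; the true weights [1, 2, 3] and bias 4 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 rows, 3 features
y = X @ np.array([1.0, 2.0, 3.0]) + 4.0      # hypothetical weights and bias, no noise

model = LinearRegression().fit(X, y)

print("coef_ shape:", model.coef_.shape)     # one entry per feature
print("coef_:", model.coef_.round(2))
print("intercept_:", round(float(model.intercept_), 2))
```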
Q2. What does model.intercept_ represent?
- The predicted value of y when all features are 0
- The largest coefficient
- The mean of y
- The R² score
Why: Geometrically, it is where the regression hyperplane crosses the y-axis. With centered features (for example after StandardScaler), it equals the mean of the training y.
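The link between centering and the mean of y is easy to verify. This sketch centers the toy exam data by subtracting each column's mean and then compares the fitted intercept with y.mean(); X_centered is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]])
y = np.array([55, 65, 80, 78, 92, 100])

# After centering, "all features are 0" means "an average student"
X_centered = X - X.mean(axis=0)
model = LinearRegression().fit(X_centered, y)

print("intercept:", round(float(model.intercept_), 3))
print("mean of y:", round(float(y.mean()), 3))
```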
Q3. Why might two correlated features cause unstable coefficients?
- sklearn bug
- Multicollinearity — the regression can't tell which of the correlated features deserves the credit, so coefficients vary wildly with small data changes
- Memory issues
- Float precision
Why: If x1 ≈ x2, the model can shift weight between them arbitrarily. The total prediction stays similar, but individual coefficients become unstable. Fix: drop one of the features or use Ridge regularization; checking VIF helps diagnose the problem.
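The effect can be reproduced with synthetic data. In this sketch x2 is a near-copy of x1 and only their shared signal drives y; the true weight of 3, the noise levels, the seed and alpha=1.0 are all assumptions picked for illustration. Plain least squares may split the weight between the two columns in an arbitrary way, while Ridge tends to share it roughly evenly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)     # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # hypothetical true signal

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The sum of the two weights is well determined; the split between them is not
print("OLS coefficients:  ", ols.coef_.round(2), "sum:", round(float(ols.coef_.sum()), 2))
print("Ridge coefficients:", ridge.coef_.round(2), "sum:", round(float(ridge.coef_.sum()), 2))
```

Try changing the seed: the OLS sum should stay near 3 while the individual OLS coefficients move around, which is the instability the answer describes.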
Common doubts
How do I tell which feature is most important?
With scaled features, the magnitude of the coefficient tells you the relative importance. Without scaling, large-scale features get small coefficients (and vice versa), so magnitudes aren't comparable. Always scale before interpreting coefficient importance.
Can I use linear regression with hundreds of features?
Yes, but watch for: (1) multicollinearity (use Ridge), (2) overfitting (use cross-validation and L1/L2 regularization), (3) the curse of dimensionality (features may need pre-selection). Lasso (L1) can shrink unimportant coefficients to zero — automatic feature selection.
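As a rough illustration of Lasso's feature selection, the sketch below builds synthetic data where only the first two of ten features matter, then fits a scaled Lasso. The data-generating weights, the seed and alpha=0.2 are assumptions for the demo; in practice alpha would be tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
# Hypothetical ground truth: only features 0 and 1 carry signal
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.2)).fit(X, y)
coef = pipe.named_steps["lasso"].coef_

print("Coefficients:", coef.round(2))
print("Features kept (non-zero):", np.flatnonzero(coef))
```

The kept features are the ones with non-zero coefficients; the rest have been dropped from the model entirely, not merely down-weighted.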
Why do my coefficients have such different scales?
Because the features themselves have different scales. A feature in millions gets a tiny coefficient (large numbers × small coef = reasonable contribution). A feature in [0, 1] needs a larger coefficient. Always scale before interpreting; the underlying math doesn't change.
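Changing a feature's units shows the same thing in miniature. In this sketch the study column of the toy exam data is converted from hours to minutes (an illustrative change, not part of the original example): the coefficient shrinks by the same factor of 60 and the predictions do not move.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2, 6], [3, 7], [5, 8], [7, 6], [9, 7], [10, 8]], dtype=float)
y = np.array([55, 65, 80, 78, 92, 100])

X_minutes = X.copy()
X_minutes[:, 0] *= 60                      # hours_studied -> minutes_studied

m_hours = LinearRegression().fit(X, y)
m_minutes = LinearRegression().fit(X_minutes, y)

print("points per hour:  ", round(float(m_hours.coef_[0]), 4))
print("points per minute:", round(float(m_minutes.coef_[0]), 4))
print("same predictions: ", np.allclose(m_hours.predict(X), m_minutes.predict(X_minutes)))
```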