Try It Yourself
Three small exercises. Edit the code in each runner, press Run, and see if you can solve it.
Exercise 1: Predict ice cream sales from temperature
Below is a dataset of daily ice cream sales (y) and the temperature that day (x).
Task: Fit a simple linear regression. Then predict sales at 35°C.
from sklearn.linear_model import LinearRegression
import numpy as np
# Temperature (°C)
X = np.array([[10], [15], [20], [25], [30], [35], [40]])
# Sales (units)
y = np.array([120, 180, 240, 310, 390, 470, 550])
# YOUR CODE HERE:
# 1. Fit a LinearRegression on X and y
# 2. Print the slope and intercept
# 3. Predict sales at 35°C
# 4. Predict sales at 50°C — does this seem realistic?
model = LinearRegression().fit(X, y)
print("slope: ", round(model.coef_[0], 2))
print("intercept:", round(model.intercept_, 2))
print("Predict 35°C:", model.predict([[35]])[0])
print("Predict 50°C:", model.predict([[50]])[0], "← extrapolation, be careful!")
Lesson: Linear models will happily predict anything you ask, even far outside the training range. Sales at 50°C is extrapolation — the linear assumption probably breaks at extreme temperatures.
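One way to guard against silent extrapolation is to compare each query against the range of the training inputs before trusting the prediction. The sketch below reuses the exercise data; `predict_with_range_check` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as the exercise above
X = np.array([[10], [15], [20], [25], [30], [35], [40]])
y = np.array([120, 180, 240, 310, 390, 470, 550])
model = LinearRegression().fit(X, y)


def predict_with_range_check(model, X_train, x_new):
    """Hypothetical helper: return (prediction, is_extrapolation)."""
    lo, hi = X_train.min(), X_train.max()
    # A query outside [lo, hi] was never seen during training
    is_extrapolation = not (lo <= x_new <= hi)
    prediction = float(model.predict([[x_new]])[0])
    return prediction, is_extrapolation


for temp in (35, 50):
    pred, extrap = predict_with_range_check(model, X, temp)
    note = "outside training range" if extrap else "inside training range"
    print(f"{temp}°C -> {pred:.1f} units ({note})")
```

A range check like this only catches the obvious case of a single feature; with several features a point can sit inside every per-feature range and still be far from any training example.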
Exercise 2: Add a feature, watch R² go up
This exercise uses the California housing dataset. Fit one model with only MedInc and another with all 8 features, then compare their test R².
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# A: Fit using only MedInc
pipe_a = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe_a.fit(X_tr[["MedInc"]], y_tr)
print("MedInc only — Test R²:", round(pipe_a.score(X_te[["MedInc"]], y_te), 3))
# B: Fit using all 8 features
pipe_b = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe_b.fit(X_tr, y_tr)
print("All 8 features — Test R²:", round(pipe_b.score(X_te, y_te), 3))
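Lesson: More features help, up to a point. Each added feature gives the model more freedom to fit noise, and the risk of overfitting grows as the number of features approaches the number of training samples. With roughly 16,500 training rows here, 8 features are far from that limit; when features are many relative to samples, add regularization (see Ridge/Lasso).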
Exercise 3: Underfit a parabola
This dataset is shaped like y = x². Fit a linear model. See how badly it fails.
Then try adding a polynomial feature (x²) and see the R² jump.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import numpy as np
X = np.linspace(-3, 3, 50).reshape(-1, 1)
rng = np.random.default_rng(0)  # fixed seed so the R² values are reproducible
y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)
# Plain linear — fails on a parabola
linear = LinearRegression().fit(X, y)
print(f"Plain linear R²: {linear.score(X, y):.3f}")
# Polynomial degree 2 — adds an x² feature
poly = Pipeline([
("poly", PolynomialFeatures(degree=2, include_bias=False)),
("lr", LinearRegression()),
]).fit(X, y)
print(f"Polynomial degree 2 R²: {poly.score(X, y):.3f}")
Lesson: Linear models can capture non-linear shapes — if you add the right features. This is what Polynomial Regression is about.
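To see why the polynomial pipeline works, it can help to look at the coefficients the linear step learns on the expanded features. The following is a minimal sketch that assumes a fixed seed (an addition for repeatability) and reads the fitted step through `named_steps`; the model is still linear, just in the columns x and x².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatable noise
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)

poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression()),
]).fit(X, y)

# With degree=2 and include_bias=False, the single input x becomes [x, x²]
coef_x, coef_x2 = poly.named_steps["lr"].coef_
print(f"coefficient on x:  {coef_x:.2f}")   # should land near 0
print(f"coefficient on x²: {coef_x2:.2f}")  # should land near 1
```

Since the data was generated as y = x² plus noise, the fit should put almost all of the weight on the x² column.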
What's next?
You've completed the Linear Regression tutorial. Next steps:
- Polynomial & Regularization — when straight lines aren't enough.
- Logistic Regression — same machinery, but for classification.
- Ensembles — tree-based models that usually outperform linear on tabular data.
Practice
What does the code below print? Expected: True
Then add a 6th data point and re-fit (the model should still find slope ≈ 2). Expected: True
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Practice: append a 6th point (10, 20) to X and y before fitting.
# The point lies on y = 2x, so the slope should remain 2.
model = LinearRegression().fit(X, y)
print(round(model.coef_[0]) == 2)
Quiz: Quick check
What you remember
Q1. After fitting a linear regression, what do model.coef_ and model.intercept_ give you?
- The slope(s) and the y-intercept (b in y = mx + b)
- The training error
- Test accuracy
- The dataset
Why: Linear regression is fully described by its coefficients and intercept.
predict(x) = (coef × x).sum() + intercept.
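The formula can be checked directly by rebuilding the predictions by hand. This is a small sketch on made-up numbers (the arrays below are illustrative, not from the tutorial's datasets):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: 4 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])
model = LinearRegression().fit(X, y)

# Manual prediction: dot product of each row with coef_, plus intercept_
manual = X @ model.coef_ + model.intercept_
matches = np.allclose(manual, model.predict(X))
print(matches)  # True
```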
Q2. If you train on 50 samples and your test R² is much lower than train R², the model is…
- Underfit
- Overfit — fits the training data well but doesn't generalize
- Just right
- Broken
Why: A large train-test gap means the model has memorized noise specific to the training set. Add regularization (Ridge), reduce model complexity, or get more training data.
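The gap is easy to reproduce on synthetic data. The sketch below assumes 50 samples and 30 features of which only one carries signal; the sizes, seed, and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
n_samples, n_features = 50, 30
X = rng.normal(size=(n_samples, n_features))
# Only feature 0 influences y; the other 29 columns are pure noise
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"train R²: {train_r2:.3f}")
print(f"test R²:  {test_r2:.3f}")
```

With 35 training rows and 30 features, the model has nearly enough free parameters to pass through every training point, so the train score should look far better than the test score.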
Q3. What does cross_val_score(model, X, y, cv=5) return?
- One score
- An array of 5 scores — one per fold
- The mean only
- The best fold
Why: 5-fold cross-validation splits the data into 5 chunks, trains on 4, tests on the 5th, rotates 5 times. The mean and std of these scores give a robust estimate of model performance.
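A quick way to confirm the return type is to run it on a small synthetic problem. This sketch assumes made-up linear data; for a regressor the default score is R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, 100)

scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.shape)  # one entry per fold
print(f"mean R²: {scores.mean():.3f}, std: {scores.std():.3f}")
```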
Common doubts
What's the first thing I should try on a new regression problem?
Start with LinearRegression() as a baseline, then try Ridge() as a regularized safeguard. If more complex models can't beat these by a meaningful margin, your features need work, not your model. If a more flexible model does beat them substantially, the relationship is probably non-linear, and a tree-based model will likely do even better.
Should I always use cross-validation?
For datasets under ~10,000 rows, yes. CV gives a more reliable estimate of generalization than a single train/test split. For very large datasets (millions of rows), a single split is fine — the variance in your estimate is already small.
How do I know when I've done 'enough' feature engineering?
When adding new features no longer improves your validation R². Track CV score as you add features; once you hit the plateau, stop and switch to model tuning. Often you spend 80% of your time on features, 20% on model choice.
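Tracking the CV score while adding features might look like the sketch below. It assumes synthetic data in which only the first two columns carry signal, so the plateau appears after the second feature; on real data the order in which you add features is your own choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
n = 300
X = rng.normal(size=(n, 5))
# Columns 0 and 1 drive y; columns 2 to 4 are pure noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1.0, n)

cv_means = []
for k in range(1, X.shape[1] + 1):
    # Score a model that uses only the first k columns
    scores = cross_val_score(LinearRegression(), X[:, :k], y, cv=5)
    cv_means.append(scores.mean())
    print(f"first {k} feature(s): mean CV R² = {scores.mean():.3f}")
```

The score should jump when the second informative column is added and then stay roughly flat, which is the signal to stop adding features.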