Try It Yourself

Three small exercises. Edit the code in each runner, press Run, and see if you can solve it.

Exercise 1 — Predict ice cream sales from temperature

Below is a dataset of daily ice cream sales (y) and the temperature that day (x).

Task: Fit a simple linear regression. Then predict sales at 35°C.

from sklearn.linear_model import LinearRegression
import numpy as np

# Temperature (°C)
X = np.array([[10], [15], [20], [25], [30], [35], [40]])
# Sales (units)
y = np.array([120, 180, 240, 310, 390, 470, 550])

# YOUR CODE HERE:
# 1. Fit a LinearRegression on X and y
# 2. Print the slope and intercept
# 3. Predict sales at 35°C
# 4. Predict sales at 50°C — does this seem realistic?

model = LinearRegression().fit(X, y)
print("slope:    ", round(model.coef_[0], 2))
print("intercept:", round(model.intercept_, 2))
print("Predict 35°C:", model.predict([[35]])[0])
print("Predict 50°C:", model.predict([[50]])[0], "← extrapolation, be careful!")

Lesson: Linear models will happily predict anything you ask, even far outside the training range. The training data only covers 10°C to 40°C, so the prediction at 50°C is extrapolation, and the linear assumption probably breaks down at extreme temperatures.
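One way to make the extrapolation risk visible is to check each query against the range seen during training. The sketch below reuses the exercise data; `predict_with_range_check` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as the exercise above
X = np.array([[10], [15], [20], [25], [30], [35], [40]])
y = np.array([120, 180, 240, 310, 390, 470, 550])
model = LinearRegression().fit(X, y)

# Range of temperatures the model has actually seen
x_min, x_max = X.min(), X.max()


def predict_with_range_check(temp_c):
    """Hypothetical helper: return (prediction, inside_training_range)."""
    inside = bool(x_min <= temp_c <= x_max)
    pred = float(model.predict([[temp_c]])[0])
    return pred, inside


for t in (35, 50):
    pred, inside = predict_with_range_check(t)
    label = "interpolation" if inside else "extrapolation"
    print(f"{t}°C -> {pred:.1f} units ({label})")
```

The model returns a number in both cases; only the range check tells you which prediction rests on observed data.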

Exercise 2 — Add a feature, watch R² go up

This exercise uses the California housing dataset. Fit one model with only MedInc and another with all 8 features, then compare their test R².

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A: Fit using only MedInc
pipe_a = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe_a.fit(X_tr[["MedInc"]], y_tr)
print("MedInc only       — Test R²:", round(pipe_a.score(X_te[["MedInc"]], y_te), 3))

# B: Fit using all 8 features
pipe_b = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe_b.fit(X_tr, y_tr)
print("All 8 features    — Test R²:", round(pipe_b.score(X_te, y_te), 3))

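Lesson: More features help, up to a point. The risk of overfitting grows as the number of features becomes large relative to the number of training rows, or when many features carry little signal. In those cases, use regularization (see Ridge/Lasso).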
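The "up to a point" part can be illustrated on synthetic data, which avoids a dataset download. In this sketch (all names and sizes are illustrative assumptions), only the first column is related to y and the remaining 30 are pure noise. Training R² can only go up as columns are added to an ordinary least-squares fit, while test R² typically stalls or drops.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 60
signal = rng.normal(size=(n_samples, 1))        # one informative feature
noise_cols = rng.normal(size=(n_samples, 30))   # 30 features unrelated to y
y = 3.0 * signal.ravel() + rng.normal(scale=1.0, size=n_samples)
X_all = np.hstack([signal, noise_cols])

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y, test_size=0.5, random_state=0)

train_scores, test_scores = [], []
for k in (1, 5, 15, 25):
    # Use the first k columns: the informative one plus k-1 noise columns
    model = LinearRegression().fit(X_tr[:, :k], y_tr)
    train_scores.append(model.score(X_tr[:, :k], y_tr))
    test_scores.append(model.score(X_te[:, :k], y_te))
    print(f"{k:>2} features  train R²: {train_scores[-1]:.3f}  test R²: {test_scores[-1]:.3f}")
```

A rising train R² alone is therefore not evidence that a new feature helps; the test (or cross-validated) score is the one to watch.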

Exercise 3 — Underfit a parabola

This dataset is shaped like y = x². Fit a linear model. See how badly it fails.

Then try adding a polynomial feature (x²) and see the R² jump.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the printed R² values are reproducible
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)

# Plain linear — fails on a parabola
linear = LinearRegression().fit(X, y)
print(f"Plain linear R²:       {linear.score(X, y):.3f}")

# Polynomial degree 2 — adds an x² feature
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr",   LinearRegression()),
]).fit(X, y)
print(f"Polynomial degree 2 R²: {poly.score(X, y):.3f}")

Lesson: Linear models can capture non-linear shapes — if you add the right features. This is what Polynomial Regression is about.

What's next?

You've completed the Linear Regression tutorial. Next steps:

Polynomial & Regularization — when straight lines aren't enough.
Logistic Regression — same machinery, but for classification.
Ensembles — tree-based models that usually outperform linear on tabular data.

Practice

What does this print?

Expected: True

# Slope of best-fit line through (1, 2), (2, 4), (3, 6) is exactly 2 (y = 2x)
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
model = LinearRegression().fit(X, y)
print(round(model.coef_[0]) == 2)

Add a 6th data point and re-fit (the model should still find slope ≈ 2)

Expected: True

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Exercise: add a 6th point, (10, 20), to X and y before fitting.
# It lies on y = 2x, so the slope should remain 2.
model = LinearRegression().fit(X, y)
print(round(model.coef_[0]) == 2)

Quiz — Quick check

What you remember

Q1. After fitting a linear regression, what do model.coef_ and model.intercept_ give you?

The slope(s) and the y-intercept (b in y = mx + b)
The training error
Test accuracy
The dataset

Why: Linear regression is fully described by its coefficients and intercept. predict(x) = (coef × x).sum() + intercept.
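The formula can be checked by hand. The sketch below uses a small made-up dataset with two features and compares a manual dot product against `model.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 4 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([5.0, 4.5, 8.0, 12.0])
model = LinearRegression().fit(X, y)

# Manual prediction: one coefficient per feature, plus the intercept
manual = X @ model.coef_ + model.intercept_

print("coef_:     ", model.coef_)
print("intercept_:", model.intercept_)
print("matches predict():", np.allclose(manual, model.predict(X)))
```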

Q2. If you train on 50 samples and your test R² is much lower than train R², the model is…

Underfit
Overfit — fits the training data well but doesn't generalize
Just right
Broken

Why: A large train-test gap means the model memorized noise specific to training. Add regularization (Ridge), reduce model complexity, or get more training data.
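As a rough sketch of the regularization remedy, the code below trains on 50 synthetic samples with 20 features, of which only the first is related to y. The data, the seed, and `alpha=10.0` are illustrative assumptions, not tuned values. Ridge gives up a little training fit in exchange for smaller coefficients, which often narrows the train-test gap, though the exact numbers depend on the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                        # 20 features, only the first matters
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=50, random_state=1)

results = {}
for name, model in [
    ("LinearRegression", LinearRegression()),
    ("Ridge(alpha=10)", Ridge(alpha=10.0)),
]:
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    results[name] = (train_r2, test_r2)
    print(f"{name:<18} train R²: {train_r2:.3f}  test R²: {test_r2:.3f}  gap: {train_r2 - test_r2:.3f}")
```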

Q3. What does cross_val_score(model, X, y, cv=5) return?

One score
An array of 5 scores — one per fold
The mean only
The best fold

Why: 5-fold cross-validation splits the data into 5 chunks, trains on 4, tests on the 5th, and rotates so that each chunk serves as the test set exactly once. The mean and standard deviation of these 5 scores give a more robust estimate of model performance than a single split.
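A minimal sketch on synthetic data (the dataset is made up for illustration) shows the shape of the return value. For a regressor, the default score is R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(scale=2.0, size=100)

# One R² score per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print("fold scores:", np.round(scores, 3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```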

Common doubts

What's the first thing I should try on a new regression problem?

LinearRegression() as a baseline, then Ridge() for safety. If more complex models can't beat these by a meaningful margin, your features need work, not your model. If they beat the baseline substantially, the relationship is likely non-linear, and a tree-based model will probably do even better.
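A baseline comparison along these lines might look like the sketch below. It uses synthetic parabola-shaped data so the non-linear case is easy to see; the choice of `RandomForestRegressor` as the tree-based candidate and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)   # clearly non-linear in x

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R² for each candidate
mean_r2 = {}
for name, model in candidates.items():
    mean_r2[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:<18} mean CV R²: {mean_r2[name]:.3f}")
```

On data like this, a large jump from the linear baselines to the tree-based model is the signal that the relationship is non-linear.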

Should I always use cross-validation?

For datasets under ~10,000 rows, yes. CV gives a more reliable estimate of generalization than a single train/test split. For very large datasets (millions of rows), a single split is fine — the variance in your estimate is already small.
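The variance argument can be seen directly on a small synthetic dataset (60 rows, made up for illustration): the test R² from a single split shifts with the random seed, while cross-validation averages over several splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * X.ravel() + rng.normal(scale=6.0, size=60)

# Ten different single train/test splits of the same data
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print(f"single-split R² ranges from {min(split_scores):.3f} to {max(split_scores):.3f}")

# 5-fold cross-validation on the same data
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(f"5-fold CV mean R²: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```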

How do I know when I've done 'enough' feature engineering?

When adding new features no longer improves your validation R². Track CV score as you add features; once you hit the plateau, stop and switch to model tuning. Often you spend 80% of your time on features, 20% on model choice.
