Visualise Data to build a Linear Regression model

What is a plot?

A visual representation of the data

Which data? How is it usually structured?

In a table. For example:

import seaborn as sns

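# Extra keyword arguments are passed on to pandas.read_csv,
# so index_col='name' makes the car name the row index.
df = sns.load_dataset('mpg', index_col='name')
df.head()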

How can you visualise this DataFrame?

We could draw one point for every car, positioned by two of its columns:
1. weight (x-axis)
2. mpg (y-axis)

sns.scatterplot(x='weight', y='mpg', data=df);

What conclusions can you draw from this plot?

Well, you may observe that the points descend as we move to the right.
This suggests that heavier cars tend to travel fewer miles per gallon (mpg).

How can you measure this relationship?

Linear Regression

from sklearn.linear_model import LinearRegression

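model = LinearRegression()
# X must be two-dimensional (hence the double brackets); y is the target column
model.fit(X=df[['weight']], y=df['mpg'])
# Inspect the fitted attributes (the exact keys depend on the scikit-learn version)
model.__dict__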

Resulting in ↓

{'fit_intercept': True,
 'normalize': False,
 'copy_X': True,
 'n_jobs': None,
 'n_features_in_': 1,
 'coef_': array([-0.00767661]),
 '_residues': 7474.8140143821,
 'rank_': 1,
 'singular_': array([16873.20281508]),
 'intercept_': 46.31736442026565}
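
To get a feel for where `coef_` and `intercept_` come from: with a single feature, the least-squares line has a closed-form solution. The sketch below is not what scikit-learn runs internally; it applies the textbook formulas to the five cars that appear in the predictions table of this tutorial, so its numbers will differ from the full-dataset fit above. The function name `fit_line` is made up for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_mean, y_mean = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Five (weight, mpg) pairs copied from the tutorial's table, not the whole dataset
weights = [3504, 3693, 3436, 3433, 3449]
mpgs = [18.0, 15.0, 18.0, 16.0, 17.0]

intercept, slope = fit_line(weights, mpgs)
print(f"mpg = {intercept:.2f} + ({slope:.5f}) * weight")
```

Even on this tiny sample the slope comes out negative, which agrees with the downward trend in the scatter plot.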

What is the mathematical formula for this relationship?

$$\text{mpg} = 46.32 - 0.00768 \cdot \text{weight}$$

This equation means that the predicted mpg drops by about 0.00768 for every extra unit of weight (pounds, in this dataset), or roughly 7.7 mpg per 1,000 pounds.
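
As a quick check of the formula, the sketch below plugs a weight into it by hand, using the `intercept_` and `coef_` values printed by the model. The helper name `predict_mpg` is only for illustration, and the slope is the rounded value shown in the output, so the last digits may differ slightly from what the fitted model returns.

```python
INTERCEPT = 46.31736442026565  # intercept_ from the fitted model
SLOPE = -0.00767661            # coef_[0] as printed (rounded)


def predict_mpg(weight):
    """Apply the fitted line: mpg = intercept + slope * weight."""
    return INTERCEPT + SLOPE * weight


# A car weighing 3504 (the first row of the dataset)
print(round(predict_mpg(3504), 2))  # 19.42

# One extra unit of weight changes the prediction by the slope
print(predict_mpg(3505) - predict_mpg(3504))  # approximately SLOPE
```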

Could you visualise this equation in a plot?

Absolutely, we could make the predictions from the original data and plot them.

Predictions

y_pred = model.predict(X=df[['weight']])

dfsel = df[['weight', 'mpg']].copy()
dfsel['prediction'] = y_pred

dfsel.head()

name                         weight   mpg  prediction
chevrolet chevelle malibu      3504  18.0   19.418523
buick skylark 320              3693  15.0   17.967643
plymouth satellite             3436  18.0   19.940532
amc rebel sst                  3433  16.0   19.963562
ford torino                    3449  17.0   19.840736

From this table, you can see that the predictions don't exactly match reality, but they approximate it.
For example, the Ford Torino's actual mpg is 17.0, but our model predicts 19.84.
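
To put a number on "approximate", one option is to look at the residuals (actual minus predicted). The sketch below does this for the five rows of the table only, with the values typed in by hand, so it says nothing about the rest of the dataset. The variable names are illustrative.

```python
# (name, actual mpg, predicted mpg) copied from the table
rows = [
    ("chevrolet chevelle malibu", 18.0, 19.418523),
    ("buick skylark 320", 15.0, 17.967643),
    ("plymouth satellite", 18.0, 19.940532),
    ("amc rebel sst", 16.0, 19.963562),
    ("ford torino", 17.0, 19.840736),
]

# Residual = actual - predicted; negative means the model predicted too high
residuals = [actual - predicted for _, actual, predicted in rows]
for (name, _, _), residual in zip(rows, residuals):
    print(f"{name:<26} {residual:+.2f}")

# Average size of the error over these five cars
mean_abs_error = sum(abs(r) for r in residuals) / len(residuals)
print(f"mean absolute error: {mean_abs_error:.2f}")
```

For these five cars every residual is negative, meaning the model predicts a higher mpg than was actually observed.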

Model Visualization

sns.scatterplot(x='weight', y='mpg', data=dfsel)
sns.scatterplot(x='weight', y='prediction', data=dfsel);

The blue points represent the actual data.
The orange points represent the predictions of the model.

I teach Python, R, Statistics & Data Science. I like to produce content that helps people to understand these topics better.

Feel free to give me feedback; I would like to make my tutorials clearer and create content that interests you 🤗
