#05 | The k-Means & Unsupervised Clustering Models

Challenge Importance

The time has come to add another layer to the hierarchy of Machine Learning models.

Do we have the variable we want to predict in the dataset?

YES: Supervised Learning

Predicting a Numerical Variable → Regression
Predicting a Categorical Variable → Classification

NO: Unsupervised Learning

Group Data Points based on Explanatory Variables → Cluster Analysis

We may have, for example, all football players, and we want to group them based on their performance. But we don't know the groups beforehand. So what do we do then?

We apply Unsupervised Machine Learning models to group the players based on their position in the space (determined by the explanatory variables): the closer the players are to the space, the more likely they'll be drawn to the same group.

Another typical example comes from e-commerces that don't know if their customers like clothing or tech. But they know how they interact on the website. Therefore, they group the customers to send promotional emails that align with their likes.

In short, we close the circle with the different types of Machine Learning models by adding this new type.

Let's now develop the Python code.

Load the Data

Imagine for a second you are the President of the United States of America, and you are considering creating campaigns to reduce car accidents due to alcohol consumption controlling the impact of insurance companies' losses (columns).

You won't create 51 TV campaigns for each of the States of the USA (rows). Instead, you will see which States behave similarly to cluster them into three groups.

import seaborn as sns #!
import pandas as pd

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'ins_losses']]
df_crashes

Data Preprocessing

Missing Data

We don't have any missing data in any of the columns:

df_crashes.isna().sum()

alcohol 0 ins_losses 0 dtype: int64

Dummy Variables

Neither we need to convert categorical columns to dummy variables because the two we are considering are numerical.

df_crashes

How do we compute a k-Means Model in Python?

We should know from previous chapters that we need a function accessible from a Class in the library sklearn.

Import the Class

from sklearn.cluster import KMeans

Instantiante the Class

Create a copy of the original code blueprint to not "modify" the source code.

model_km = KMeans()

Access the Function

The theoretical action we'd like to perform is the one we executed in previous chapters. Therefore, the function to compute the Machine Learning model should be called the same way:

model_km.fit()

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

Input In [6], in <cell line: 1>() ----> 1 model_km.fit()

TypeError: fit() missing 1 required positional argument: 'X'

The previous types of models asked for two parameters:

y: target ~ independent ~ label ~ class variable
X: explanatory ~ dependent ~ feature variables

Why is it asking for just one parameter now, X?

As we said before, this model (unsupervised learning) doesn't know how the groups are calculated beforehand; they know after we compute the Machine Learning model. Therefore, they don't need to see the target variable y.

Separate the Variables

We don't need to separate the variables because we have explanatory ones.

Fit the Model

model_km.fit(X=df_crashes)

KMeans()

Predictions

Calculate Predictions

We have a fitted KMeans. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_km.predict(X=df_crashes)

array([7, 3, 0, 7, 6, 7, 6, 1, 3, 7, 7, 5, 4, 7, 0, 5, 3, 3, 2, 4, 2, 3, 1, 3, 1, 7, 4, 5, 7, 5, 1, 5, 1, 3, 0, 3, 6, 0, 1, 1, 5, 4, 1, 1, 0, 0, 1, 0, 1, 0, 5], dtype=int32)

We wanted to calculate three groups, but Python is calculating eight groups. Let's modify this hyperparameter of the KMeans model:

model_km = KMeans(n_clusters=3)
model_km.fit(X=df_crashes)
model_km.predict(X=df_crashes)

array([0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 2, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1], dtype=int32)

Add a New Column with the Predictions

Let's create a new DataFrame to keep the original dataset untouched:

df_pred = df_crashes.copy()

And add the predictions:

df_pred['pred_km'] = model_km.predict(X=df_crashes)
df_pred

How can we see the groups in the plot?

Model Visualization & Interpretation

Can you observe that the k-Means only considers the variable ins_losses to determine the group the point belongs to? Why?

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                palette='Set1', data=df_pred);

The model measures the distance between the points. They seem to be spread around the plot but aren't; the plot doesn't place the points in perspective (it's lying to us).

Learn how to become an independent Machine Learning programmer who knows when to apply any ML algorithm to any dataset.

KMeans Algorithm

Take a look at the following video to understand how the KMeans algorithm computes the Mathematical Equation by calculating distances:

https://www.youtube.com/watch?v=4b5d3muPQmA

The model understands the data as follows:

import matplotlib.pyplot as plt

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                palette='Set1', data=df_pred)
plt.xlim(0, 200)
plt.ylim(0, 200);

Now it's evident why the model only took into account ins_losses: it barely sees significant distances within alcohol compared to ins_losses.

In other words, with a metaphor: it is not the same to increase one kg of weight than one meter of height.

Then, how can we create a KMeans model that compares the two variables equally?

We need to scale the data (i.e., transforming the values into the same range: from 0 to 1) with the MinMaxScaler.

`MinMaxScaler()` the data

As with any other algorithm within the sklearn library, we need to:

Import the Class
Create the instance
fit() the numbers of the mathematical equation
predict/transform the data with the mathematical equation

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_crashes)
data = scaler.transform(df_crashes)
data[:5]

array([[0.47921847, 0.55636883], [0.34718769, 0.45684192], [0.42806394, 0.24636258], [0.50100651, 0.5323574 ], [0.20923623, 0.73980184]])

To better understand the information, let's convert the array into a DataFrame:

df_scaled = pd.DataFrame(data, columns=df_crashes.columns, index=df_crashes.index)
df_scaled

k-Means Model with Scaled Data

Fit the Model

model_km.fit(X=df_scaled)

KMeans(n_clusters=3)

Predictions

Calculate Predictions

We have a fitted KMeans. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_km.predict(X=df_scaled)

array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 0, 1, 0, 0], dtype=int32)

Add a New Column with the Predictions

df_pred['pred_km_scaled'] = model_km.predict(X=df_scaled)
df_pred

Model Visualization & Interpretation

We can observe now that both alcohol and ins_losses are taken into account by the model to calculate the cluster a point belongs to.

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                palette='Set1', data=df_pred);

Takeaway

From now on, we should understand that every time a model calculates distances between variables of different numerical ranges, we need to scale the data to compare them properly.

The following figure gives an overview of everything that has happened so far:

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 7))

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                data=df_pred, palette='Set1', ax=ax1);

sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
                data=df_scaled, palette='Set1', ax=ax2);

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                data=df_pred, palette='Set1', ax=ax3);

sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
                data=df_scaled, palette='Set1', ax=ax4);

ax3.set_xlim(0, 200)
ax3.set_ylim(0, 200)

ax4.set_xlim(0, 200)
ax4.set_ylim(0, 200)

ax1.set_title('KMeans w/ Original Data & Liar Plot')
ax2.set_title('KMeans w/ Scaled Data & Perspective Plot')
ax3.set_title('KMeans w/ Original Data & Perspective Plot')
ax4.set_title('KMeans w/ Original Data & Perspective Plot')

plt.tight_layout()

Other `Clustering` Models in Python

Visit the sklearn website to see how many different clustering methods are and how they differ from each other.

Let's pick two new models and compute them:

Agglomerative Clustering

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

Fit the Model

from sklearn.cluster import AgglomerativeClustering

model_ac = AgglomerativeClustering(n_clusters=3)
model_ac.fit(df_scaled)

AgglomerativeClustering(n_clusters=3)

Calculate Predictions

model_ac.fit_predict(X=df_scaled)

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1])

Create a New Column for the Predictions

df_pred['pred_ac'] = model_ac.fit_predict(X=df_scaled)
df_pred

Visualize the Model

We can observe how the second group takes three points with the Agglomerative Clustering while the KMeans gather five points in the second group.

As they are different algorithms, they are expected to produce different results. If you'd like to understand which models you should use, you may know how the algorithm works. We don't explain it in this series because we want to make it simple.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                data=df_pred, palette='Set1', ax=ax1);

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
                data=df_pred, palette='Set1', ax=ax2)

ax1.set_title('KMeans')
ax2.set_title('Agglomerative Clustering');

Spectral Clustering

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

Fit the Model

from sklearn.cluster import SpectralClustering

model_sc = SpectralClustering(n_clusters=3)
model_sc.fit(df_scaled)

SpectralClustering(n_clusters=3)

Calculate Predictions

model_sc.fit_predict(X=df_scaled)

array([0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 0, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 2], dtype=int32)

Create a New Column for the Predictions

df_pred['pred_sc'] = model_sc.fit_predict(X=df_scaled)
df_pred

Visualize the Model

Let's visualize all models together and appreciate the minor differences because they cluster the groups differently.

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

ax1.set_title('KMeans')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                data=df_pred, palette='Set1', ax=ax1);

ax2.set_title('Agglomerative Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
                data=df_pred, palette='Set1', ax=ax2);

ax3.set_title('Spectral Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_sc',
                data=df_pred, palette='Set1', ax=ax3);

Takeaway

Once again, you don't need to know the maths behind every Machine Learning model to build them. However, I hope you are getting a sense of the patterns behind the Scikit-Learn library with this series of tutorials.

Use Case Conclusion

Let's arbitrarily choose the Agglomerative Clustering as our model and get back to you being the President of the USA. How would you describe the groups?

Higher ins_losses and lower alcohol
Lower ins_losses and lower alcohol
Lower ins_losses and higher alcohol

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1');

You would create different messages on the TV campaigns for the three groups separately and avoid deploying many more resources to develop fifty-one various TV campaigns (one for each State), which doesn't make sense because many of them are similar.

Command Palette

Challenge Importance

Load the Data

Data Preprocessing

Missing Data

Dummy Variables

How do we compute a k-Means Model in Python?

Import the Class

Instantiante the Class

Access the Function

Separate the Variables

Fit the Model

Predictions

Calculate Predictions

Add a New Column with the Predictions

Model Visualization & Interpretation

KMeans Algorithm

MinMaxScaler() the data

k-Means Model with Scaled Data

Fit the Model

Predictions

Calculate Predictions

Add a New Column with the Predictions

Model Visualization & Interpretation

Takeaway

Other Clustering Models in Python

Agglomerative Clustering

Fit the Model

Calculate Predictions

Create a New Column for the Predictions

Visualize the Model

Spectral Clustering

Fit the Model

Calculate Predictions

Create a New Column for the Predictions

Visualize the Model

Takeaway

Use Case Conclusion

Comments

Machine Learning

#04 | Overfitting & Hyperparameter Tuning with Cross Validation

More from this blog

`MinMaxScaler()` the data

Other `Clustering` Models in Python