#06 | The Principal Component Analysis (PCA) & Dimensionality Reduction Techniques

Learn how to group variables to plot seven variables in a 2-dimensional plot and why PCA is an essential technique to visualize clustering models.


© Jesús López 2022

Ask him any questions on Twitter or LinkedIn.

Chapter Importance

We used just two variables out of the seven we had in the whole DataFrame.

We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been harder to plot seven variables with seven axes in a graph.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

  • Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for dimensionality reduction techniques. Specifically, the Principal Component Analysis (PCA).

Load the Data

Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce car accidents.

You won't create 51 TV campaigns, one for each row of the dataset (the 50 States plus the District of Columbia). Instead, you will see which States behave similarly and cluster them into 3 groups based on the variation across their features (columns).

import seaborn as sns

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes

[Image df1.jpeg: the df_crashes DataFrame — 51 States × 7 columns]

Check this website to understand what each column of the following data measures.

Data Preprocessing

From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.

For example, an increase of 1 kg in weight is not comparable to an increase of 1 m in height.

We will use the StandardScaler() class:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]
array([[ 0.73744574,  1.1681476 ,  0.43993758,  1.00230055,  0.27769155,
        -0.58008306,  0.4305138 ],
       [ 0.56593556,  1.2126951 , -0.21131068,  0.60853209,  0.80725756,
         0.94325764, -0.02289992],
       [ 0.68844283,  0.75670887,  0.18761539,  0.45935701,  1.03314134,
         0.0708756 , -0.98177845],
       [ 1.61949811, -0.48361373,  0.54740815,  1.67605228,  1.95169961,
        -0.33770122,  0.32112519],
       [-0.92865317, -0.39952407, -0.8917629 , -0.594276  , -0.89196792,
        -0.04841772,  1.26617765]])
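As a quick sanity check (not part of the original code), the same numbers can be reproduced by hand with the standardization formula z = (x - mean) / std, applied column by column:

# StandardScaler standardizes with the population standard deviation (ddof=0)
manual_scaled = (df_crashes - df_crashes.mean()) / df_crashes.std(ddof=0)
manual_scaled.head()  # matches data_scaled[:5] up to floating-point precision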

Let's turn the array into a DataFrame for better understanding:

import pandas as pd

df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled

[Image df2.jpeg: the df_scaled DataFrame]

Now all the variables are on the same scale (i.e., within roughly the same range):

df_scaled.agg(['min', 'max'])

[Image df3.jpeg: min and max of every column in df_scaled]

k-Means Model in Python

We follow the usual Scikit-Learn procedure to develop Machine Learning models.

Import the Class

from sklearn.cluster import KMeans

Instantiate the Class

model_km = KMeans(n_clusters=3)

Fit the Model

model_km.fit(X=df_scaled)
KMeans(n_clusters=3)

Calculate Predictions

model_km.predict(X=df_scaled)
array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
       2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1], dtype=int32)
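As a side note (a quick check, not in the original workflow), k-Means stores the cluster assignment of each training row in labels_, so predicting on the same data it was fitted on returns exactly the same array:

import numpy as np

# The training assignments and the predictions on the training data coincide
np.array_equal(model_km.labels_, model_km.predict(X=df_scaled))  # True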

Create a New DataFrame for the Predictions

df_pred = df_scaled.copy()

Create a New Column for the Predictions

df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred

[Image df4.jpeg: df_pred with the new pred column]
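As an optional extra (not shown in the original), a one-liner tells us how many States ended up in each cluster:

# Number of States assigned to each of the 3 clusters
df_pred['pred'].value_counts()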

Visualize the Model

Now let's visualize the clusters with a 2-axis plot:

sns.scatterplot(x='total', y='speeding', hue='pred',
                data=df_pred, palette='Set1');

[Image plot1.jpeg: scatterplot of total vs speeding, coloured by predicted cluster]

Model Interpretation

Does the visualization make sense?

  • No, because each cluster should keep its points separated from the other clusters. Nevertheless, we see some green points in the middle of the blue cluster.

[Image kmeans_mixed.jpeg: green points mixed inside the blue cluster]

Why is this happening?

  • We are representing only 2 of the 7 variables the model was fitted with. The points don't look separated because the plot is missing the other 5 variables.

Why don't we add 5 variables to the plot then?

  • We could, but it'd be way too hard to interpret (see the pairplot sketch below).
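To make this concrete, here is a minimal sketch (not in the original) of what plotting everything at once looks like: a seaborn pairplot of the seven scaled variables, coloured by cluster, already produces a 7×7 grid of panels — far too many to read comfortably:

# One scatterplot per pair of variables, coloured by the predicted cluster
sns.pairplot(df_pred, hue='pred', palette='Set1');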

Then, what could we do?

  • We can apply PCA (Principal Component Analysis), a dimensionality reduction technique, which we develop step by step below.

Grouping Variables with PCA()

Transform Data to Components

PCA() is another technique used to transform data.

How has the data been manipulated so far?

  1. Original Data: df_crashes
df_crashes

[Image df5.jpeg: the original df_crashes DataFrame]

  2. Normalized Data: df_scaled
df_scaled

[Image df6.jpeg: the normalized df_scaled DataFrame]

  3. Principal Components Data: df_pca (now)
from sklearn.decomposition import PCA

pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]
array([[ 1.60367129,  0.13344927,  0.31788093, -0.79529296, -0.57971878,
         0.04622256,  0.21018495],
       [ 1.14421188,  0.85823399,  0.73662642,  0.31898763, -0.22870123,
        -1.00262531,  0.00896585],
       [ 1.43217197, -0.42050562,  0.3381364 ,  0.55251314,  0.16871805,
        -0.80452278, -0.07610742],
       [ 2.49158352,  0.34896812, -1.78874742,  0.26406388, -0.37238226,
        -0.48184939, -0.14763646],
       [-1.75063825,  0.63362517, -0.1361758 , -0.97491605, -0.31581147,
         0.17850962, -0.06895829]])
df_pca = pd.DataFrame(data_pca)
df_pca

[Image df7.jpeg: df_pca with default numeric column names]

cols_pca = [f'PC{i}' for i in range(1, pca.n_components_+1)]
cols_pca
['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']
df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca

[Image df8.jpeg: df_pca with PC1–PC7 column names]

Visualize Components & Clusters

Let's visualize a scatterplot with PC1 & PC2 and colour points by cluster:

import plotly.express as px

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)

Are they mixed now?

  • No, they aren't.

That's because PC1 and PC2 together represent almost 80% of the variability of the original seven variables.

You can see the following array, where every element represents the amount of variability explained by every component:

pca.explained_variance_ratio_
array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011   ,
       0.02837999, 0.00392491])

And the cumulative variability (≈79.89% up to PC2):

pca.explained_variance_ratio_.cumsum()
array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 ,
       0.99607509, 1.        ])
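One common way to visualize these numbers (a minimal sketch, assuming matplotlib is available, as it is installed alongside seaborn) is a cumulative "scree" plot:

import matplotlib.pyplot as plt

# Cumulative explained variance per number of components
cum_var = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(0.8, color='grey', linestyle='--')  # ~80% reference line
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()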

Which variables represent these two components?

Relationship between Original Variables & Components

Loading Vectors

The Principal Components are produced by a mathematical equation (once again): each one is a linear combination of the original variables with the following weights:

df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights

[Image df9.jpeg: df_weights — loading vectors for each Principal Component]

We can observe that:

  1. Crash-related features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (a stronger influence) in PC1.
  2. Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2 (see the heatmap sketch below).
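A quick way to see this at a glance (a sketch, not in the original) is to draw the loading vectors of the first two components as a heatmap:

# Weights of each original variable in PC1 and PC2
sns.heatmap(df_weights[['PC1', 'PC2']], annot=True, cmap='coolwarm', center=0);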

Principal Component Analysis is a technique that gathers the maximum variability of a set of features (variables) into a smaller number of Components.

Therefore, the first two Principal Components capture a good amount of the shared information because we see two sets of variables that are strongly correlated with each other:

Correlation Matrix

df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);

[Image plot3.jpeg: correlation heatmap of the scaled variables]

I hope that everything is making sense so far.

To complete the explanation, you can see below how the df_pca values are computed:

Calculating One PCA Value

For example, we can multiply the weights of PC1 by Alabama's scaled values:

(df_weights['PC1']*df_scaled.loc['AL']).sum()
1.6036712920638672

This gives the transformed value of Principal Component 1 for the Alabama State, as you can see in the first row of df_pca:

df_pca.head()

[Image df10.jpeg: first rows of df_pca]

The same operation applies to any value of df_pca.
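More generally (a sketch; it works here because df_scaled is already centered, which is what PCA expects), the whole df_pca table is the matrix product of the scaled data and the transposed loading vectors:

import numpy as np

# Every PCA value is the dot product of a scaled row with a loading vector
np.allclose(df_scaled.values @ pca.components_.T, df_pca.values)  # True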

PCA & Cluster Interpretation

Now, let's go back to the PCA plot:

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))

How can we interpret the clusters with the components?

Let's add information to the points thanks to the interactive plots from the plotly library:

hover = '''
<b>%{customdata[0]}</b><br><br>
PC1: %{x}<br>
Total: %{customdata[1]}<br>
Alcohol: %{customdata[2]}<br><br>
PC2: %{y}<br>
Ins Losses: %{customdata[3]}<br>
Ins Premium: %{customdata[4]}
'''

fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
                 color=df_pred.pred.astype(str),
                 hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,
                             df_crashes.ins_losses, df_crashes.ins_premium])

fig.update_traces(hovertemplate = hover)

If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the min and max values across the crash-related features:

df_crashes.agg(['min', 'max'])

[Image min_max_x.png: min and max of every column in df_crashes]

df_crashes.loc[['DC', 'SC'],:]

[Image x-axis.png: the DC and SC rows of df_crashes]

Apply the same reasoning to the two most extreme points along the y-axis. You will see the same for the insurance variables because they determine the positioning along PC2 (the y-axis).

df_crashes.agg(['min', 'max'])

[Image min_max_y.png: min and max of every column in df_crashes]

df_crashes.loc[['ID', 'LA'],:]

[Image y-axis.png: the ID and LA rows of df_crashes]
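As an optional cross-check (not in the original), the extreme States on each axis can be found directly from df_pca, which should line up with the States highlighted above (e.g., DC and SC on PC1):

# States with the lowest and highest score on each of the first two components
df_pca[['PC1', 'PC2']].agg(['idxmin', 'idxmax'])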

Is there a way to represent the weights of the original data for the Principal Components and the points?

That's called a Biplot, which we build next.

Biplot

We can observe how the points are positioned along the loading vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.

import numpy as np

# Loading vectors scaled by the standard deviation of each component (standard biplot scaling)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Explained variance ratio as a percentage, for the axis labels
evr = (pca.explained_variance_ratio_ * 100).round(2)

fig = px.scatter(df_pca, x='PC1', y='PC2',
                 color=model_km.labels_.astype(str),
                 hover_name=df_pca.index,
                labels={
                    'PC1': f'PC1 ~ {evr[0]}%',
                    'PC2': f'PC2 ~ {evr[1]}%'
                })

for i, feature in enumerate(df_scaled.columns):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1],
        line=dict(color="red",width=3)
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )

fig.show()

Conclusion

Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great at grouping variables that behave similarly and later visualizing many variables in just a few components.

In short, you are simplifying the information in the data. In this example, we went from plotting seven dimensions to only two. This simplification isn't free, though: the two components retain only around 80% of the data's original variability.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
