# #06 | The Principal Component Analysis (PCA) & Dimensionality Reduction Techniques

Nov 8, 2021·

## Chapter Importance

We used just two variables out of the seven we had in the whole DataFrame.

We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been harder to plot seven variables with seven axes in a graph.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

• Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for dimensionality reduction techniques. Specifically, the Principal Component Analysis (PCA).

Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce car accidents.

You won't create 51 TV campaigns, one for each of the States of the USA (rows). Instead, you will see which States behave similarly to cluster them into 3 groups based on the variation across their features (columns).

``````import seaborn as sns #!

df_crashes
``````

Check this website to understand the measures of the following data.

## Data Preprocessing

From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.

For example, it is not the same to increase 1kg of weight than 1m of height.

We will use `StandardScaler()` algorithm:

``````from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]
``````
``````array([[ 0.73744574,  1.1681476 ,  0.43993758,  1.00230055,  0.27769155,
-0.58008306,  0.4305138 ],
[ 0.56593556,  1.2126951 , -0.21131068,  0.60853209,  0.80725756,
0.94325764, -0.02289992],
[ 0.68844283,  0.75670887,  0.18761539,  0.45935701,  1.03314134,
0.0708756 , -0.98177845],
[ 1.61949811, -0.48361373,  0.54740815,  1.67605228,  1.95169961,
-0.33770122,  0.32112519],
[-0.92865317, -0.39952407, -0.8917629 , -0.594276  , -0.89196792,
-0.04841772,  1.26617765]])
``````

Let's turn the array into a DataFrame for better understanding:

``````import pandas as pd

df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled
``````

Now we see all the variables having the same scale (i.e., around the same limits):

``````df_scaled.agg(['min', 'max'])
``````

## k-Means Model in Python

We follow the usual Scikit-Learn procedure to develop Machine Learning models.

### Import the Class

``````from sklearn.cluster import KMeans
``````

### Instantiate the Class

``````model_km = KMeans(n_clusters=3)
``````

### Fit the Model

``````model_km.fit(X=df_scaled)
``````
``````KMeans(n_clusters=3)
``````

### Calculate Predictions

``````model_km.predict(X=df_scaled)
``````
``````array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1], dtype=int32)
``````

### Create a New DataFrame for the Predictions

``````df_pred = df_scaled.copy()
``````

### Create a New Column for the Predictions

``````df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred
``````

### Visualize the Model

Now let's visualize the clusters with a 2-axis plot:

``````sns.scatterplot(x='total', y='speeding', hue='pred',
data=df_pred, palette='Set1');
``````

### Model Interpretation

Does the visualization make sense?

• No, because the clusters should separate their points from others. Nevertheless, we see some green points in the middle of the blue cluster.

Why is this happening?

• We are just representing 2 variables where the model was fitted with 7 variables. We can't see the points separated as we miss 5 variables in the plot.

Why don't we add 5 variables to the plot then?

• We could, but it'd be a way too hard to interpret.

Then, what could we do?

• We can apply PCA, a dimensionality reduction technique. Take a look at the following video to understand this concept:

## Grouping Variables with `PCA()`

### Transform Data to Components

`PCA()` is another technique used to transform data.

How has the data been manipulated so far?

1. Original Data `df_crashes`
``````df_crashes
``````

1. Normalized Data `df_scaled`
``````df_scaled
``````

1. Principal Components Data `dfpca` (now)
``````from sklearn.decomposition import PCA

pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]
``````
``````array([[ 1.60367129,  0.13344927,  0.31788093, -0.79529296, -0.57971878,
0.04622256,  0.21018495],
[ 1.14421188,  0.85823399,  0.73662642,  0.31898763, -0.22870123,
-1.00262531,  0.00896585],
[ 1.43217197, -0.42050562,  0.3381364 ,  0.55251314,  0.16871805,
-0.80452278, -0.07610742],
[ 2.49158352,  0.34896812, -1.78874742,  0.26406388, -0.37238226,
-0.48184939, -0.14763646],
[-1.75063825,  0.63362517, -0.1361758 , -0.97491605, -0.31581147,
0.17850962, -0.06895829]])
``````
``````df_pca = pd.DataFrame(data_pca)
df_pca
``````

``````cols_pca = [f'PC{i}' for i in range(1, pca.n_components_+1)]
cols_pca
``````
``````['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']
``````
``````df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca
``````

### Visualize Components & Clusters

Let's visualize a scatterplot with `PC1` & `PC2` and colour points by cluster:

``````import plotly.express as px

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)
``````

Are they mixed now?

• No, they aren't.

That's because both PC1 and PC2 represent almost 80% of the variability of the original seven variables.

You can see the following array, where every element represents the amount of variability explained by every component:

``````pca.explained_variance_ratio_
``````
``````array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011   ,
0.02837999, 0.00392491])
``````

And the accumulated variability (79.88% until PC2):

``````pca.explained_variance_ratio_.cumsum()
``````
``````array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 ,
0.99607509, 1.        ])
``````

Which variables represent these two components?

## Relationship between Original Variables & Components

The Principal Components are produced by a mathematical equation (once again), which is composed of the following weights:

``````df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights
``````

We can observe that:

1. Socio-demographical features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (higher influence) in PC1.
2. Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2.

Principal Components is a technique that gathers the maximum variability of a set of features (variables) into Components.

Therefore, the two first Principal Components accurate a good amount of common data because we see two sets of variables that are correlated with each other:

### Correlation Matrix

``````df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);
``````

I hope that everything is making sense so far.

To ultimate the explanation, you can see below how `df_pca` values are computed:

### Calculating One PCA Value

For example, we can multiply the weights of PC1 with the original variables for ALabama:

``````(df_weights['PC1']*df_scaled.loc['AL']).sum()
``````
``````1.6036712920638672
``````

To get the transformed value of the Principal Component 1 for ALabama State:

``````df_pca.head()
``````

The same operation applies to any value of `df_pca`.

## PCA & Cluster Interpretation

Now, let's go back to the PCA plot:

``````px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))
``````

How can we interpret the clusters with the components?

Let's add information to the points thanks to animated plots from `plotly` library:

``````hover = '''
<b>%{customdata[0]}</b><br><br>
PC1: %{x}<br>
Total: %{customdata[1]}<br>
Alcohol: %{customdata[2]}<br><br>
PC2: %{y}<br>
Ins Losses: %{customdata[3]}<br>
'''

fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
color=df_pred.pred.astype(str),
hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,

fig.update_traces(hovertemplate = hover)
``````

If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the `min` and `max` values across socio-demographical features:

``````df_crashes.agg(['min', 'max'])
``````

``````df_crashes.loc[['DC', 'SC'],:]
``````

Apply the same reasoning over the two most extreme points along the y-axis. You will see the same for the insurance variables because they determine the positioning of the PC2 (y-axis).

``````df_crashes.agg(['min', 'max'])
``````

``````df_crashes.loc[['ID', 'LA'],:]
``````

Is there a way to represent the weights of the original data for the Principal Components and the points?

That's called a Biplot, which we will see later.

### Biplot

We can observe how we position the points along the loadings vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.

``````import numpy as np

evr = pca.explained_variance_ratio_.round(2)

fig = px.scatter(df_pca, x='PC1', y='PC2',
color=model_km.labels_.astype(str),
hover_name=df_pca.index,
labels={
'PC1': f'PC1 ~ {evr[0]}%',
'PC2': f'PC2 ~ {evr[1]}%'
})

for i, feature in enumerate(df_scaled.columns):
type='line',
x0=0, y0=0,
line=dict(color="red",width=3)
)
ax=0, ay=0,
xanchor="center",
yanchor="bottom",
text=feature,
)

fig.show()
``````

## Conclusion

Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great for grouping variables that behave similarly and later visualising many variables in just one component.

In short, you are simplifying the information of the data. In this example, we simplify the data from plotting seven to only two dimensions. Although we don't get this for free because we explain around 80% of the data's original variability.