# #04 | Overfitting & Hyperparameter Tuning with Cross Validation

## Learn how to compute the best configuration of hyperparameters for the same Machine Learning model/algorithm step-by-step.

### Table of contents

- Chapter Importance
- Load the Data
- Preprocess the Data
- Feature Selection
- Train Test Split
- DecisionTreeClassifier() with Default Hyperparameters
- DecisionTreeClassifier() with Custom Hyperparameters
- GridSearchCV() to find Best Hyperparameters
- Other Models
- KNeighborsClassifier()
- Best Model with Best Hyperparameters

## Chapter Importance

We have already covered:

- Regression Models
- Classification Models
- Train Test Split for Model Selection

In short, we have learned to predict numerical and categorical variables with Regression and Classification models, respectively.

However, computing one model is not enough; we need to compare different models and choose the one whose predictions are closest to reality.

We cannot evaluate the model on the same data we used to `.fit()` (train) the mathematical equation (model). Therefore, we need to separate the data into train and test sets: the former to train the model, the latter to evaluate it.

Now we add an extra layer of complexity: we can improve a model (an algorithm) by configuring its hyperparameters. This chapter is about **computing different combinations of a single model's hyperparameters** to get the best one.

## Load the Data

- The goal of this dataset is
  - To predict if **bank's customers** (rows) `default` next month
  - Based on their **socio-demographical characteristics** (columns)

```
import pandas as pd
pd.set_option("display.max_columns", None)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_credit = pd.read_excel(io=url, header=1, index_col=0)
df_credit.sample(10)
```

## Preprocess the Data

### Missing Data

The function `.fit()` needs all the cells in the DataFrame to contain a value. Otherwise, it won't know how to process the row and compare it to others. A missing cell shows up as `NaN`, which means "Not a Number" (i.e., a cell for which we don't have any information).

```
df_credit.isna().sum()
```

```
LIMIT_BAL 0
SEX 0
..
PAY_AMT6 0
default payment next month 0
Length: 24, dtype: int64
```

```
df_credit.isna().sum().sum()
```

```
0
```
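The credit data has no missing cells, so the check above returns zeros. As a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from the credit dataset), this is what the same check would report if values were missing, together with one simple remedy: dropping the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing cells (not the credit dataset)
df_toy = pd.DataFrame({
    'age': [25, np.nan, 40],
    'limit': [1000, 2000, np.nan],
})

# Count missing values per column, then in total
missing_per_column = df_toy.isna().sum()
missing_total = int(missing_per_column.sum())

# Simplest remedy: drop every row that contains at least one NaN
df_clean = df_toy.dropna()

print(missing_per_column)
print(missing_total, len(df_clean))
```

Dropping rows is only one option; whether it is appropriate depends on how much data would be lost.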

### Dummy Variables

The function `.fit()` needs the values to be numeric. Otherwise, Python won't know the position on the axes at which to place the point.

Therefore, the categories of categorical columns are usually transformed into new columns (one new column per category) that contain 1s and 0s depending on whether the person is or is not in the category.

Nevertheless, **we don't need to create dummy variables** because the data contains numerical variables only.
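For reference, here is a minimal sketch of what that transformation could look like with `pd.get_dummies()`. The toy DataFrame and its `city` column are hypothetical and only serve to illustrate the idea; nothing like this is needed for the credit data.

```python
import pandas as pd

# Hypothetical toy data with one categorical column (not the credit dataset)
df_toy = pd.DataFrame({
    'age': [25, 32, 40],
    'city': ['Madrid', 'Paris', 'Madrid'],
})

# One new column per category; dtype=int yields 1s and 0s
df_dummies = pd.get_dummies(df_toy, columns=['city'], dtype=int)
print(df_dummies)
```

Each row gets a 1 in the column of its own category and a 0 in the others.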

## Feature Selection

So far, we have used the naming standard of **target** and **features**. Nevertheless, the most common names on the Internet are **X** (features) and **y** (target). Let's get used to them:

```
y = df_credit.iloc[:, -1]
X = df_credit.iloc[:, :-1]
```

## Train Test Split

From the previous chapter, we should already know that we need to separate the data into train and test sets if we want to evaluate the model's predictive capability on data it hasn't seen yet.

In our case, we'd like to predict whether new credit card customers will default next month. As we don't have the data for the next month (it's the future), we hold out part of the existing data with the function `train_test_split()` and treat it as if it were new.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
```
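To see what the split produces, here is a small sketch on synthetic data generated with `make_classification()`. The data is a stand-in assumed for illustration, not the credit dataset; the point is only that roughly two thirds of the rows end up in the train set and one third in the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Rows are divided between the two sets; columns stay the same
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so different models are compared on the same rows.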

## `DecisionTreeClassifier()` with Default Hyperparameters

To compute a Machine Learning model with the **default hyperparameters**, we apply the same procedure we have covered in previous chapters:

```
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier()
```

### Accuracy

We can see the model is almost perfect at predicting the training data (over 99.9% accuracy). Nevertheless, it does much worse on the test data (about 73% accuracy). This gap tells us that the model is **overfitting**.

#### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.9995024875621891
```

#### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.7265656565656565
```
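The same pattern can be reproduced on noisy synthetic data. The sketch below is an illustration under assumed data (generated with `make_classification()`, not the credit dataset): an unconstrained tree scores nearly perfectly on the rows it was trained on and clearly worse on held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption; not the credit dataset).
# flip_y randomly reassigns a share of the labels to simulate noise.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Unconstrained tree: free to memorise the training rows, noise included
tree_demo = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

acc_train = tree_demo.score(X_tr, y_tr)
acc_test = tree_demo.score(X_te, y_te)
gap = acc_train - acc_test
print(f'train={acc_train:.3f} test={acc_test:.3f} gap={gap:.3f}')
```

A large positive gap between train and test accuracy is the usual symptom of overfitting.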

### Model Visualization

I'll use the following visualization to explain the concept of overfitting.

```
from sklearn.tree import plot_tree
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

The tree is big because we have a lot of people (20,100), and we haven't set any limit on the model.

How many people do you think we have in the deepest leaf?

- Very few, probably one.

Are these people characteristic of the overall data? Or are they infrequent?

- They are infrequent. Because the model is complex enough to memorise such rare cases, it overfits, and we get a vast difference between train and test accuracies.
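We don't need to read the plot to answer these questions. As a sketch on synthetic data (assumed for illustration, not the credit dataset), a fitted tree exposes its depth, its number of leaves, and how many training rows reach each leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=42)

tree_full = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
tree_small = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

for name, tree in [('unconstrained', tree_full), ('max_depth=3', tree_small)]:
    # Leaves are the nodes without children (children_left == -1);
    # n_node_samples holds the number of training rows reaching each node
    is_leaf = tree.tree_.children_left == -1
    smallest_leaf = tree.tree_.n_node_samples[is_leaf].min()
    print(name, tree.get_depth(), tree.get_n_leaves(), smallest_leaf)
```

The unconstrained tree typically ends up with many more leaves, some of them holding very few rows.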

## `DecisionTreeClassifier()` with Custom Hyperparameters

Which hyperparameters can we configure for the Decision Tree algorithm?

In the output below, we can see configurable parameters such as `max_depth`, `criterion` and `min_samples_leaf`, among others.

```
model = DecisionTreeClassifier()
model.get_params()
```

```
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
```

Let's apply a few arbitrary configurations to see how the model's accuracy changes in the train and test sets.

Please pay attention to how the two accuracies become similar when we reduce the model's complexity (we make the tree shorter and more general, so each leaf captures more people).

And remember that we should pick a good configuration based on the test accuracy.

### 1st Configuration

```
model_dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
```

#### Accuracy

##### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.8186567164179105
```

##### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.8215151515151515
```

#### Model Visualization

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

### 2nd Configuration

```
model_dt = DecisionTreeClassifier(max_depth=3)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=3)
```

#### Accuracy

##### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.8207960199004976
```

##### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.8222222222222222
```

#### Model Visualization

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

### 3rd Configuration

```
model_dt = DecisionTreeClassifier(max_depth=4)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=4)
```

#### Accuracy

##### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.8232338308457712
```

##### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.8205050505050505
```

#### Model Visualization

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

### 4th Configuration

```
model_dt = DecisionTreeClassifier(min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(min_samples_leaf=100)
```

#### Accuracy

##### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.8244278606965174
```

##### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.8161616161616162
```

#### Model Visualization

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

### 5th Configuration

```
model_dt = DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
```

#### Accuracy

##### In `train` data

```
model_dt.score(X_train, y_train)
```

```
0.8237313432835821
```

##### In `test` data

```
model_dt.score(X_test, y_test)
```

```
0.8177777777777778
```

#### Model Visualization

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

We get similar results: the accuracy is around 82% on the test set whenever we configure a more general model that doesn't have a considerable depth (unlike the first, default one).
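The manual runs above repeat the same three steps: configure, fit, score. As a sketch, they can be condensed into a loop that collects the train and test accuracies in one table. The data here is synthetic (an assumption for illustration, not the credit dataset), so the numbers will differ from the ones reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# The default configuration plus the five tried by hand in this section
configs = [
    {},
    {'max_depth': 2, 'min_samples_leaf': 150},
    {'max_depth': 3},
    {'max_depth': 4},
    {'min_samples_leaf': 100},
    {'max_depth': 7, 'min_samples_leaf': 100},
]

rows = []
for params in configs:
    model = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    rows.append({
        'params': params,
        'train': model.score(X_tr, y_tr),
        'test': model.score(X_te, y_te),
    })

df_scores = pd.DataFrame(rows)
df_scores['gap'] = df_scores['train'] - df_scores['test']
print(df_scores)
```

The `gap` column makes it easy to spot which configurations overfit.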

But we should ask ourselves another question: can we automate this process of checking multiple combinations of hyperparameters?

- Yes, and that's where **Cross Validation** comes in.

## `GridSearchCV()` to find Best Hyperparameters

The Cross-Validation technique splits the training data into n folds (5 in the image below). Then, it fits each hyperparameter configuration n times, so that each fold is used as the test set exactly once while the remaining folds are used for training.

Consider that we `.fit()` a model as many times as the number of folds multiplied by the number of combinations we want to try.
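To make the mechanics concrete, here is a sketch of 5-fold cross validation written by hand for a single configuration. It uses synthetic data and a plain `KFold` splitter for simplicity; both are assumptions for illustration. When `cv=5` is passed to `GridSearchCV()` with a classifier, scikit-learn uses a stratified variant of this splitter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, valid_idx in kfold.split(X_demo):
    # Train on four folds, evaluate on the fold that was left out
    model = DecisionTreeClassifier(max_depth=3, random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))

# The configuration is judged by its average score across the folds
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Repeating this loop for every hyperparameter combination is what a grid search does for us.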

Out of the Decision Tree hyperparameters:

```
model_dt = DecisionTreeClassifier()
model_dt.get_params()
```

```
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
```

We want to try the following combinations of `max_depth` (6 values), `min_samples_leaf` (7 values) and `criterion` (2 values):

```
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
'criterion': ['gini', 'entropy']
}
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=1)
```

That adds up to 420 calls to the function `.fit()`:

```
5*6*7*2
```

```
420
```

To compare 84 different combinations of the Decision Tree hyperparameters:

```
6*7*2
```

```
84
```
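Instead of multiplying by hand, we can let scikit-learn enumerate the grid. The sketch below uses `ParameterGrid`, which expands a dictionary of lists into every combination; the grid is the one defined above and the number of folds (5) is taken from `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy']
}

# Every combination of the three lists is one candidate configuration
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 84 420
```

This is a handy check before launching a search, since the number of fits grows quickly with every value added to the grid.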

```
cv_dt.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 84 candidates, totalling 420 fits
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]},
verbose=1)
```

If we specify `verbose=2`, the output will show one line for every fit we perform:

```
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=2)
cv_dt.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 84 candidates, totalling 420 fits
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
...
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]},
verbose=2)
```

The best hyperparameter configuration is:

```
cv_dt.best_estimator_
```

```
DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=100)
```

To achieve an accuracy on the test set of:

```
cv_dt.score(X_test, y_test)
```

```
0.8186868686868687
```

If we'd like to have the results of every configuration:

```
df_cv_dt = pd.DataFrame(cv_dt.cv_results_)
df_cv_dt
```
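The full results table has many columns. As a sketch, we can keep the most informative ones and sort the candidates by their rank. The example below runs a small grid on synthetic data (both assumed for illustration) so that it is self-contained; the same column selection can be applied to `df_cv_dt`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=42)

grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 4, None], 'min_samples_leaf': [1, 20]},
    cv=5)
grid_demo.fit(X_demo, y_demo)

# Mean and standard deviation of the fold scores, best candidates first
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
df_rank = pd.DataFrame(grid_demo.cv_results_)[cols].sort_values('rank_test_score')
print(df_rank)
```

Note that `mean_test_score` here refers to the held-out folds inside the training data, not to the separate test set.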

## Other Models

Now let's try to find the best hyperparameter configuration of other models, which don't have the same hyperparameters as the Decision Tree because their algorithm and mathematical equation are different.

### Support Vector Machines `SVC()`

Before computing the Support Vector Machines model, we need to scale the data because this model relies on distances between observations, computed from the explanatory variables. Therefore, the variables all need to be on the same scale.

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```

We need to separate the data again to have the train and test sets with the scaled data:

```
X_norm_train, X_norm_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42)
```
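Note that the scaler above is fitted on all the rows before splitting, so the test rows influence the scaling. A common alternative, which is not what this chapter does, is to wrap the scaler and the model in a `Pipeline` so that the scaler is fitted only on the training part of each fold. The sketch below shows the idea on synthetic data; the data and the step names `scaler` and `svc` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# The scaler becomes part of the model, so it is refitted in every fold
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('svc', SVC()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
grid_pipe = GridSearchCV(
    estimator=pipe,
    param_grid={'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']},
    cv=5)
grid_pipe.fit(X_tr, y_tr)

test_acc = grid_pipe.score(X_te, y_te)
print(grid_pipe.best_params_, test_acc)
```

With this design, unscaled data goes in and the pipeline takes care of scaling both when fitting and when predicting.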

The Support Vector Machines contain the following hyperparameters:

```
from sklearn.svm import SVC
sv = SVC()
sv.get_params()
```

```
{'C': 1.0,
'break_ties': False,
'cache_size': 200,
'class_weight': None,
'coef0': 0.0,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 'scale',
'kernel': 'rbf',
'max_iter': -1,
'probability': False,
'random_state': None,
'shrinking': True,
'tol': 0.001,
'verbose': False}
```

From which we want to try the following combinations:

```
param_grid = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf']
}
cv_sv = GridSearchCV(estimator=sv, param_grid=param_grid, verbose=2)
cv_sv.fit(X_norm_train, y_train)
```

```
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
...
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
verbose=2)
```

We should notice that some fits take more than 5 seconds, which becomes very expensive if we want to try thousands of combinations (as professionals do). Therefore, we should know how the model's algorithm works inside to choose a good hyperparameter grid that doesn't take too much time. Otherwise, we make the company spend a lot of money on computing power.

This tutorial dissects how the Support Vector Machines algorithm works inside.
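One way to limit the cost of a search, different from the exhaustive grid used in this chapter, is `RandomizedSearchCV()`, which tries only a fixed number of randomly chosen combinations. The sketch below runs it on synthetic data; the data, the candidate values and `n_iter=4` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=42)

search = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions={
        'C': [0.01, 0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf'],
    },
    n_iter=4,          # only 4 of the 10 possible combinations are tried
    cv=5,
    random_state=42)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_['params'])
print(n_tried, search.best_params_)
```

The trade-off is that the best combination may not be among the ones sampled. Both search classes also accept `n_jobs=-1` to run the fits in parallel on all available cores.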

The best hyperparameter configuration is:

```
cv_sv.best_estimator_
```

```
SVC(C=10)
```

To achieve an accuracy on the test set of:

```
cv_sv.score(X_norm_test, y_test)
```

```
0.8185858585858586
```

If we'd like to have the results of every configuration:

```
df_cv_sv = pd.DataFrame(cv_sv.cv_results_)
df_cv_sv
```

### `KNeighborsClassifier()`

Now we'll compute another classification model: K Nearest Neighbours.

We check for its hyperparameters:

```
from sklearn.neighbors import KNeighborsClassifier
model_kn = KNeighborsClassifier()
model_kn.get_params()
```

```
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 5,
'p': 2,
'weights': 'uniform'}
```

To choose the following combinations:

```
param_grid = {
'leaf_size': [10, 20, 30, 50],
'metric': ['minkowski', 'euclidean', 'manhattan'],
'n_neighbors': [3, 5, 10, 20]
}
cv_kn = GridSearchCV(estimator=model_kn, param_grid=param_grid, verbose=2)
cv_kn.fit(X_norm_train, y_train)
```

```
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.5s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.3s
...
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s
GridSearchCV(estimator=KNeighborsClassifier(),
param_grid={'leaf_size': [10, 20, 30, 50],
'metric': ['minkowski', 'euclidean', 'manhattan'],
'n_neighbors': [3, 5, 10, 20]},
verbose=2)
```

The best hyperparameter configuration is:

```
cv_kn.best_estimator_
```

```
KNeighborsClassifier(leaf_size=10, n_neighbors=20)
```

To achieve an accuracy on the test set of:

```
cv_kn.score(X_norm_test, y_test)
```

```
0.8185858585858586
```

If we'd like to have the results of every configuration:

```
df_cv_kn = pd.DataFrame(cv_kn.cv_results_)
df_cv_kn
```

## Best Model with Best Hyperparameters

With each algorithm at its best configuration, the Decision Tree Classifier comes out on top, although only by a very small margin:

```
dic_results = {
'model': [
cv_dt.best_estimator_,
cv_sv.best_estimator_,
cv_kn.best_estimator_
],
'hyperparameters': [
cv_dt.best_params_,
cv_sv.best_params_,
cv_kn.best_params_
],
'score': [
cv_dt.score(X_test, y_test),
cv_sv.score(X_norm_test, y_test),
cv_kn.score(X_norm_test, y_test)
]
}
df_cv_comp = pd.DataFrame(dic_results)
df_cv_comp.style.background_gradient()
```
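The same comparison can be built in a loop, adding the cross-validated score (`best_score_`) next to the test score. This is a sketch on synthetic data with two models and small grids, all assumed for illustration; the models are left unscaled here to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Each entry pairs an estimator with the grid we want to search
candidates = {
    'tree': (DecisionTreeClassifier(random_state=42), {'max_depth': [2, 4, 6]}),
    'knn': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 10]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5).fit(X_tr, y_tr)
    rows.append({
        'model': name,
        'best_params': search.best_params_,
        'cv_score': search.best_score_,          # mean accuracy across folds
        'test_score': search.score(X_te, y_te),  # accuracy on held-out rows
    })

df_comp = pd.DataFrame(rows).sort_values('cv_score', ascending=False)
print(df_comp)
```

Looking at both columns helps judge whether a small difference in test accuracy is meaningful or just noise.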
