# #02 | The Decision Tree Classifier & Supervised Classification Models

## This tutorial shows you the step-by-step resolution of possible errors you may get as you develop a Decision Tree Classifier.

Jesús López
·May 6, 2022·

Featured on Hashnode

Subscribe to my newsletter and never miss my upcoming articles

• Introduction to Supervised Classification Models
• How do we compute a Decision Tree Model in Python?
• Data Preprocessing
• Other Classification Models
• Which One Is the Best Model? Why?

Don't miss out on his posts on LinkedIn to become a more efficient Python developer.

## Introduction to Supervised Classification Models

Machine Learning is a field that focuses on getting a mathematical equation to make predictions. Although not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

• Classifiers to predict Categorical Variables
• Regressors to predict Numerical Variables

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

• The Machine wants to get the best numbers of a mathematical equation such that the difference between reality and predictions is minimum:

• Classifier evaluates the model based on prediction success rate $$y \stackrel{?}{=} \hat y$$
• Regressor evaluates the model based on the distance between real data and predictions (residuals) $$y - \hat y$$

There are many Machine Learning Models of each type.

You don't need to know the process behind each model because they all work the same way (see article). In the end, you will choose the one that makes better predictions.

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

Table of Important Content

1. 🛀 How to preprocess/clean the data to fit a Machine Learning model?
• Dummy Variables
• Missing Data
2. 🤩 How to visualize a Decision Tree model in Python step by step?
3. 🤔 How to interpret the nodes and leaf's values of a Decision Tree plot?
4. ⚠️ How to evaluate Classification models?
5. 🏁 How to compare Classification models to choose the best one?

• This dataset represents people (rows) aboard the Titanic
• And their sociological characteristics (columns)
import seaborn as sns #!
import pandas as pd

df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]
df_titanic


## How do we compute a Decision Tree Model in Python?

We should know from the previous chapter that we need a function accessible from a Class in the library sklearn.

### Import the Class

from sklearn.tree import DecisionTreeClassifier


### Instantiante the Class

To create a copy of the original's code blueprint to not "modify" the source code.

model_dt = DecisionTreeClassifier()


### Access the Function

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

model_dt.fit()

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in <module>
----> 1 model_dt.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'


Why is it asking for two parameters: y and X?

• y: target ~ independent ~ label ~ class variable
• X: explanatory ~ dependent ~ feature variables

### Separate the Variables

target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')


### Fit the Model

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: could not convert string to float: 'male'


Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

## Data Preprocessing

The error says:

ValueError: could not convert string to float: 'male'


From which we can interpret that the function .fit() does not accept values of string type like the ones in sex column:

df_titanic


### Dummy Variables

Therefore, we need to convert the categorical columns to dummies (0s & 1s):

pd.get_dummies(df_titanic, drop_first=True)


df_titanic = pd.get_dummies(df_titanic, drop_first=True)


We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')
target = df_titanic[['survived']]


### Fit the Model Again

Now we should be able to fit the model:

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').


### Missing Data

The data passed to the function contains missing data (NaN). Precisely 177 people from which we don't have the age:

df_titanic.isna()


df_titanic.isna().sum()

survived          0
age             177
sex_male          0
embarked_Q        0
embarked_S        0
class_Second      0
class_Third       0
dtype: int64


Who are the people who lack the information?

mask_na = df_titanic.isna().sum(axis=1) > 0

df_titanic[mask_na]


What could we do with them?

1. Drop the people (rows) who miss the age from the dataset.
2. Fill the age by the average age of other combinations (like males who survived)
3. Apply an algorithm to fill them.

We'll choose option 1 to simplify the tutorial.

Therefore, we go from 891 people:

df_titanic


To 714 people:

df_titanic.dropna()


df_titanic = df_titanic.dropna()


We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']


Now we shouldn't have any more trouble with the data to fit the model.

### Fit the Model Again

We don't get any errors because we correctly preprocess the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

model_dt.fit(X=explanatory, y=target)
model_dt.__dict__

{'criterion': 'gini',
'splitter': 'best',
'max_depth': None,
'min_samples_split': 2,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.0,
'max_features': None,
'max_leaf_nodes': None,
'random_state': None,
'min_impurity_decrease': 0.0,
'class_weight': None,
'ccp_alpha': 0.0,
'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second',
'class_Third'], dtype=object),
'n_features_in_': 6,
'n_outputs_': 1,
'classes_': array([0, 1]),
'n_classes_': 2,
'max_features_': 6,
'tree_': <sklearn.tree._tree.Tree at 0x16612cce0>}


### Predictions

#### Calculate Predictions

We have a fitted DecisionTreeClassifier. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718],
[0.05660377, 0.94339623],
[0.53921569, 0.46078431],
[0.05660377, 0.94339623],
[0.82051282, 0.17948718]])


#### Add a New Column with the Predictions

Let's create a new DataFrame to keep the information of the target and predictions to understand the topic better:

df_pred = df_titanic[['survived']].copy()


df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)
df_pred


How have we calculated those predictions?

### Model Visualization

The Decision Tree model doesn't specifically have a mathematical equation. But instead, a set of conditions is represented in a tree:

from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt);


There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)

plot_tree(decision_tree=model_dt);


Let's make the image bigger:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt);


The conditions are X[2]<=0.5. The X[2] means the 3rd variable (Python starts counting at 0) of the explanatory ones. If we'd like to see the names of the columns, we need to add the feature_names parameter:

explanatory.columns

Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second',
'class_Third'],
dtype='object')

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);


Let's add some colours to see how the predictions will go based on the fulfilled conditions:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);


### How does the Decision Tree Algorithm computes the Mathematical Equation?

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

### Model Interpretation

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who took the titanic on 3rd class but didn't survive:

df_titanic[:1]


To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

explanatory[:1]


To the function .predict_proba() and get a probability of 17.94%:

model_dt.predict_proba(X=explanatory[:1])

array([[0.82051282, 0.17948718]])


But wait, how did we get to the probability of survival of 17.94%?

Let's explain it step-by-step with the Decision Tree visualization:

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);


Based on the tree, the conditions are:

#### 1st condition

• sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

#### 2nd condition

• age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

#### Leaf

The ultimate node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years old.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

77/429

0.1794871794871795


We get the same probability; John had a 17.94% chance of surviving the Titanic accident.

### Model's Score

#### Calculate the Score

As always, we should have a function to calculate the goodness of the model:

model_dt.score(X=explanatory, y=target)

0.8025210084033614


The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

#### The Score Step-by-step

As we saw earlier, the classification model calculates the probability for an event to occur. The function .predict_proba() gives us two probabilities in the columns: people who didn't survive (0) and people who survived (1).

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718],
[0.05660377, 0.94339623],
[0.53921569, 0.46078431],
[0.05660377, 0.94339623],
[0.82051282, 0.17948718]])


We take the positive probabilities in the second column:

df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]


At the time to compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and 0 otherwise.

import numpy as np

df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred


The simple idea of the accuracy is to get the success rate on the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

comp = df_pred.survived == df_pred.pred_dt
comp

0       True
1       True
...
889    False
890     True
Length: 714, dtype: bool


If we sum the boolean Series, Python will take True as 1 and 0 as False to compute the number of correct classifications:

comp.sum()

573


We get the score by dividing the successes by all possibilities (the total number of people):

comp.sum()/len(comp)

0.8025210084033614


It is also correct to do the mean on the comparisons because it's the sum divided by the total. Observe how you get the same number:

comp.mean()

0.8025210084033614


But it's more efficient to calculate this metric with the function .score():

model_dt.score(X=explanatory, y=target)

0.8025210084033614


### The Confusion Matrix to Compute Other Classification Metrics

Can we think that our model is 80.25% of good and be happy with it?

• We should not because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt
)

CM = ConfusionMatrixDisplay(cm)
CM.plot();


1. Looking at the first number of the confusion matrix, we have 407 people who didn't survive the Titanic in reality and the predictions.
2. It is not the case with the number 17. Our model classified 17 people as survivors when they didn't.
3. The success rate of the negative class, people who didn't survive, is called the specificity: $407/(407+17)$.
4. Whereas the success rate of the positive class, people who did survive, is called the sensitivity: $166/(166+124)$.

#### Specificity (Recall=0)

cm[0,0]

407

cm[0,:]

array([407,  17])

cm[0,0]/cm[0,:].sum()

0.9599056603773585

sensitivity = cm[0,0]/cm[0,:].sum()


#### Sensitivity (Recall=1)

cm[1,1]

166

cm[1,:]

array([124, 166])

cm[1,1]/cm[1,:].sum()

0.5724137931034483

sensitivity = cm[1,1]/cm[1,:].sum()


#### Classification Report

We could have gotten the same metrics using the function classification_report(). Look a the recall (column) of rows 0 and 1, specificity and sensitivity, respectively:

from sklearn.metrics import classification_report

report = classification_report(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt
)

print(report)

              precision    recall  f1-score   support

0       0.77      0.96      0.85       424
1       0.91      0.57      0.70       290

accuracy                           0.80       714
macro avg       0.84      0.77      0.78       714
weighted avg       0.82      0.80      0.79       714


We can also create a nice DataFrame to later use the data for simulations:

report = classification_report(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt,
output_dict=True
)

pd.DataFrame(report)


Our model is not as good as we thought if we predict the people who survived; we get 57.24% of survivors right.

How can we then assess a reasonable rate for our model?

#### ROC Curve

Watch the following video to understand how the Area Under the Curve (AUC) is a good metric because it sort of combines accuracy, specificity and sensitivity:

We compute this metric in Python as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)

display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
estimator_name='example estimator')
display.plot()
plt.show()


roc_auc

0.8205066688353937


## Other Classification Models

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

### RandomForestClassifier() in Python

#### Fit the Model

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)

RandomForestClassifier()


#### Calculate Predictions

df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred


#### Model's Score

model_rf.score(X=explanatory, y=target)

0.9117647058823529


### SVC() in Python

#### Fit the Model

from sklearn.svm import SVC

model_sv = SVC()
model_sv.fit(X=explanatory, y=target)

SVC()


#### Calculate Predictions

df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred


#### Model's Score

model_sv.score(X=explanatory, y=target)

0.6190476190476191


## Which One Is the Best Model? Why?

To simplify the explanation, we use accuracy as the metric to compare the models. We have the Random Forest as the best model with an accuracy of 91.17%.

model_dt.score(X=explanatory, y=target)

0.8025210084033614

model_rf.score(X=explanatory, y=target)

0.9117647058823529

model_sv.score(X=explanatory, y=target)

0.6190476190476191

df_pred.head(10)