Resolving Decision Tree's Common Mistakes using Python

© Jesús López 2022


Introduction to Supervised Classification Models

Machine Learning is a field that focuses on finding the mathematical equation that makes the best predictions. However, not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

Classifiers to predict Categorical Variables
Regressors to predict Numerical Variables

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

The Machine looks for the best numbers of a mathematical equation, such that the difference between reality and predictions is minimal:
- A Classifier evaluates the model based on the prediction success rate: $y \stackrel{?}{=} \hat{y}$
- A Regressor evaluates the model based on the distance between real data and predictions (residuals): $y - \hat{y}$
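The two evaluation ideas above can be sketched with a few lines of NumPy. The values below are hypothetical and invented only for illustration; they don't come from the Titanic data.

```python
import numpy as np

# Hypothetical toy values, invented only for this illustration.
y_class_real = np.array([0, 1, 1, 0, 1])
y_class_pred = np.array([0, 1, 0, 0, 1])

# Classifier: share of predictions that match reality (success rate).
success_rate = (y_class_real == y_class_pred).mean()

y_num_real = np.array([10.0, 12.5, 9.0])
y_num_pred = np.array([11.0, 12.0, 9.5])

# Regressor: distance between reality and predictions (residuals).
residuals = y_num_real - y_num_pred

print(success_rate)  # 0.8 (4 out of 5 predictions are right)
print(residuals)
```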

There are many Machine Learning Models of each type.

You don't need to know the process behind each model to use it because you apply all of them with the same steps (see article). In the end, you will choose the one that makes better predictions.

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

Table of Important Content

🛀 How to preprocess/clean the data to fit a Machine Learning model?
- Dummy Variables
- Missing Data
🤩 How to visualize a Decision Tree model in Python step by step?
🤔 How to interpret the nodes and leaf's values of a Decision Tree plot?
⚠️ How to evaluate Classification models?
- Accuracy
- Confusion Matrix
  - Sensitivity
  - Specificity
  - ROC Curve
🏁 How to compare Classification models to choose the best one?

Load the Data

This dataset represents people (rows) aboard the Titanic
And their sociological characteristics (columns)

import seaborn as sns
import pandas as pd

df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]
df_titanic

How do we compute a Decision Tree Model in Python?

We should know from the previous chapter that we need a function accessible from a Class in the library sklearn.

Import the Class

from sklearn.tree import DecisionTreeClassifier

Instantiate the Class

We instantiate the class to create a copy of the original code blueprint, so that we don't "modify" the source code.

model_dt = DecisionTreeClassifier()

Access the Function

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

model_dt.fit()

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in ----> 1 model_dt.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

Why is it asking for two parameters: y and X?

y: target ~ dependent ~ label ~ class variable
X: explanatory ~ independent ~ feature variables

Separate the Variables

target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')

Fit the Model

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: could not convert string to float: 'male'

Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

Data Preprocessing

The error says:

ValueError: could not convert string to float: 'male'

From this, we can interpret that the function .fit() does not accept values of string type, like the ones in the sex column:

df_titanic

Dummy Variables

Therefore, we need to convert the categorical columns to dummies (0s & 1s):

pd.get_dummies(df_titanic, drop_first=True)

df_titanic = pd.get_dummies(df_titanic, drop_first=True)
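To see what the conversion does on a smaller scale, here is a minimal sketch with a hypothetical three-row table (df_toy is invented for illustration). Depending on the pandas version, the dummy column may contain True/False instead of 1/0; both work as numbers.

```python
import pandas as pd

# Hypothetical three-row table, invented for illustration.
df_toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'sex': ['male', 'female', 'female'],
})

# drop_first=True drops the first category ('female'), so one column
# is enough: 1 (or True) means male, 0 (or False) means female.
df_dummies = pd.get_dummies(df_toy, drop_first=True)

print(df_dummies.columns.tolist())  # ['age', 'sex_male']
print(df_dummies)
```

Numerical columns such as age stay untouched; only the text columns are replaced.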

We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']

Fit the Model Again

Now we should be able to fit the model:

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Missing Data

The data passed to the function contains missing data (NaN). Precisely, we don't have the age of 177 people:

df_titanic.isna()

df_titanic.isna().sum()

survived          0
age             177
sex_male          0
embarked_Q        0
embarked_S        0
class_Second      0
class_Third       0
dtype: int64

Who are the people who lack the information?

mask_na = df_titanic.isna().sum(axis=1) > 0

df_titanic[mask_na]

What could we do with them?

1. Drop the people (rows) whose age is missing from the dataset.
2. Fill the age with the average age of other combinations (like males who survived).
3. Apply an algorithm to fill them.

We'll choose option 1 to simplify the tutorial.
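For reference, the second option listed above (filling the age with the average of a group) could look like the following sketch. The table df_toy and its values are hypothetical; the grouping by sex_male and survived follows the "males who survived" example, and it is only one of many possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration.
df_toy = pd.DataFrame({
    'survived': [0, 0, 1, 1, 0],
    'sex_male': [1, 1, 0, 0, 1],
    'age': [20.0, np.nan, 30.0, np.nan, 40.0],
})

# Average age within each (sex_male, survived) combination,
# aligned row by row with the original table.
group_mean = df_toy.groupby(['sex_male', 'survived'])['age'].transform('mean')

# Replace only the missing ages with the average of their group.
df_filled = df_toy.copy()
df_filled['age'] = df_filled['age'].fillna(group_mean)

print(df_filled)
```

The known ages stay as they are; only the NaN values take the average of their group.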

Therefore, we go from 891 people:

df_titanic

To 714 people:

df_titanic.dropna()

df_titanic = df_titanic.dropna()

We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']

Now we shouldn't have any more trouble with the data to fit the model.

Fit the Model Again

We don't get any errors because we have correctly preprocessed the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

model_dt.fit(X=explanatory, y=target)
model_dt.__dict__

{'criterion': 'gini', 'splitter': 'best', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': None, 'max_leaf_nodes': None, 'random_state': None, 'min_impurity_decrease': 0.0, 'class_weight': None, 'ccp_alpha': 0.0, 'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype=object), 'n_features_in_': 6, 'n_outputs_': 1, 'classes_': array([0, 1]), 'n_classes_': 2, 'max_features_': 6, 'tree_': <sklearn.tree._tree.Tree at 0x16612cce0>}


Predictions

Calculate Predictions

We have a fitted DecisionTreeClassifier. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718],
       [0.05660377, 0.94339623],
       [0.53921569, 0.46078431],
       [0.05660377, 0.94339623],
       [0.82051282, 0.17948718]])

Add a New Column with the Predictions

Let's create a new DataFrame to keep the information of the target and predictions to understand the topic better:

df_pred = df_titanic[['survived']].copy()

And add the predictions:

df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:,1]
df_pred

How have we calculated those predictions?

Model Visualization

The Decision Tree model doesn't have a mathematical equation in the strict sense. Instead, it uses a set of conditions represented as a tree:

from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt);

There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)

plot_tree(decision_tree=model_dt);

Let's make the image bigger:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt);

The conditions look like X[1]<=0.5. The X[1] means the 2nd variable (Python starts counting at 0) of the explanatory ones, which is sex_male. If we'd like to see the names of the columns, we need to add the feature_names parameter:

explanatory.columns

Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype='object')

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);

Let's add some colours to see how the predictions will go based on the fulfilled conditions:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
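If a plot is not convenient, the same conditions can be printed as text with export_text from sklearn.tree. The sketch below uses a hypothetical toy table (X_toy, y_toy) instead of the Titanic data, so the printed rules will differ from the plot above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data, invented for illustration.
X_toy = pd.DataFrame({
    'age': [4.0, 30.0, 25.0, 40.0, 3.0, 35.0],
    'sex_male': [1, 1, 0, 0, 0, 1],
})
y_toy = [1, 0, 1, 1, 1, 0]

model_toy = DecisionTreeClassifier(max_depth=2, random_state=0)
model_toy.fit(X=X_toy, y=y_toy)

# One line per condition; the leaves show the predicted class.
rules = export_text(model_toy, feature_names=list(X_toy.columns))
print(rules)
```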

How does the Decision Tree Algorithm compute the Mathematical Equation?

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

https://www.youtube.com/watch?v=_L39rN6gz7Y

Model Interpretation

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who boarded the Titanic in 3rd class but didn't survive:

df_titanic[:1]

To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

explanatory[:1]

To the function .predict_proba() and get a probability of 17.95%:

model_dt.predict_proba(X=explanatory[:1])

array([[0.82051282, 0.17948718]])

But wait, how did we get to the probability of survival of 17.95%?

Let's explain it step-by-step with the Decision Tree visualization:

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);

Based on the tree, the conditions are:

1st condition

sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

2nd condition

age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

Leaf

The ultimate node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years old.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

77/429

0.1794871794871795

We get the same probability; John had a 17.95% chance of surviving the Titanic accident.
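The leaf calculation can also be checked in code. The following is a minimal sketch with hypothetical toy data (X_toy, y_toy): it finds the leaf reached by the first row with .apply(), computes the share of survivors among the training rows in that same leaf, and compares it with .predict_proba().

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns are sex_male and age.
X_toy = np.array([[1, 22.0], [1, 30.0], [1, 40.0], [1, 4.0],
                  [0, 25.0], [0, 35.0], [1, 28.0], [0, 50.0]])
y_toy = np.array([0, 0, 1, 1, 1, 1, 0, 0])

model_toy = DecisionTreeClassifier(max_depth=2, random_state=0)
model_toy.fit(X_toy, y_toy)

# Leaf reached by each training row.
leaf_ids = model_toy.apply(X_toy)

# Training rows that end in the same leaf as the first row.
same_leaf = leaf_ids == leaf_ids[0]

# Survivors divided by the people in that leaf.
manual_proba = y_toy[same_leaf].mean()

# Probability of class 1 given by the model for the first row.
model_proba = model_toy.predict_proba(X_toy[:1])[0, 1]

print(manual_proba, model_proba)
```

Both numbers match because the probability in a Decision Tree is the proportion of each class among the training rows of the leaf.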

Model's Score

Calculate the Score

As always, we should have a function to calculate the goodness of the model:

model_dt.score(X=explanatory, y=target)

0.8025210084033614

The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

The Score Step-by-step

As we saw earlier, the classification model calculates the probability for an event to occur. The function .predict_proba() gives us two probabilities in the columns: people who didn't survive (0) and people who survived (1).

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718],
       [0.05660377, 0.94339623],
       [0.53921569, 0.46078431],
       [0.05660377, 0.94339623],
       [0.82051282, 0.17948718]])

We take the positive probabilities in the second column:

df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]

To compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and the rest into 0.

import numpy as np

df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred

The simple idea of the accuracy is to get the success rate on the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

comp = df_pred.survived == df_pred.pred_dt
comp

0       True
1       True
       ...
889    False
890     True
Length: 714, dtype: bool

If we sum the boolean Series, Python will take True as 1 and False as 0 to compute the number of correct classifications:

comp.sum()

573

We get the score by dividing the successes by all possibilities (the total number of people):

comp.sum()/len(comp)

0.8025210084033614

We can also take the mean of the comparisons because the mean is the sum divided by the total. Observe how you get the same number:

comp.mean()

0.8025210084033614

But it's more convenient to calculate this metric with the function .score():

model_dt.score(X=explanatory, y=target)

0.8025210084033614
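The manual steps and the library functions can be compared side by side. This is a minimal sketch with hypothetical toy data and a one-condition tree; accuracy_score comes from sklearn.metrics, and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one explanatory column (sex_male).
X_toy = np.array([[1], [1], [1], [0], [0], [0], [1], [0]])
y_toy = np.array([0, 0, 1, 1, 1, 0, 0, 1])

model_toy = DecisionTreeClassifier(max_depth=1, random_state=0)
model_toy.fit(X_toy, y_toy)

# Manual path: probabilities -> 0/1 predictions -> success rate.
proba = model_toy.predict_proba(X_toy)[:, 1]
pred_manual = np.where(proba > 0.5, 1, 0)
score_manual = (y_toy == pred_manual).mean()

# Library path: .predict() and .score() do the same in one call each.
pred_library = model_toy.predict(X_toy)
score_library = model_toy.score(X_toy, y_toy)

print(score_manual, score_library)            # 0.75 0.75
print(accuracy_score(y_toy, pred_library))    # 0.75
```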

The Confusion Matrix to Compute Other Classification Metrics

Can we conclude that our model is 80.25% good and be happy with it?

We should not, because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)

CM = ConfusionMatrixDisplay(cm)
CM.plot();

Looking at the first number of the confusion matrix, we have 407 people who didn't survive the Titanic, both in reality and in the predictions.
It is not the case with the number 17. Our model classified 17 people as survivors when they didn't survive.
The success rate of the negative class, people who didn't survive, is called the specificity: $407/(407+17)$.
Whereas the success rate of the positive class, people who did survive, is called the sensitivity: $166/(166+124)$.

Specificity (Recall=0)

cm[0,0]

407

cm[0,:]

array([407, 17])

cm[0,0]/cm[0,:].sum()

0.9599056603773585

specificity = cm[0,0]/cm[0,:].sum()

Sensitivity (Recall=1)

cm[1,1]

166

cm[1,:]

array([124, 166])

cm[1,1]/cm[1,:].sum()

0.5724137931034483

sensitivity = cm[1,1]/cm[1,:].sum()
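As a compact recap, the four cells of the confusion matrix can be unpacked at once with .ravel(). The sketch below types in the counts reported above (407, 17, 124, 166) by hand instead of recomputing them; the names tn, fp, fn and tp are the usual abbreviations for true negatives, false positives, false negatives and true positives.

```python
import numpy as np

# Counts taken from the confusion matrix reported in this section.
cm = np.array([[407, 17],
               [124, 166]])

# Rows are reality, columns are predictions.
tn, fp, fn, tp = cm.ravel()

specificity = tn / (tn + fp)      # success rate of the negative class
sensitivity = tp / (tp + fn)      # success rate of the positive class
accuracy = (tn + tp) / cm.sum()   # success rate of both classes together

print(specificity)  # 0.9599056603773585
print(sensitivity)  # 0.5724137931034483
print(accuracy)     # 0.8025210084033614
```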

Classification Report

We could have gotten the same metrics using the function classification_report(). Look at the recall (column) of rows 0 and 1, specificity and sensitivity, respectively:

from sklearn.metrics import classification_report

report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)

print(report)

              precision    recall  f1-score   support

           0       0.77      0.96      0.85       424
           1       0.91      0.57      0.70       290

    accuracy                           0.80       714
   macro avg       0.84      0.77      0.78       714
weighted avg       0.82      0.80      0.79       714

We can also create a nice DataFrame to later use the data for simulations:

report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt,
    output_dict=True
)

pd.DataFrame(report)

Our model is not as good as we thought when it comes to predicting the people who survived: we only get 57.24% of the survivors right.

How can we then assess our model with a single, reasonable metric?

ROC Curve

Watch the following video to understand why the Area Under the Curve (AUC) is a good metric: it summarizes the balance between sensitivity and specificity across all the possible thresholds, not only 0.5:

https://www.youtube.com/watch?v=4jRBRDbJemM

We compute this metric in Python as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)

display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()
plt.show()

roc_auc

0.8205066688353937
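If we only need the number and not the plot, roc_auc_score from sklearn.metrics gives the same area in one step. The sketch below uses hypothetical labels and probabilities (y_real, y_proba) invented for illustration, so the value differs from the one above.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities.
y_real = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Two steps: points of the curve first, then the area under them.
fpr, tpr, thresholds = roc_curve(y_real, y_proba)
auc_two_steps = auc(fpr, tpr)

# One step: the area directly from labels and probabilities.
auc_one_step = roc_auc_score(y_real, y_proba)

print(auc_two_steps, auc_one_step)  # 0.875 0.875
```

Each point of the curve pairs the sensitivity (tpr) with 1 minus the specificity (fpr) for one threshold.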

Other Classification Models

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

`RandomForestClassifier()` in Python

Fit the Model

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)

RandomForestClassifier()

Calculate Predictions

df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred

Model's Score

model_rf.score(X=explanatory, y=target)

0.9117647058823529

`SVC()` in Python

Fit the Model

from sklearn.svm import SVC

model_sv = SVC()
model_sv.fit(X=explanatory, y=target)

SVC()

Calculate Predictions

df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred

Model's Score

model_sv.score(X=explanatory, y=target)

0.6190476190476191

Which One Is the Best Model? Why?

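To simplify the explanation, we use accuracy as the metric to compare the models. We have the Random Forest as the best model with an accuracy of 91.18%. Keep in mind that we calculated these scores on the same data we used to fit the models, so they tell us how well each model reproduces the data it has already seen.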

model_dt.score(X=explanatory, y=target)

0.8025210084033614

model_rf.score(X=explanatory, y=target)

0.9117647058823529

model_sv.score(X=explanatory, y=target)

0.6190476190476191

df_pred.head(10)
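Since every model follows the same steps (.fit() and .score()), the comparison can be written as a loop. This is a minimal sketch on hypothetical random data (X_toy, y_toy), not on the Titanic table, so the scores and the winner are not comparable with the ones above; the names in the dictionary are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 200 rows, 3 numerical explanatory variables.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

models = {
    'dt': DecisionTreeClassifier(max_depth=2, random_state=0),
    'rf': RandomForestClassifier(random_state=0),
    'sv': SVC(),
}

# Same two functions for every model: fit, then score (accuracy).
scores = {}
for name, model in models.items():
    model.fit(X=X_toy, y=y_toy)
    scores[name] = model.score(X=X_toy, y=y_toy)

best = max(scores, key=scores.get)
print(scores)
print(best)
```

As in the tutorial, the accuracy here is measured on the data used to fit the models, which tends to favour the most flexible ones.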

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
