Follow

Follow

# #02 | The Decision Tree Classifier & Supervised Classification Models

## This tutorial shows you the step-by-step resolution of possible errors you may get as you develop a Decision Tree Classifier.

Jesús López
·May 6, 2022·

Featured on Hashnode • Introduction to Supervised Classification Models
• How do we compute a Decision Tree Model in Python?
• Data Preprocessing
• Other Classification Models
• Which One Is the Best Model? Why?

Don't miss out on his posts on LinkedIn to become a more efficient Python developer.

## Introduction to Supervised Classification Models

Machine Learning is a field that focuses on getting a mathematical equation to make predictions. Although not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

• Classifiers to predict Categorical Variables

• Regressors to predict Numerical Variables

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

• The Machine wants to get the best numbers of a mathematical equation such that the difference between reality and predictions is minimum:

• Classifier evaluates the model based on prediction success rate y=?y^

• Regressor evaluates the model based on the distance between real data and predictions (residuals) y−y^

There are many Machine Learning Models of each type.

You don't need to know the process behind each model because they all work the same way (see article). In the end, you will choose the one that makes better predictions.

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

Table of Important Content

• Dummy Variables

• Missing Data

1. ⚠️ How to evaluate Classification models?

• This dataset represents people (rows) aboard the Titanic

• And their sociological characteristics (columns)

``````import seaborn as sns #!
import pandas as pd

df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]
df_titanic
`````` ## How do we compute a Decision Tree Model in Python?

We should know from the previous chapter that we need a function accessible from a Class in the library `sklearn`.

### Import the Class

``````from sklearn.tree import DecisionTreeClassifier
``````

### Instantiante the Class

To create a copy of the original's code blueprint to not "modify" the source code.

``````model_dt = DecisionTreeClassifier()
``````

### Access the Function

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

``````model_dt.fit()
``````

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in ----> 1 model_dt.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

Why is it asking for two parameters: `y` and `X`?

• `y`: target ~ independent ~ label ~ class variable

• `X`: explanatory ~ dependent ~ feature variables

### Separate the Variables

``````target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')
``````

### Fit the Model

``````model_dt.fit(X=explanatory, y=target)
``````

---------------------------------------------------------------------------

ValueError: could not convert string to float: 'male'

Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

## Data Preprocessing

The error says:

``````ValueError: could not convert string to float: 'male'
``````

From which we can interpret that the function `.fit()` does not accept values of `string` type like the ones in `sex` column:

``````df_titanic
`````` ### Dummy Variables

Therefore, we need to convert the categorical columns to dummies (0s & 1s):

``````pd.get_dummies(df_titanic, drop_first=True)
`````` ``````df_titanic = pd.get_dummies(df_titanic, drop_first=True)
``````

We separate the variables again to take into account the latest modification:

``````explanatory = df_titanic.drop(columns='survived')
target = df_titanic[['survived']]
``````

### Fit the Model Again

Now we should be able to fit the model:

``````model_dt.fit(X=explanatory, y=target)
``````

---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

### Missing Data

The data passed to the function contains missing data (`NaN`). Precisely 177 people from which we don't have the age:

``````df_titanic.isna()
`````` ``````df_titanic.isna().sum()
``````

survived 0 age 177 sex_male 0 embarked_Q 0 embarked_S 0 class_Second 0 class_Third 0 dtype: int64

Who are the people who lack the information?

``````mask_na = df_titanic.isna().sum(axis=1) > 0
``````
``````df_titanic[mask_na]
`````` What could we do with them?

1. Drop the people (rows) who miss the age from the dataset.

2. Fill the age by the average age of other combinations (like males who survived)

3. Apply an algorithm to fill them.

We'll choose option 1 to simplify the tutorial.

Therefore, we go from 891 people:

``````df_titanic
`````` To 714 people:

``````df_titanic.dropna()
`````` ``````df_titanic = df_titanic.dropna()
``````

We separate the variables again to take into account the latest modification:

``````explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']
``````

Now we shouldn't have any more trouble with the data to fit the model.

### Fit the Model Again

We don't get any errors because we correctly preprocess the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

``````model_dt.fit(X=explanatory, y=target)
model_dt.__dict__
``````

{'criterion': 'gini', 'splitter': 'best', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': None, 'max_leaf_nodes': None, 'random_state': None, 'min_impurity_decrease': 0.0, 'class_weight': None, 'ccp_alpha': 0.0, 'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype=object), 'n_features_in_': 6, 'n_outputs_': 1, 'classes_': array([0, 1]), 'n_classes_': 2, 'max_features_': 6, 'tree_': <sklearn.tree._tree.Tree at 0x16612cce0>}

### Predictions

#### Calculate Predictions

We have a fitted `DecisionTreeClassifier`. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

``````model_dt.predict_proba(X=explanatory)[:5]
``````

array([[0.82051282, 0.17948718], [0.05660377, 0.94339623], [0.53921569, 0.46078431], [0.05660377, 0.94339623], [0.82051282, 0.17948718]])

#### Add a New Column with the Predictions

Let's create a new `DataFrame` to keep the information of the target and predictions to understand the topic better:

``````df_pred = df_titanic[['survived']].copy()
``````

``````df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:,1]
df_pred
`````` How have we calculated those predictions?

### Model Visualization

The Decision Tree model doesn't specifically have a mathematical equation. But instead, a set of conditions is represented in a tree:

``````from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt);
`````` There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

``````model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)

plot_tree(decision_tree=model_dt);
`````` Let's make the image bigger:

``````import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt);
`````` The conditions are `X<=0.5`. The `X` means the 3rd variable (Python starts counting at 0) of the explanatory ones. If we'd like to see the names of the columns, we need to add the `feature_names` parameter:

``````explanatory.columns
``````

Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype='object')

``````import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);
`````` Let's add some colours to see how the predictions will go based on the fulfilled conditions:

``````import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
`````` ### How does the Decision Tree Algorithm computes the Mathematical Equation?

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

### Model Interpretation

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who took the titanic on 3rd class but didn't survive:

``````df_titanic[:1]
`````` To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

``````explanatory[:1]
`````` To the function `.predict_proba()` and get a probability of 17.94%:

``````model_dt.predict_proba(X=explanatory[:1])
``````

array([[0.82051282, 0.17948718]])

But wait, how did we get to the probability of survival of 17.94%?

Let's explain it step-by-step with the Decision Tree visualization:

``````plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
`````` Based on the tree, the conditions are:

#### 1st condition

• sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

#### 2nd condition

• age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

#### Leaf

The ultimate node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years old.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

``````77/429
``````

0.1794871794871795

We get the same probability; John had a 17.94% chance of surviving the Titanic accident.

### Model's Score

#### Calculate the Score

As always, we should have a function to calculate the goodness of the model:

``````model_dt.score(X=explanatory, y=target)
``````

0.8025210084033614

The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

#### The Score Step-by-step

As we saw earlier, the classification model calculates the probability for an event to occur. The function `.predict_proba()` gives us two probabilities in the columns: people who didn't survive (0) and people who survived (1).

``````model_dt.predict_proba(X=explanatory)[:5]
``````

array([[0.82051282, 0.17948718], [0.05660377, 0.94339623], [0.53921569, 0.46078431], [0.05660377, 0.94339623], [0.82051282, 0.17948718]])

We take the positive probabilities in the second column:

``````df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]
``````

At the time to compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and 0 otherwise.

``````import numpy as np

df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred
`````` The simple idea of the accuracy is to get the success rate on the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

``````comp = df_pred.survived == df_pred.pred_dt
comp
``````

0 True 1 True ...
889 False 890 True Length: 714, dtype: bool

If we sum the boolean Series, Python will take True as 1 and 0 as False to compute the number of correct classifications:

``````comp.sum()
``````

573

We get the score by dividing the successes by all possibilities (the total number of people):

``````comp.sum()/len(comp)
``````

0.8025210084033614

It is also correct to do the mean on the comparisons because it's the sum divided by the total. Observe how you get the same number:

``````comp.mean()
``````

0.8025210084033614

But it's more efficient to calculate this metric with the function `.score()`:

``````model_dt.score(X=explanatory, y=target)
``````

0.8025210084033614

### The Confusion Matrix to Compute Other Classification Metrics

Can we think that our model is 80.25% of good and be happy with it?

• We should not because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:
``````from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt
)

CM = ConfusionMatrixDisplay(cm)
CM.plot();
`````` 1. Looking at the first number of the confusion matrix, we have 407 people who didn't survive the Titanic in reality and the predictions.

2. It is not the case with the number 17. Our model classified 17 people as survivors when they didn't.

3. The success rate of the negative class, people who didn't survive, is called the specificity: \$407/(407+17)\$.

4. Whereas the success rate of the positive class, people who did survive, is called the sensitivity: \$166/(166+124)\$.

#### Specificity (Recall=0)

``````cm[0,0]
``````

407

``````cm[0,:]
``````

array([407, 17])

``````cm[0,0]/cm[0,:].sum()
``````

0.9599056603773585

``````sensitivity = cm[0,0]/cm[0,:].sum()
``````

#### Sensitivity (Recall=1)

``````cm[1,1]
``````

166

``````cm[1,:]
``````

array([124, 166])

``````cm[1,1]/cm[1,:].sum()
``````

0.5724137931034483

``````sensitivity = cm[1,1]/cm[1,:].sum()
``````

#### Classification Report

We could have gotten the same metrics using the function `classification_report()`. Look a the recall (column) of rows 0 and 1, specificity and sensitivity, respectively:

``````from sklearn.metrics import classification_report

report = classification_report(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt
)

print(report)
``````

precision recall f1-score support

0 0.77 0.96 0.85 424 1 0.91 0.57 0.70 290

accuracy 0.80 714 macro avg 0.84 0.77 0.78 714 weighted avg 0.82 0.80 0.79 714

We can also create a nice `DataFrame` to later use the data for simulations:

``````report = classification_report(
y_true=df_pred.survived,
y_pred=df_pred.pred_dt,
output_dict=True
)

pd.DataFrame(report)
`````` Our model is not as good as we thought if we predict the people who survived; we get 57.24% of survivors right.

How can we then assess a reasonable rate for our model?

#### ROC Curve

Watch the following video to understand how the Area Under the Curve (AUC) is a good metric because it sort of combines accuracy, specificity and sensitivity:

We compute this metric in Python as follows:

``````import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)

display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
estimator_name='example estimator')
display.plot()
plt.show()
`````` ``````roc_auc
``````

0.8205066688353937

## Other Classification Models

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

### `RandomForestClassifier()` in Python

#### Fit the Model

``````from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)
``````

RandomForestClassifier()

#### Calculate Predictions

``````df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred
`````` #### Model's Score

``````model_rf.score(X=explanatory, y=target)
``````

0.9117647058823529

### `SVC()` in Python

#### Fit the Model

``````from sklearn.svm import SVC

model_sv = SVC()
model_sv.fit(X=explanatory, y=target)
``````

SVC()

#### Calculate Predictions

``````df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred
`````` #### Model's Score

``````model_sv.score(X=explanatory, y=target)
``````

0.6190476190476191

## Which One Is the Best Model? Why?

To simplify the explanation, we use accuracy as the metric to compare the models. We have the Random Forest as the best model with an accuracy of 91.17%.

``````model_dt.score(X=explanatory, y=target)
``````

0.8025210084033614

``````model_rf.score(X=explanatory, y=target)
``````

0.9117647058823529

``````model_sv.score(X=explanatory, y=target)
``````

0.6190476190476191

``````df_pred.head(10)
``````  