Why all Machine Learning Models are the Same?

Understand how all Machine Learning Models follow the same procedure over and over again

Jesús López
·Oct 1, 2021·


Table of contents

  • Introduction
  • Load the Data
  • Data Preprocessing
  • Feature Selection
  • ML Models
  • Comparing Predictions
  • Choose Best Model

Introduction

It's tough to find things that always work the same way in programming.

The steps of a Machine Learning (ML) model can be an exception.

Each time we want to compute a model (a mathematical equation) and make predictions with it, we always take the same steps:

  1. model.fit() → to compute the numbers of the mathematical equation.
  2. model.predict() → to calculate predictions with the mathematical equation.
  3. model.score() → to measure how good the model's predictions are.

And I am going to show you this with three different ML models:

  • DecisionTreeClassifier()
  • SVC()
  • KNeighborsClassifier()
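Before loading the real dataset, here is a minimal sketch of the three universal steps on a tiny made-up dataset (toy numbers, not the article's data):

```python
# A minimal sketch of the three universal steps, using toy data
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # explanatory variables
y = [0, 1, 0, 1]                      # target variable

model = DecisionTreeClassifier()
model.fit(X, y)                  # 1. compute the numbers of the equation
predictions = model.predict(X)   # 2. calculate predictions with the equation
accuracy = model.score(X, y)     # 3. measure how good the predictions are
```

On this perfectly separable toy data the tree scores 1.0; the exact same three calls reappear with every model below.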

Load the Data

But first, let's load a dataset from the CIS by executing the lines of code below. The goal with this dataset is:

  • to predict the internet_usage of people (rows)
  • based on their socio-demographic characteristics (columns)

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jsulopz/data/main/internet_usage_spain.csv')
df.head()
   internet_usage     sex  age   education
0               0  Female   66  Elementary
1               1    Male   72  Elementary
2               1    Male   48  University
3               0    Male   59         PhD
4               1  Female   44         PhD

Data Preprocessing

We need to transform the categorical variables to dummy variables before computing the models:

df = pd.get_dummies(df, drop_first=True)
df.head()

(Output: the first five rows now contain dummy columns such as sex_Male and one column per remaining education level.)

Feature Selection

Now we separate the variables according to their respective roles within the model:

target = df.internet_usage
explanatory = df.drop(columns='internet_usage')

ML Models

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X=explanatory, y=target)

pred_dt = model.predict(X=explanatory)
accuracy_dt = model.score(X=explanatory, y=target)

Support Vector Machines

from sklearn.svm import SVC

model = SVC()
model.fit(X=explanatory, y=target)

pred_sv = model.predict(X=explanatory)
accuracy_sv = model.score(X=explanatory, y=target)

K Nearest Neighbour

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X=explanatory, y=target)

pred_kn = model.predict(X=explanatory)
accuracy_kn = model.score(X=explanatory, y=target)

The only thing that changes is the results of the predictions. The models are different, but they all follow the same steps we described at the beginning:

  1. model.fit() → to compute the mathematical formula of the model
  2. model.predict() → to calculate predictions through the mathematical formula
  3. model.score() → to get the success ratio of the model
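Because the steps never change, the three blocks above can be collapsed into a single loop. The sketch below uses a small stand-in DataFrame so it runs on its own; swap in the explanatory and target variables from the Feature Selection step to reproduce the article's numbers:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data with the same shape as the preprocessed CIS dataset
explanatory = pd.DataFrame({'age':      [66, 72, 48, 59, 44, 30, 25, 80],
                            'sex_Male': [0, 1, 1, 1, 0, 1, 0, 0]})
target = pd.Series([0, 1, 1, 0, 1, 1, 1, 0], name='internet_usage')

results = {}
for model in (DecisionTreeClassifier(), SVC(), KNeighborsClassifier()):
    model.fit(X=explanatory, y=target)    # 1. fit the equation
    model.predict(X=explanatory)          # 2. predict with it
    results[type(model).__name__] = model.score(X=explanatory, y=target)  # 3. score it
```

The loop works because every scikit-learn estimator exposes the same fit/predict/score interface, which is precisely the point of the article.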

Comparing Predictions

The following table shows how the different models make different predictions, which do not always coincide with reality (misclassification).

For example, the SVC model misclassifies row 214: it predicts that this person uses the internet (pred_svm=1), when in reality they don't (internet_usage=0).

df_pred = pd.DataFrame({'internet_usage': df.internet_usage,
                        'pred_dt': pred_dt,
                        'pred_svm': pred_sv,
                        'pred_kn': pred_kn})

df_pred.sample(10, random_state=7)
      internet_usage  pred_dt  pred_svm  pred_kn
214                0        0         1        0
2142               1        1         1        1
1680               1        0         0        0
1522               1        1         1        1
325                1        1         1        1
2283               1        1         1        1
1263               0        0         0        0
993                0        0         0        0
26                 1        1         1        1
2190               0        0         0        0

Choose Best Model

We can then choose the model with the highest accuracy, i.e., the one whose predictions most often match reality.

df_accuracy = pd.DataFrame({'accuracy': [accuracy_dt, accuracy_sv, accuracy_kn]},
                           index = ['DecisionTreeClassifier()', 'SVC()', 'KNeighborsClassifier()'])

df_accuracy
                          accuracy
DecisionTreeClassifier()  0.859878
SVC()                     0.783707
KNeighborsClassifier()    0.827291
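If you want pandas to name the winner for you, one option is idxmax on the accuracy column; the table is rebuilt below (numbers copied from the output above) so the snippet runs on its own:

```python
import pandas as pd

# Accuracy table rebuilt from the output above
df_accuracy = pd.DataFrame({'accuracy': [0.859878, 0.783707, 0.827291]},
                           index=['DecisionTreeClassifier()', 'SVC()',
                                  'KNeighborsClassifier()'])

best = df_accuracy['accuracy'].idxmax()  # row label with the highest accuracy
```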

Which is the best model here?

  • Let me know in the comments below ↓
