This morning, I read the Economist Espresso on India's pollution season, and I thought it was a good idea to start the series of challenges with this topic.

After navigating many websites, such as India's Central Pollution Control Board and WHO, I found this website about Air Quality Data where we can download the data from many places worldwide.

I chose Delhi to be the city we will analyze in this challenge.

Executing the following lines of code will produce the DataFrame we'll work with:

```
import pandas as pd
df = pd.read_csv('anand-vihar, delhi-air-quality.csv', parse_dates=['date'], index_col=0)
df
```

I needed to process the data to deliver a workable dataset in the following way:

```
# remove whitespace from the column names
df.columns = df.columns.str.strip()
# extract the numeric part of each row (some rows were only whitespace)
series = pd.to_numeric(df['pm25'].str.extract(r'(\w+)')[0], errors='coerce')
# 30-day rolling average to smooth the daily data
series_monthly = series.rolling(30).mean()
# drop the leading missing values produced by the rolling window
series_monthly = series_monthly.dropna()
# fill remaining gaps by linear interpolation
series_monthly = series_monthly.interpolate(method='linear')
# sort the index to later make a reasonable plot
series_monthly = series_monthly.sort_index()
# aggregate the information by month
series_monthly = series_monthly.to_period('M').groupby(level='date').mean()
# convert back to timestamps to avoid errors with statsmodels' functions
series_monthly = series_monthly.to_timestamp()
# set an explicit monthly frequency to avoid errors with statsmodels' functions
series_monthly = series_monthly.asfreq("MS").interpolate()
# name the pandas.Series
series_monthly.name = 'air pollution pm25'
```
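The pipeline above needs the downloaded CSV. As a self-contained sketch, the same steps can be run on synthetic daily data (the dates and values below are made up, and `resample('MS')` stands in for the `to_period`/`groupby`/`asfreq` sequence):

```python
import numpy as np
import pandas as pd

# Synthetic daily PM2.5-like values over three years (made-up numbers)
dates = pd.date_range('2019-01-01', '2021-12-31', freq='D')
rng = np.random.default_rng(0)
values = 100 + 30 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)
series = pd.Series(values + rng.normal(0, 5, len(dates)), index=dates)

# 30-day rolling mean, then aggregate to monthly means on a month-start index
series_monthly = (
    series.rolling(30).mean()  # smooth the daily noise
          .dropna()            # drop the leading NaNs from the window
          .sort_index()
          .resample('MS')      # month-start frequency, like asfreq("MS")
          .mean()
          .interpolate()       # fill any remaining gaps
)
series_monthly.name = 'air pollution pm25'
```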

As we don't know the coding skills of each Study Circle member, we'll start with simple ARIMA models. From this point, we will iterate on the procedure and improve the dynamics.

To take on the challenge and maybe receive some feedback, you should fork this repository to your GitHub account. Otherwise, you can download this script.

The end goal is to develop an ARIMA model and plot the predictions against the actual data, resulting in a plot like this one.

Nevertheless, you can develop this challenge in any way you find attractive. The essential point of this Study Circle is the interactivity between the members to generate value and knowledge.

From your feedback, we could later work on different use cases. For example, we could later create a geospatial map in Python with the predictions.

So, let's get on and good luck!

You start with the following object:

Check out the following materials to learn how you could develop the challenge:

- Video Tutorial: How to develop ARIMA models to predict Stock Price

```
series_monthly
```

```
date
2014-01-01 286.023457
2014-02-01 281.428205
...
2022-08-01 115.487097
2022-09-01 143.713333
Freq: MS, Name: air pollution pm25, Length: 105, dtype: float64
```

Observing the data as numbers is not the same as seeing it in a chart:

```
series_monthly.plot();
```

We aim to compute a mathematical equation that we will later use to calculate predictions, as we can see in the following chart:

There are many types of mathematical equations; the one we will use is `ARIMA`. Don't worry about the maths: there's a Python function that does it all for us.

```
from statsmodels.tsa.arima.model import ARIMA
```

The parameters of this class ask for two objects:

- `endog`: the data
- `order`: a tuple `(p, d, q)`, where
  - `p`: the first significant lag in the Partial Autocorrelation Plot
  - `d`: the number of differences needed to make our data stationary
  - `q`: the first significant lag in the Autocorrelation Plot

**`d` | Differencing to get stationarity**

The first thing we need to check about our data is stationarity. We use the Augmented Dickey-Fuller test, intending to reject the null hypothesis that the data is non-stationary. If we can't reject it, we need to difference the time series and set `d=1` in the `order=(p, d, q)` parameter.
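As a sketch of the decision rule (the p-value below is hypothetical, standing in for what `adfuller` returns):

```python
# Hypothetical p-value from the Augmented Dickey-Fuller test
p_value = 0.42
ALPHA = 0.05

# Reject the null (non-stationarity) only if p < alpha; otherwise difference once
d = 0 if p_value < ALPHA else 1
```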

```
from statsmodels.tsa.stattools import adfuller
result = adfuller(series_monthly)
```

The p-value is given by the second element of the tuple `adfuller` returns:

```
result[1]
```

```
-> 0.4244071993737921
```

The p-value is greater than 0.05. Therefore, we can't reject the null hypothesis.

Are we done here?

- No, we can difference the Time Series by one lag and test again:

```
series_monthly_diff_1 = series_monthly.diff().dropna()
result = adfuller(series_monthly_diff_1)
result[1]
```

```
-> 2.4066471086483724e-24
```

We can reject the null hypothesis and say that our data is stationary after one differencing. Therefore, we need to set `d=1` in the `order` parameter of the `ARIMA()` class.

**`q` | Autocorrelation Plot**

Now we need to determine `q` based on the first significant lag of the autocorrelation plot:

```
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

plot_acf(series_monthly_diff_1, lags=50)
plt.xlabel('Lag');
```

The first significant lag is 2, which means that our differenced (monthly) data is correlated every two months. We set `q=2`.
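The values `plot_acf` draws can also be computed by hand. A minimal sketch of the lag-k autocorrelation with plain NumPy, on a synthetic series (the `autocorr` helper is ours, not a statsmodels function):

```python
import numpy as np

def autocorr(x, k):
    """Lag-k autocorrelation: the series correlated with itself shifted by k."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[k:] * x[:-k]) / np.sum(x * x))

# A series repeating every 12 steps correlates strongly at lag 12
x = np.sin(np.arange(120) * 2 * np.pi / 12)
print(round(autocorr(x, 12), 2))  # 0.9
```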

**`p` | Partial Autocorrelation Plot**

We follow the same procedure to choose a number for `p`, but this time we use another type of plot: the Partial Autocorrelation.

```
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

plot_pacf(series_monthly_diff_1, lags=50, method='ywm')
plt.xlabel('Lag');
```

We see the first significant lag at 2. Therefore, we set `p=2`.

We already know which numbers to set in the `order` parameter: `order=(2, 1, 2)`. So, let's fit the mathematical equation of the model.

```
model = ARIMA(series_monthly, order=(2,1,2))
result = model.fit()
result.summary()
```

And calculate the predictions:

```
import matplotlib.pyplot as plt
plt.figure(figsize=(6,4))
series_monthly.plot(label='Actual Data')
result.predict().plot(label='Predicted Data')
plt.legend()
plt.xticks(rotation=45);
```

Avoid `for` loops if you want to summarise your daily Time Series by years. Instead, use the function `resample()` from pandas.

Let me explain it with an example.

We start by loading a DataFrame from a CSV file that contains information on the TSLA stock from 2017-2022.

```
import pandas as pd
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tsla = pd.read_csv(filepath_or_buffer=url)
df_tsla
```

cc: @elonmusk

You're welcome for the promotion 😉

You must ensure that the `Date` column's dtype is `datetime64`.

It must not be `object` as in the picture (an `object` column is often interpreted as strings).

```
df_tsla.dtypes.to_frame(name='dtype')
```

We need to convert the Date column into a `datetime` dtype. To do so, we can use the function `pd.to_datetime()`:

```
df_tsla.Date = pd.to_datetime(df_tsla.Date)
df_tsla.dtypes.to_frame(name='dtype')
```

Before getting into the resample() function, we need to **set the column Date as the index** of the DataFrame:

```
df_tsla.set_index('Date', inplace=True)
df_tsla
```

Now let the magic happen; we'll get the maximum value of each column by each year with this simple line of code:

```
df_tsla.resample(rule='Y').max()
```

We can do many other things:

- Summarise by Quarter.
- Calculate the average and the standard deviation (volatility).

```
df_tsla.resample(rule='Q').agg(['mean', 'std'])
```

To finish, I always like to add a `background_gradient()` to the DataFrame:

```
df_tsla.resample(rule='Y').max().style.background_gradient('Greens')
```
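The examples above need the remote CSV. Here's a self-contained sketch of `resample()` on made-up daily prices; it uses the start-of-period aliases `'YS'`/`'QS'`, which behave like `'Y'`/`'Q'` but label the bins by period start:

```python
import numpy as np
import pandas as pd

# Made-up daily closing prices over two years
idx = pd.date_range('2020-01-01', '2021-12-31', freq='D')
close = pd.Series(np.linspace(100, 300, len(idx)), index=idx, name='Close')

yearly_max = close.resample('YS').max()                # one row per year
quarterly = close.resample('QS').agg(['mean', 'std'])  # two stats per quarter
```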

If you enjoyed this, I'd appreciate it if you could support my work by spreading the word 😊

Sometimes, we want to select specific parts of the DataFrame to highlight some data points.

In this case, we refer to the topic as locating & filtering.

For example, let's load the dataset of cars:

```
import seaborn as sns
df_mpg = sns.load_dataset('mpg', index_col='name').drop(columns=['cylinders', 'model_year', 'origin'])
df_mpg
```

Our goal is to filter the best car in each statistic/column.

First, we calculate the maximum values in each column:

```
df_mpg.max()
```

```
mpg 46.6
displacement 455.0
horsepower 230.0
weight 5140.0
acceleration 24.8
dtype: float64
```

Then, we create a mask (array with True/False) to capture the rows where we have the cars with maximum values:

```
mask_max = (df_mpg == df_mpg.max()).sum(axis=1) > 0
mask_max
```

```
name
chevrolet chevelle malibu False
buick skylark 320 False
...
ford ranger False
chevy s-10 False
Length: 398, dtype: bool
```

Select the rows where the mask is True:

```
df_mpg_max = df_mpg[mask_max].copy()
df_mpg_max
```

And add some styling:

```
df_mpg_max.style.format('{:.0f}').background_gradient()
```

To understand the reasoning behind the previous example, read the rest of the article, where we explain the logic from the most basic example to locating data based on the index.

By now, we should know the difference between the brackets `[]` and the parentheses `()`.

We use brackets to select parts of an object. For example, let's create a list of days:

```
list_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
```

And select the second element:

```
list_days[1]
```

```
'Tuesday'
```

Or the last element:

```
list_days[-1]
```

```
'Sunday'
```

Until the third element (included):

```
list_days[:3]
```

```
['Monday', 'Tuesday', 'Wednesday']
```

Nevertheless, the `list` is a basic Python object. To get more functionality, we use the `Series` object from the `pandas` library.

Let's create a `Series` to store the **Apple Stock Return on Investment (ROI)** by quarters:

```
import pandas as pd
sr_apple = pd.Series(
    data=[59.02, 63.57, 66.93, 69.05],
    index=['1Q', '2Q', '3Q', '4Q']
)
sr_apple
```

```
1Q 59.02
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

**`iloc` (integer-location) property**

We use `.iloc[]` to select parts of the object based on the integer position of the element.

For example, let's select the first quarter ROI:

```
sr_apple.iloc[0]
```

```
59.02
```

Now, let's select the first, third, and fourth quarters.

To select more than one element, we need to use double brackets `[[]]`:

```
sr_apple.iloc[[0,2,3]]
```

```
1Q 59.02
3Q 66.93
4Q 69.05
dtype: float64
```

Could we have accessed it by its label instead?

```
sr_apple.iloc['Q1']
```

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [99], in <cell line: 1>()
----> 1 sr_apple.iloc['Q1']
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.__getitem__(self, key)
964 axis = self.axis or 0
966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1517, in _iLocIndexer._getitem_axis(self, key, axis)
1515 key = item_from_zerodim(key)
1516 if not is_integer(key):
-> 1517 raise TypeError("Cannot index by location index with a non-integer key")
1519 # validate the location
1520 self._validate_integer(key, axis)
TypeError: Cannot index by location index with a non-integer key
```

The `iloc` property only works with **integers** (the positions of the subelements we want).

To select the elements by their **label/name**, we need to use the `loc` property:

**`loc` (location) property**

We select parts of an object with `.loc[]` based on the **label/name** of the `index`:

```
sr_apple.loc['1Q']
```

```
59.02
```

```
sr_apple.loc[['1Q', '3Q', '4Q']]
```

```
1Q 59.02
3Q 66.93
4Q 69.05
dtype: float64
```

If we tried to access by position instead, we'd get an error:

```
sr_apple.loc[0]
```

```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Input In [102], in <cell line: 1>()
----> 1 sr_apple.loc[0]
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.__getitem__(self, key)
964 axis = self.axis or 0
966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1202, in _LocIndexer._getitem_axis(self, key, axis)
1200 # fall thru to straight lookup
1201 self._validate_key(key, axis)
-> 1202 return self._get_label(key, axis=axis)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1153, in _LocIndexer._get_label(self, label, axis)
1151 def _get_label(self, label, axis: int):
1152 # GH#5667 this will fail if the label is not present in the axis.
-> 1153 return self.obj.xs(label, axis=axis)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:3864, in NDFrame.xs(self, key, axis, level, drop_level)
3862 new_index = index[loc]
3863 else:
-> 3864 loc = index.get_loc(key)
3866 if isinstance(loc, np.ndarray):
3867 if loc.dtype == np.bool_:
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
3626 # InvalidIndexError. Otherwise we fall through and re-raise
3627 # the TypeError.
3628 self._check_indexing_error(key)
KeyError: 0
```

It results in a `KeyError` because there is no key `0` in the `index`:

```
sr_apple
```

```
1Q 59.02
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

We have:

```
sr_apple.keys()
```

```
Index(['1Q', '2Q', '3Q', '4Q'], dtype='object')
```

The `loc` property only works **with the labels, not the positions**.

Now we'd like to select parts based on a condition. For example, let's show the quarters in which the Return on Investment (ROI) was above 60.

First, we create a boolean object based on the stated condition:

```
sr_apple
```

```
1Q 59.02
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

```
sr_apple > 60
```

```
1Q False
2Q True
3Q True
4Q True
dtype: bool
```

```
mask_60 = sr_apple > 60
```

Now we pass the previous object to the `.loc` property:

```
sr_apple.loc[mask_60]
```

```
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

And here, we have the data for which the ROI is higher than 60.

**Just the brackets `[]`**

```
sr_apple
```

```
1Q 59.02
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

We could also access the data by only using the brackets, without the `.iloc` property:

```
sr_apple['1Q']
```

```
59.02
```

And also by the position:

```
sr_apple[0]
```

```
59.02
```

And the mask:

```
sr_apple[mask_60]
```

```
2Q 63.57
3Q 66.93
4Q 69.05
dtype: float64
```

So far, we have played with **1-Dimensional** objects. Now it's time to level up and play with **2-Dimensional** objects, like the `DataFrame`.

Let's play with a dataset of cars:

```
import seaborn as sns
df_mpg = sns.load_dataset(name='mpg', index_col='name')
df_mpg
```

**`iloc` (integer-location) property**

We can select the third row (position 2):

```
df_mpg.iloc[2]
```

```
mpg 18.0
cylinders 8
displacement 318.0
horsepower 150.0
weight 3436
acceleration 11.0
model_year 70
origin usa
Name: plymouth satellite, dtype: object
```

And keep the `DataFrame` form if we use double brackets `[[]]`:

```
df_mpg.iloc[[2]]
```

We can also **slice** (a term used for filtering as well) consecutive elements of the DataFrame with the **colon** `:`.

For example, let's select the first 4 rows:

```
df_mpg.iloc[:4]
```

Instead of:

```
df_mpg.iloc[[0,1,2,3]]
```

We can also select the columns we want.

For example, let's select the first 3 columns:

```
df_mpg.iloc[:4, :3]
```

Or the rest of the columns from the fourth position (index 3) onwards:

```
df_mpg.iloc[:4, 3:]
```

Or the last 3 columns by using negative positions with `-`:

```
df_mpg.iloc[:4, -3:]
```

**`loc` (location) property**

We can also select parts of the DataFrame based on the **index and column labels** (2 dimensions):

```
df_mpg.loc[['ford torino', 'fiat 124 sport coupe'], ['origin', 'model_year', 'cylinders']]
```

```
df_mpg.loc[:'fiat 124 sport coupe', :'cylinders']
```

Out of all the cars:

```
df_mpg.index
```

```
Index(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite',
'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala',
'plymouth fury iii', 'pontiac catalina', 'amc ambassador dpl',
...
'chrysler lebaron medallion', 'ford granada l', 'toyota celica gt',
'dodge charger 2.2', 'chevrolet camaro', 'ford mustang gl', 'vw pickup',
'dodge rampage', 'ford ranger', 'chevy s-10'],
dtype='object', name='name', length=398)
```

We could select all the **fiat** cars if we had a boolean array based on this condition:

```
mask_fiat = df_mpg.index.str.contains('fiat')
mask_fiat
```

```
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       ...
       False, False, False, False, False, False, False, False, False,
       False, False])
```

We can observe a few `True`s where we find some **Fiats**.

Let's filter them and show all the columns with the colon `:`:

```
df_mpg.loc[mask_fiat, :]
```

Although we could have omitted the `:`:

```
df_mpg.loc[mask_fiat]
```

**The `&` operator**

Let's select just the Fiats whose horsepower is above 80:

```
mask_hp = df_mpg.horsepower > 80
mask_hp
```

```
name
chevrolet chevelle malibu True
buick skylark 320 True
...
ford ranger False
chevy s-10 True
Name: horsepower, Length: 398, dtype: bool
```

```
df_mpg.loc[mask_hp & mask_fiat, :]
```

**The `|` operator**

We could also select all Fiats **OR** cars whose horsepower is above 80:

```
df_mpg.loc[mask_hp | mask_fiat, :]
```

**Just the brackets `[]`**

We can select the columns by their labels:

```
df_mpg['acceleration']
```

```
name
chevrolet chevelle malibu 12.0
buick skylark 320 11.5
...
ford ranger 18.6
chevy s-10 19.4
Name: acceleration, Length: 398, dtype: float64
```

```
df_mpg[['acceleration', 'origin', 'model_year']]
```

But we can't select the rows by the index labels:

```
df_mpg['amc rebel sst']
```

```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
...
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
3626 # InvalidIndexError. Otherwise we fall through and re-raise
3627 # the TypeError.
3628 self._check_indexing_error(key)
KeyError: 'amc rebel sst'
```

Unless we use the colon `:` to slice:

```
df_mpg[:'amc rebel sst']
```

```
df_mpg['buick skylark 320':'amc rebel sst']
```

We can also select the rows by position:

```
df_mpg[:4]
```

But we can't select both rows and columns (2-Dimensions):

```
df_mpg[:4,:3]
```

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
...
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:5637, in Index._check_indexing_error(self, key)
5633 def _check_indexing_error(self, key):
5634 if not is_scalar(key):
5635 # if key is not a scalar, directly raise an error (the code below
5636 # would convert to numpy arrays and raise later any way) - GH29926
-> 5637 raise InvalidIndexError(key)
InvalidIndexError: (slice(None, 4, None), slice(None, 3, None))
```

Unless we chain extra brackets specifying the columns we want:

```
df_mpg[:4]['acceleration']
```

```
name
chevrolet chevelle malibu 12.0
buick skylark 320 11.5
plymouth satellite 11.0
amc rebel sst 12.0
Name: acceleration, dtype: float64
```

```
df_mpg[:4][['acceleration']]
```

```
df_mpg[:4][['acceleration', 'origin']]
```

We can also select the rows given *boolean-arrays* (a.k.a. **masks**):

```
df_mpg[mask_fiat]
```

```
df_mpg[mask_fiat | mask_hp]
```

```
df_mpg[mask_fiat & mask_hp]
```

It doesn't mean we can't later select the columns we want (in programming, we just need to find a way):

```
df_mpg[mask_fiat & mask_hp]['mpg']
```

```
name
fiat 124 sport coupe 26.0
fiat 131 28.0
Name: mpg, dtype: float64
```

```
df_mpg[mask_fiat & mask_hp][['mpg', 'origin', 'model_year']]
```

Everything may be a bit confusing, but we hope you get the main idea behind `locating` and `masking`:

- We select parts of an object with brackets `[]`
- We can access the elements through:
  - the label/name: `loc`
  - the integer position: `iloc`
  - masks: *boolean arrays* based on conditions
  - just the brackets `[]`*
- Depending on the object's dimensions:
  - 1 dimension: `object[:]`
  - 2 dimensions: `object[:, :]`

*Use carefully because it has many use-case variations, as we observed above.
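To tie the summary together, a small sketch showing the three access routes landing on the same value (the Series is the quarterly ROI from above):

```python
import pandas as pd

sr_apple = pd.Series(
    data=[59.02, 63.57, 66.93, 69.05],
    index=['1Q', '2Q', '3Q', '4Q']
)

by_label = sr_apple.loc['2Q']                  # the label/name -> loc
by_position = sr_apple.iloc[1]                 # the integer position -> iloc
by_mask = sr_apple[sr_apple == 63.57].iloc[0]  # a mask, then its first hit

print(by_label == by_position == by_mask)  # True
```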

Let's load a dataset with various categorical columns since we summarise data based on categories, not numbers.

```
df_tips = sns.load_dataset(name='tips')
df_tips
```

Let's make a pivot table to summarise the information to obtain a Hierarchical* DataFrame.

```
dfres = df_tips.pivot_table(index=['smoker', 'time'], columns='sex', aggfunc='size')
dfres
```

*A Hierarchical DataFrame (MultiIndex) contains two "columns" as an index. As we may observe below:

```
dfres.index
```

```
MultiIndex([('Yes', 'Lunch'),
('Yes', 'Dinner'),
( 'No', 'Lunch'),
( 'No', 'Dinner')],
names=['smoker', 'time'])
```

Let's locate some parts of the Hierarchical DataFrame:

```
dfres
```

By using the `.loc`

property:

```
dfres.loc['Yes', :]
```

```
dfres.loc['No', :]
```

As we have multiple index levels `[index1, index2, columns]`, we can select a part of the second index:

```
dfres.loc[:, 'Lunch', :]
```

```
dfres.loc[:, 'Dinner', :]
```

Let's now play with a DataFrame that has both a `MultiIndex` and `MultiColumns`:

```
dfres = df_tips.pivot_table(index=['smoker', 'time'], columns=['sex', 'day'], aggfunc='size')
dfres
```

We may observe two levels in the columns above.

**`loc` (location) property**

We apply the same reasoning we used in the previous sections: `[index1, index2, column1, column2]`.

```
dfres.loc['No', :, :, :]
```

Although we can make it shorter:

```
dfres.loc['No', :]
```

The same applies to the second index:

```
dfres.loc[:,'Dinner', :, :]
```

```
dfres.loc[:,'Dinner', :]
```

Let's try to get Dinners on Sundays:

```
dfres.loc[:, 'Dinner', :, 'Sun']
```

```
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [158], in <cell line: 1>()
----> 1 dfres.loc[:, 'Dinner', :, 'Sun']
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:961, in _LocationIndexer.__getitem__(self, key)
959 if self._is_scalar_access(key):
960 return self.obj._get_value(*key, takeable=self._takeable)
--> 961 return self._getitem_tuple(key)
962 else:
963 # we by definition only have the 0th axis
964 axis = self.axis or 0
...
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/frozen.py:70, in FrozenList.__getitem__(self, n)
68 if isinstance(n, slice):
69 return type(self)(super().__getitem__(n))
---> 70 return super().__getitem__(n)
IndexError: list index out of range
```

To make it work, this time we need to create an intermediate object to separate rows and columns:

```
idx = pd.IndexSlice
dfres.loc[idx[:, 'Dinner'], idx[:, 'Sun']]
```

```
dfres.loc[idx[:, 'Dinner'], idx['Male', :]]
```
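Since `dfres` comes from the seaborn dataset, here is a self-contained toy frame showing the same `pd.IndexSlice` pattern (the counts are arbitrary):

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['Yes', 'No'], ['Lunch', 'Dinner']],
                                  names=['smoker', 'time'])
cols = pd.MultiIndex.from_product([['Male', 'Female'], ['Sat', 'Sun']],
                                  names=['sex', 'day'])
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Lexsort both axes so label-based slicing is allowed
toy = toy.sort_index().sort_index(axis=1)

idx = pd.IndexSlice
dinner_sun = toy.loc[idx[:, 'Dinner'], idx[:, 'Sun']]
print(dinner_sun.shape)  # (2, 2): two smoker groups x two sexes
```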

We can also use the built-in `slice()` function:

```
dfres.loc[('Yes', slice(None)), (slice(None), 'Sun')]
```

```
dfres.loc['Yes', ('Female', slice(None))]
```

```
dfres.loc[(slice(None), 'Lunch'), 'Female']
```

```
dfres.loc[(slice(None), 'Lunch'), ('Female', slice(None))]
```

```
dfres.loc[idx[:, 'Dinner'], idx['Female', :]]
```

**`iloc` (integer-location) property**

```
dfres
```

As always, we can select by the position of the values with the `iloc`

property:

```
dfres.iloc[:2, :2]
```

```
dfres.iloc[:2, 2:]
```

Now, we will use a DataFrame that has a `DatetimeIndex`:

```
df_tsla = pd.read_excel('tsla_stock.xlsx', index_col=0)
df_tsla
```

**`loc` (location) property**

We can select parts of the DataFrame based on just one part of the `DatetimeIndex`. For example, we can select everything from the year 2020 onwards:

```
df_tsla.loc['2020':]
```

Until the last day of 2020:

```
df_tsla.loc[:'2020']
```

Between two years:

```
df_tsla.loc['2019':'2020']
```

One complete year:

```
df_tsla.loc['2019']
```

We can even select a specific `year-month`:

```
df_tsla.loc['2019-06']
```
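The Excel file may not be at hand; partial-string indexing works on any `DatetimeIndex`, so here is a quick self-contained check (the values are made up):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2019-01-01', '2020-12-31', freq='D')
prices = pd.DataFrame({'Close': np.arange(len(idx), dtype=float)}, index=idx)

whole_2019 = prices.loc['2019']         # the complete year
june_2019 = prices.loc['2019-06']       # a specific year-month
both_years = prices.loc['2019':'2020']  # a range of years
print(len(whole_2019), len(june_2019), len(both_years))  # 365 30 731
```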

**`iloc` (integer-location) property**

Of course, we can also select parts of the DataFrame based on the position of the values with `iloc`:

```
df_tsla.iloc[:4, :3]
```

```
df_tsla.iloc[-4:, :3]
```

We lack a **framework** to solve problems.

Introducing the *Resolving Python Method*, a framework you can master to reach your solutions faster without even looking at Google.

Stick to the following statement every time you need code to solve a problem:

Programming is nothing more than applying functions to objects to transform them into other objects.

```
Function (Object A) -> Object B
```

For example, to develop a Machine Learning model, you pass the DataFrame (`Object A`) to fit (`Function`) the Mathematical Equation (`Object B`).
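The same statement in three lines of plain Python:

```python
# Function (Object A) -> Object B
text = "resolving python"   # Object A: a string
words = text.split()        # str.split transforms it into Object B: a list
n_words = len(words)        # built-in len transforms it into Object C: an int
print(words, n_words)       # ['resolving', 'python'] 2
```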

Even though functions live inside libraries, the libraries don't perform the algorithms themselves (they just store functions). Therefore, your first question should be:

Which `Function()` do we need to transform `Object A` into `Object B`?

In Python, we have three ways to access functions:

- `object.function()`
- `library.function()`
- `built_in_function()`
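One line of each, for illustration:

```python
import math

print("delhi".upper())    # object.function()    -> 'DELHI'
print(math.sqrt(9))       # library.function()   -> 3.0
print(len([1, 2, 3]))     # built_in_function()  -> 3
```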

**Most of the functions we use in a Python script come from the Object.**

If we press the `[tab]` key after the dot `.`, Python suggests a list of the functions we can use from the Object.

```
object. + [tab]
```
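Outside a notebook, where `[tab]` completion isn't available, the built-in `dir()` lists the same names:

```python
# dir() returns the attribute and method names available on an object
methods = dir("any string")
print('upper' in methods, 'split' in methods)  # True True
```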

If the function we are looking for doesn't appear in the list, we think about the library in which the function might be.

```
import library
library. + [tab]
```

Each library contains functions of a specific topic.

- NumPy: Mathematical Operations with Numbers
- Scikit-Learn: Machine Learning algorithms
- Matplotlib: Mathematical Plots
- Seaborn: Plots
- Pandas: Data Analysis

Take a look at this article to understand why Python is the programming language of the present and the future.

Okay, you've already got a reason to learn Python: you will have more chances because Python-related job offers will grow more and more over the coming years.

As the previous article states, you may end up in a job earning 65k a year; you may not worry so much about money when making decisions as you move forward in life!

IT Jobs Watch, a website that specialises in collating salary data across the IT industry, states that the median annual salary in the UK for a role requiring Python skills is 65,000.

Before getting there, keep your feet on Earth because you need to master Python.

Don't think you need a Computer Science degree or a Data Science master's to master Python before getting the job.

The best motivation to learn anything is that they pay you for it. Therefore, it would help to prioritise getting a job where you increase your Python skills.

Companies care about getting shit done. Therefore, you must show them what you know about programming and how you program.

- Complete online courses to show what you know with certificates
- Solve their assignments
- Get your own data and experiment with the learnt concepts
- Showcase your knowledge on GitHub to show how you program (you may look at Edo's profile to see his portfolio)

He followed our advice and got a job in two months. He applied to around a hundred job offers on LinkedIn where recruiters could see his certifications.

Make it easy at the beginning with easy-to-understand Python code.

Some people use scripts to code. It'd help if you turned to the notebook format instead, because you can see the output of every line right away. Follow this tutorial to install Jupyter Lab, the best program to work with notebooks, and write your first lines of Python code.

I have found Data Visualization to be the best starting topic because you immediately see how the output changes as you change the code. It gives you a massive dose of energy.

Follow this Data Visualization tutorial to get a complete overview of Data Visualization development in Python. Then, play around with the lines of code: add more data points to the plots or change the colour of the figures.

Once you are motivated and comfortable using Python, it is time to follow a proper learning path.

You can follow any roadmap, but please make sure you don't overestimate your skills and start developing Neural Networks if you don't know how to create a simple Linear Regression.

You can follow Edo's roadmap by looking at his certifications:

You can also read the following thread, where I placed links to practical exercises you can use in your portfolio.

The time has come to add another layer to the hierarchy of Machine Learning models.

Do we have the variable we want to predict in the dataset?

YES: **Supervised Learning**

- Predicting a Numerical Variable: Regression
- Predicting a Categorical Variable: Classification

NO: **Unsupervised Learning**

- Grouping Data Points based on Explanatory Variables: Cluster Analysis

We may have, for example, all football players, and we want to group them based on their performance. But we don't know the groups beforehand. So what do we do then?

We apply Unsupervised Machine Learning models to group the players based on their position in the space (determined by the explanatory variables): the closer the players are in that space, the more likely they'll be drawn into the same group.

Another typical example comes from e-commerce companies that don't know whether their customers prefer clothing or tech. But they know how the customers interact with the website. Therefore, they group the customers to send promotional emails that align with their tastes.

In short, we close the circle with the different types of Machine Learning models by adding this new type.

Let's now develop the Python code.

Imagine for a second you are the President of the United States of America, and you are considering creating campaigns to reduce **car accidents due to alcohol** consumption while controlling for **insurance companies' losses** (columns).

You won't create 51 TV campaigns, one for each of the **States of the USA** (rows, including the District of Columbia). Instead, you will see which States behave similarly and cluster them into three groups.

```
import seaborn as sns
import pandas as pd
df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'ins_losses']]
df_crashes
```

We don't have any missing data in any of the columns:

```
df_crashes.isna().sum()
```

```
alcohol 0
ins_losses 0
dtype: int64
```

Nor do we need to convert categorical columns to *dummy variables*, because the two we are considering are numerical.

```
df_crashes
```

We should know from previous chapters that we need a function accessible from a Class in the `sklearn` library.

```
from sklearn.cluster import KMeans
```

Create an instance (a copy of the original code blueprint) so as not to "modify" the source code.

```
model_km = KMeans()
```

The theoretical action we'd like to perform is the one we executed in previous chapters. Therefore, the function to compute the Machine Learning model should be called the same way:

```
model_km.fit()
```

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 model_km.fit()
TypeError: fit() missing 1 required positional argument: 'X'
```

The previous types of models asked for two parameters:

- `y`: target ~ dependent ~ label ~ class variable
- `X`: explanatory ~ independent ~ feature variables

Why is it asking for just one parameter now, `X`?

As we said before, this type of model (unsupervised learning) doesn't know the groups beforehand; they are known only after we compute the Machine Learning model. Therefore, it doesn't need to see the target variable `y`.

We don't need to separate the variables either, because all of them are explanatory.

```
model_km.fit(X=df_crashes)
```

```
KMeans()
```

We have a fitted `KMeans`. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

```
model_km.predict(X=df_crashes)
```

```
array([7, 3, 0, 7, 6, 7, 6, 1, 3, 7, 7, 5, 4, 7, 0, 5, 3, 3, 2, 4, 2, 3,
1, 3, 1, 7, 4, 5, 7, 5, 1, 5, 1, 3, 0, 3, 6, 0, 1, 1, 5, 4, 1, 1,
0, 0, 1, 0, 1, 0, 5], dtype=int32)
```

We wanted to calculate three groups, but Python is calculating eight (the default value of `n_clusters`). Let's modify this hyperparameter of the `KMeans` model:

```
model_km = KMeans(n_clusters=3)
model_km.fit(X=df_crashes)
model_km.predict(X=df_crashes)
```

```
array([0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 2, 0,
0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 1], dtype=int32)
```

Let's create a new `DataFrame` to keep the original dataset untouched:

```
df_pred = df_crashes.copy()
```

And add the predictions:

```
df_pred['pred_km'] = model_km.predict(X=df_crashes)
df_pred
```

How can we see the groups in the plot?

Can you observe that the k-Means only considers the variable `ins_losses` to determine the group a point belongs to? Why?

```
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
palette='Set1', data=df_pred);
```

The model measures the distance between the points. They seem to be spread around the plot, but they aren't; the plot doesn't place the points in perspective (it's lying to us).

Take a look at the following video to understand how the `KMeans` algorithm computes the Mathematical Equation by **calculating distances**:

The model understands the data as follows:

```
import matplotlib.pyplot as plt
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
palette='Set1', data=df_pred)
plt.xlim(0, 200)
plt.ylim(0, 200);
```

Now it's evident why the model only took `ins_losses` into account: it barely sees significant distances within `alcohol` compared to `ins_losses`.

In other words, with a metaphor: increasing one kilogram of weight is not the same as increasing one metre of height.

Then, how can we create a `KMeans` model that compares the two variables equally?

- We need to scale the data (i.e., transform the values into the same range, from 0 to 1) with the `MinMaxScaler`.

**`MinMaxScaler()` the data**

As with any other algorithm within the `sklearn` library, we need to:

- Import the `Class`
- Create the `instance`
- `fit()` the numbers of the mathematical equation
- `predict`/`transform` the data with the mathematical equation

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_crashes)
data = scaler.transform(df_crashes)
data[:5]
```

```
array([[0.47921847, 0.55636883],
[0.34718769, 0.45684192],
[0.42806394, 0.24636258],
[0.50100651, 0.5323574 ],
[0.20923623, 0.73980184]])
```

To better understand the information, let's convert the `array` into a `DataFrame`:

```
df_scaled = pd.DataFrame(data, columns=df_crashes.columns, index=df_crashes.index)
df_scaled
```

```
model_km.fit(X=df_scaled)
```

```
KMeans(n_clusters=3)
```

We have a fitted `KMeans` again. Therefore, we should be able to apply the mathematical equation to the scaled data to get the predictions:

```
model_km.predict(X=df_scaled)
```

```
array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 1, 2, 2, 1, 1,
0, 0, 1, 0, 1, 0, 0], dtype=int32)
```

```
df_pred['pred_km_scaled'] = model_km.predict(X=df_scaled)
df_pred
```

We can observe now that the model takes both `alcohol` and `ins_losses` into account to calculate the cluster a point belongs to.

```
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
palette='Set1', data=df_pred);
```

From now on, we should understand that every time a model calculates distances between variables of different numerical ranges, we need to scale the data to compare them properly.
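To see the effect in numbers, here is a minimal sketch with two made-up points (not the car-crashes data): before scaling, the Euclidean distance is dominated by the variable with the larger range; after `MinMaxScaler`, both variables contribute equally.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical points: (alcohol-like, ins_losses-like)
a = np.array([2.0, 100.0])
b = np.array([6.0, 110.0])

# Raw Euclidean distance: the 10-unit gap in the second
# variable dominates the 4-unit gap in the first
raw_dist = np.linalg.norm(a - b)  # sqrt(4**2 + 10**2) ≈ 10.77

# Scale both variables to [0, 1], then measure the distance again
scaled = MinMaxScaler().fit_transform(np.array([a, b]))
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])  # sqrt(1 + 1) ≈ 1.41

print(raw_dist, scaled_dist)
```

After scaling, neither variable dominates the distance, which is exactly why the clusters change.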

The following figure gives an overview of everything that has happened so far:

```
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 7))
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
data=df_pred, palette='Set1', ax=ax1);
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
data=df_scaled, palette='Set1', ax=ax2);
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
data=df_pred, palette='Set1', ax=ax3);
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
data=df_scaled, palette='Set1', ax=ax4);
ax3.set_xlim(0, 200)
ax3.set_ylim(0, 200)
ax4.set_xlim(0, 1)
ax4.set_ylim(0, 1)
ax1.set_title('KMeans w/ Original Data & Liar Plot')
ax2.set_title('KMeans w/ Scaled Data & Perspective Plot')
ax3.set_title('KMeans w/ Original Data & Perspective Plot')
ax4.set_title('KMeans w/ Scaled Data & Perspective Plot')
plt.tight_layout()
```

**`Clustering` Models in Python**

Visit the `sklearn` website to see how many different clustering methods there are and how they differ from each other.

Let's **pick two new models** and compute them:

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

```
from sklearn.cluster import AgglomerativeClustering
model_ac = AgglomerativeClustering(n_clusters=3)
model_ac.fit(df_scaled)
```

```
AgglomerativeClustering(n_clusters=3)
```

```
model_ac.fit_predict(X=df_scaled)
```

```
array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 2, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0,
0, 1, 0, 1, 0, 1, 1])
```

```
df_pred['pred_ac'] = model_ac.fit_predict(X=df_scaled)
df_pred
```

We can observe how the second group contains three points with the Agglomerative Clustering, while the KMeans gathers five points in its second group.

As they are different algorithms, they are expected to produce different results. If you'd like to understand which model you should use, you would need to know how each algorithm works. We don't explain that in this series because we want to keep it simple.
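If you want to quantify how much two clusterings agree, scikit-learn offers `adjusted_rand_score`. A small sketch on synthetic blobs (not the car-crashes data, just made-up points): a score of 1.0 means the two models found identical groupings up to a renaming of the labels.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic 2D data with three well-separated groups
X_blobs, _ = make_blobs(n_samples=150, centers=3, random_state=42)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_blobs)
labels_ac = AgglomerativeClustering(n_clusters=3).fit_predict(X_blobs)

# 1.0 = identical groupings (labels may be permuted); lower = disagreement
score = adjusted_rand_score(labels_km, labels_ac)
print(score)
```

The score is label-invariant, so it doesn't matter that one model calls a cluster "0" and the other calls it "2".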

```
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
data=df_pred, palette='Set1', ax=ax1);
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
data=df_pred, palette='Set1', ax=ax2)
ax1.set_title('KMeans')
ax2.set_title('Agglomerative Clustering');
```

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

```
from sklearn.cluster import SpectralClustering
model_sc = SpectralClustering(n_clusters=3)
model_sc.fit(df_scaled)
```

```
SpectralClustering(n_clusters=3)
```

```
model_sc.fit_predict(X=df_scaled)
```

```
array([0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2,
0, 2, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 0, 1, 1, 0, 0,
2, 2, 0, 2, 0, 2, 2], dtype=int32)
```

```
df_pred['pred_sc'] = model_sc.fit_predict(X=df_scaled)
df_pred
```

Let's visualize all models together and appreciate the minor differences, as each one clusters the groups differently.

```
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.set_title('KMeans')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
data=df_pred, palette='Set1', ax=ax1);
ax2.set_title('Agglomerative Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
data=df_pred, palette='Set1', ax=ax2);
ax3.set_title('Spectral Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_sc',
data=df_pred, palette='Set1', ax=ax3);
```

Once again, you don't need to know the maths behind every Machine Learning model to build them. However, I hope you are getting a sense of the patterns behind the Scikit-Learn library with this series of tutorials.

Let's arbitrarily choose the Agglomerative Clustering as our model and get back to you being the President of the USA. How would you describe the groups?

- Higher `ins_losses` and lower `alcohol`
- Lower `ins_losses` and lower `alcohol`
- Lower `ins_losses` and higher `alcohol`
```
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1');
```

You would create different messages in the TV campaigns for the three groups separately, avoiding the deployment of many more resources to develop fifty-one different TV campaigns (one per State), which wouldn't make sense because many of them are similar.

Ask him any doubts on **Twitter** or **LinkedIn**.

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset containing daily records (rows) of Tesla stock transactions (columns) in the Stock Market.

```
import pandas as pd
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tesla = pd.read_csv(url, index_col=0, parse_dates=['Date'])
df_tesla
```

You may calculate the `.mean()` of each column by the last Business day of each Month (`BM`):

```
df_tesla.resample('BM').mean()
```

Or the Weekly Average:

```
df_tesla.resample('W-FRI').mean()
```

And many more; see the full list here.

Pretty straightforward compared to other libraries and programming languages.

It's no coincidence they say Python is the language of the future: its libraries simplify many operations where most people believe they would have needed a `for` loop.

Let's apply other pandas techniques to the DateTime object:

```
df_tesla['year'] = df_tesla.index.year
df_tesla['month'] = df_tesla.index.month
```

The following values represent the average Close price by each month-year combination:

```
df_tesla.pivot_table(index='year', columns='month', values='Close', aggfunc='mean').round(2)
```

We could even style it to get a better insight by colouring the cells:

```
df_stl = df_tesla.pivot_table(
index='year',
columns='month',
values='Close',
aggfunc='mean',
fill_value=0).style.format('{:.2f}').background_gradient(axis=1)
df_stl
```

And they represent the volatility with the standard deviation:

```
df_stl = df_tesla.pivot_table(
index='year',
columns='month',
values='Close',
aggfunc='std',
fill_value=0).style.format('{:.2f}').background_gradient(axis=1)
df_stl
```

In this article, we'll dig into the details of the pandas DateTime-related objects in Python to understand the knowledge required to come up with awesome calculations like the ones we saw above.

First, let's reload the dataset to start from the basics.

```
df_tesla = pd.read_csv(url, parse_dates=['Date'])
df_tesla
```

An essential part of learning something is practice and the understanding of counterexamples, where we learn from the errors.

Let's go with basic thinking to understand the importance of the DateTime object and how to work with it. So, out of all the columns in the DataFrame, we'll now focus on `Date`:

```
df_tesla.Date
```

```
0 2017-01-03
1 2017-01-04
...
1378 2022-06-24
1379 2022-06-27
Name: Date, Length: 1380, dtype: datetime64[ns]
```

What information could we get from a `DateTime` object?

- We may think we can get the month, but it turns out we can't in the following manner:

```
df_tesla.Date.month
```

```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [53], in <cell line: 1>()
----> 1 df_tesla.Date.month
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'month'
```

Programming exists to simplify our lives, not make them harder.

Someone has probably developed a simpler functionality if you think there must be a simpler way to perform a certain operation. Therefore, don't limit programming applications to complex ideas and rush towards a `for` loop, for example; proceed through trial and error without losing hope.

In short, we need to go through the `.dt` accessor to reach the `DateTime` functions:

```
df_tesla.Date.dt
```

```
<pandas.core.indexes.accessors.DatetimeProperties object at 0x16230a2e0>
```

```
df_tesla.Date.dt.month
```

```
0 1
1 1
..
1378 6
1379 6
Name: Date, Length: 1380, dtype: int64
```

We can use more elements than just `.month`:

```
df_tesla.Date.dt.month_name()
```

```
0 January
1 January
...
1378 June
1379 June
Name: Date, Length: 1380, dtype: object
```

```
df_tesla.Date.dt.isocalendar()
```

```
df_tesla.Date.dt.quarter
```

```
0 1
1 1
..
1378 2
1379 2
Name: Date, Length: 1380, dtype: int64
```

```
df_tesla.Date.dt.to_period('M')
```

```
0 2017-01
1 2017-01
...
1378 2022-06
1379 2022-06
Name: Date, Length: 1380, dtype: period[M]
```

```
df_tesla.Date.dt.to_period('W-FRI')
```

```
0 2016-12-31/2017-01-06
1 2016-12-31/2017-01-06
...
1378 2022-06-18/2022-06-24
1379 2022-06-25/2022-07-01
Name: Date, Length: 1380, dtype: period[W-FRI]
```

Pandas contains functionality that allows us to attach Time Zones to these objects, easing the work with data from different countries and regions.

Before getting deeper into Time Zones, we need to set the `Date` as the `index` (rows) of the `DataFrame`:

```
df_tesla.set_index('Date', inplace=True)
df_tesla
```

We can tell Python the `DateTimeIndex` of the `DataFrame` comes from Madrid:

```
df_tesla.index = df_tesla.index.tz_localize('Europe/Madrid')
df_tesla
```

And **change** it to another Time Zone, like **Moscow**:

```
df_tesla.index.tz_convert('Europe/Moscow')
```

```
DatetimeIndex(['2017-01-03 02:00:00+03:00', '2017-01-04 02:00:00+03:00',
'2017-01-05 02:00:00+03:00', '2017-01-06 02:00:00+03:00',
...
'2022-06-22 01:00:00+03:00', '2022-06-23 01:00:00+03:00',
'2022-06-24 01:00:00+03:00', '2022-06-27 01:00:00+03:00'],
dtype='datetime64[ns, Europe/Moscow]', name='Date', length=1380, freq=None)
```

We could have applied the transformation in the `DataFrame` object itself:

```
df_tesla.tz_convert('Europe/Moscow')
```

We can observe the hour has changed accordingly.

The **Pandas Time Zone** functionality is useful for combining timed data from different regions around the globe.

To summarise, for example, the information of daily operations into months, we can apply different functions, each one having its unique ability (it's up to you to select the one that suits your needs):

- `.groupby()`
- `.resample()`
- `.pivot_table()`

Let's show some examples:

```
df_tesla.groupby(by=df_tesla.index.year).Volume.sum()
```

```
Date
2017 7950157000
2018 10808194000
2019 11540242000
2020 19052912400
2021 6902690500
2022 3407576732
Name: Volume, dtype: int64
```

The function `.groupby()` packs the rows of the same year:

```
df_tesla.groupby(by=df_tesla.index.year)
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1622eecd0>
```

To later summarise the total volume in each pack as we saw before.

An easier way?

```
df_tesla.Volume.resample('Y').sum()
```

```
Date
2017-12-31 00:00:00+01:00 7950157000
2018-12-31 00:00:00+01:00 10808194000
2019-12-31 00:00:00+01:00 11540242000
2020-12-31 00:00:00+01:00 19052912400
2021-12-31 00:00:00+01:00 6902690500
2022-12-31 00:00:00+01:00 3407576732
Freq: A-DEC, Name: Volume, dtype: int64
```

We first select the column in which we want to apply the operation:

```
df_tesla.Volume
```

```
Date
2017-01-03 00:00:00+01:00 29616500
2017-01-04 00:00:00+01:00 56067500
...
2022-06-24 00:00:00+02:00 31866500
2022-06-27 00:00:00+02:00 21237332
Name: Volume, Length: 1380, dtype: int64
```

And apply the `.resample()` function, which takes a Date Offset to aggregate the `DateTimeIndex`. In this example, we aggregate by year (`'Y'`):

```
df_tesla.Volume.resample('Y')
```

```
<pandas.core.resample.DatetimeIndexResampler object at 0x16230abe0>
```

And apply mathematical operations to the aggregated objects separately as we saw before:

```
df_tesla.Volume.resample('Y').sum()
```

```
Date
2017-12-31 00:00:00+01:00 7950157000
2018-12-31 00:00:00+01:00 10808194000
2019-12-31 00:00:00+01:00 11540242000
2020-12-31 00:00:00+01:00 19052912400
2021-12-31 00:00:00+01:00 6902690500
2022-12-31 00:00:00+01:00 3407576732
Freq: A-DEC, Name: Volume, dtype: int64
```

We could have also calculated the `.sum()` for all the columns if we hadn't selected just the `Volume`:

```
df_tesla.resample('Y').sum()
```

As always, we should strive to represent the information in the clearest manner for anyone to understand. Therefore, we could even visualize the aggregated volume by year with just two more words:

```
df_tesla.Volume.resample('Y').sum().plot.bar();
```

Let's now try different Date Offsets:

```
df_tesla.Volume.resample('M').sum()
```

```
Date
2017-01-31 00:00:00+01:00 503398000
2017-02-28 00:00:00+01:00 597700000
...
2022-05-31 00:00:00+02:00 649407200
2022-06-30 00:00:00+02:00 572380932
Freq: M, Name: Volume, Length: 66, dtype: int64
```

```
df_tesla.Volume.resample('M').sum().plot.line();
```

```
df_tesla.Volume.resample('W').sum()
```

```
Date
2017-01-08 00:00:00+01:00 142882000
2017-01-15 00:00:00+01:00 105867500
...
2022-06-26 00:00:00+02:00 141234200
2022-07-03 00:00:00+02:00 21237332
Freq: W-SUN, Name: Volume, Length: 287, dtype: int64
```

```
df_tesla.Volume.resample('W').sum().plot.area();
```

```
df_tesla.Volume.resample('W-FRI').sum()
```

```
Date
2017-01-06 00:00:00+01:00 142882000
2017-01-13 00:00:00+01:00 105867500
...
2022-06-24 00:00:00+02:00 141234200
2022-07-01 00:00:00+02:00 21237332
Freq: W-FRI, Name: Volume, Length: 287, dtype: int64
```

```
df_tesla.Volume.resample('W-FRI').sum().plot.line();
```

```
df_tesla.Volume.resample('Q').sum()
```

```
Date
2017-03-31 00:00:00+02:00 1636274500
2017-06-30 00:00:00+02:00 2254740000
...
2022-03-31 00:00:00+02:00 1678802000
2022-06-30 00:00:00+02:00 1728774732
Freq: Q-DEC, Name: Volume, Length: 22, dtype: int64
```

```
df_tesla.Volume.resample('Q').sum().plot.bar();
```

We can also use Pivot Tables to summarise and represent the information more nicely:

```
df_res = df_tesla.pivot_table(
index=df_tesla.index.month,
columns=df_tesla.index.year,
values='Volume',
aggfunc='sum'
)
df_res
```

And even apply some style to get more insight on the DataFrame:

```
df_tesla['Volume_M'] = df_tesla.Volume/1_000_000
dfres = df_tesla.pivot_table(
index=df_tesla.index.month,
columns=df_tesla.index.year,
values='Volume_M',
aggfunc='sum'
)
df_stl = dfres.style.format('{:.2f}').background_gradient('Reds', axis=1)
df_stl
```

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

We have already covered:

- Regression Models
- Classification Models
- Train Test Split for Model Selection

In short, we have computed all possible types of models to predict numerical and categorical variables with Regression and Classification models, respectively.

However, computing one model is not enough; we need to compare different models to choose the one whose predictions are closest to reality.

Nevertheless, we cannot evaluate the model on the same data we used to `.fit()` (train) the mathematical equation (model). Therefore, we need to separate the data into train and test sets: the first to train the model, the latter to evaluate it.

We add an extra layer of complexity because we can improve a model (an algorithm) by configuring its parameters. This chapter is about **computing different combinations of a single model's hyperparameters** to get the best.

The goal of this dataset is:

- To predict if **bank's customers** (rows) will `default` next month
- Based on their **socio-demographical characteristics** (columns)

```
import pandas as pd
pd.set_option("display.max_columns", None)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_credit = pd.read_excel(io=url, header=1, index_col=0)
df_credit.sample(10)
```

The function `.fit()` needs all the cells in the DataFrame to contain a value; NaN means "Not a Number" (i.e., a cell for which we don't have any information). Otherwise, it won't know how to process the row and compare it to others.

```
df_credit.isna().sum()
```

```
LIMIT_BAL 0
SEX 0
..
PAY_AMT6 0
default payment next month 0
Length: 24, dtype: int64
```

```
df_credit.isna().sum().sum()
```

```
0
```

The function `.fit()` needs the values to be numeric. Otherwise, Python won't know the position of the axes in which to allocate the point.

Therefore, categories of the categorical columns will be transformed into new columns (one new column per category) and contain 1s and 0s depending on whether the person is or is not in the category.

Nevertheless, **we don't need to create dummy variables** because the data contains numerical variables only.
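For reference, had there been categorical columns, `pd.get_dummies()` would do this transformation. A minimal sketch on a hypothetical toy DataFrame (the column names below are made up, not part of the credit-card dataset):

```python
import pandas as pd

# Hypothetical toy data: one categorical column, one numerical column
df_toy = pd.DataFrame({'city': ['Madrid', 'Paris', 'Madrid'],
                       'age': [25, 32, 41]})

# One new 0/1 column per category; numerical columns pass through untouched
dummies = pd.get_dummies(df_toy, columns=['city'])
print(dummies)
```

Each row now has a 1 in exactly one of the `city_*` columns, which is what `.fit()` can digest.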

So far, we have used the naming standard of **target** and **features**. Nevertheless, the most common standard on the Internet is **X** and **y**. Let's get used to it:

```
y = df_credit.iloc[:, -1]
X = df_credit.iloc[:, :-1]
```

From the previous chapter, we should already know we need to separate the data into train and test if we want to evaluate the model's predictive capability for data we don't know yet.

In our case, we'd like to predict whether new credit card customers will default next month. As we don't have the data for the next month (it's the future), we need to apply the function `train_test_split()`.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
```

**`DecisionTreeClassifier()` with Default Hyperparameters**

To compute a Machine Learning model with the **default hyperparameters**, we apply the same procedure we have covered in previous chapters:

```
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier()
```

We can see the model is almost perfect at predicting the training data (99% accuracy). Nevertheless, it is terrible at predicting the test data (72% accuracy). This phenomenon tells us that the model is incurring in **overfitting**.

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.9995024875621891
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.7265656565656565
```

I'll use the following visualization to explain the concept of overfitting.

```
from sklearn.tree import plot_tree
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

The tree is big because we have a lot of people (20,100 in the training set), and we haven't set any limit on the model.

How many people do you think we have in the deepest leaf?

- Very few, probably one.

Are these people characteristic of the overall data? Or are they infrequent?

- Because they are infrequent and the model is very complex, we are incurring overfitting, and we get a vast difference between train and test accuracies.

**`DecisionTreeClassifier()` with Custom Hyperparameters**

Which hyperparameters can we configure for the Decision Tree algorithm?

In the output below, we can see parameters we may configure, such as `max_depth`, `criterion` and `min_samples_leaf`, among others.

```
model = DecisionTreeClassifier()
model.get_params()
```

```
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
```

Let's apply different random configurations to see how the model's accuracy changes in the train and test sets.

Please pay attention to how the accuracies become similar when we reduce the model's complexity (we make the tree shorter and more generalized to capture more people in the leaves).

And remember that we should pick a good configuration based on the test accuracy.

```
model_dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
```

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.8186567164179105
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.8215151515151515
```

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

```
model_dt = DecisionTreeClassifier(max_depth=3)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=3)
```

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.8207960199004976
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.8222222222222222
```

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

```
model_dt = DecisionTreeClassifier(max_depth=4)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=4)
```

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.8232338308457712
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.8205050505050505
```

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

```
model_dt = DecisionTreeClassifier(min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(min_samples_leaf=100)
```

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.8244278606965174
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.8161616161616162
```

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

```
model_dt = DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
```

Accuracy on `train` data:

```
model_dt.score(X_train, y_train)
```

```
0.8237313432835821
```

Accuracy on `test` data:

```
model_dt.score(X_test, y_test)
```

```
0.8177777777777778
```

```
plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

We get similar results; the accuracy is around 82% on the test set when we configure a general model that doesn't have a considerable depth (unlike the first one).

But we should ask ourselves another question: can we automate this process of checking multiple combinations of hyperparameters?

- Yes, and that's where **Cross Validation** comes in.

**`GridSearchCV()` to Find the Best Hyperparameters**

The Cross-Validation technique splits the training data into n folds (5 in the image below). Then, it computes each hyperparameter configuration n times, where each fold is taken as a test set once.

Consider that we `.fit()` a model as many times as the number of folds multiplied by the number of combinations we want to try.
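To see the per-fold mechanism in isolation, scikit-learn's `cross_val_score` evaluates a single configuration once per fold. A small sketch using a dataset bundled with scikit-learn as a stand-in for the credit-card data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled dataset as a stand-in for the credit-card data
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# One hyperparameter configuration evaluated on 5 folds:
# each fold serves as the validation set exactly once
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X_demo, y_demo, cv=5)
print(scores.mean())
```

`GridSearchCV` repeats exactly this for every combination in the grid and keeps the configuration with the best average score.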

Out of the Decision Tree hyperparameters:

```
model_dt = DecisionTreeClassifier()
model_dt.get_params()
```

```
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
```

We want to try the following combinations of `max_depth` (6 values), `min_samples_leaf` (7 values) and `criterion` (2 values):

```
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
'criterion': ['gini', 'entropy']
}
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=1)
```

That makes up to 420 computations of the `.fit()` function:

```
5*6*7*2
```

```
420
```

To compare 84 different combinations of the Decision Tree hyperparameters:

```
6*7*2
```

```
84
```

```
cv_dt.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 84 candidates, totalling 420 fits
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]},
verbose=1)
```

If we specify `verbose=2`, we will see how many fits we perform in the output:

```
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=2)
cv_dt.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 84 candidates, totalling 420 fits
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
...
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [None, 2, 3, 4, 5, 10],
'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]},
verbose=2)
```

The best hyperparameter configuration is:

```
cv_dt.best_estimator_
```

```
DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=100)
```

To achieve an accuracy on the test set of:

```
cv_dt.score(X_test, y_test)
```

```
0.8186868686868687
```

If we'd like to have the results of every configuration:

```
df_cv_dt = pd.DataFrame(cv_dt.cv_results_)
df_cv_dt
```
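`cv_results_` is a plain dictionary of arrays, so a handy pattern is sorting the resulting DataFrame by `rank_test_score` to see the best configurations first. Sketched here on a toy search, since `X_train`/`y_train` belong to the tutorial's dataset:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2, 3]}, cv=5).fit(X, y)

# cv_results_ is a dict of arrays; as a DataFrame it can be sorted by rank
df_results = pd.DataFrame(search.cv_results_)
top = df_results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(top)
```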

Now let's find the best hyperparameter configuration for other models. They don't have the same hyperparameters as the Decision Tree because their algorithms and mathematical equations are different.

`SVC()`

Before computing the Support Vector Machines model, we need to scale the data because this model compares distances between observations; therefore, all the explanatory variables need to be on the same scale.

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```
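After `MinMaxScaler`, every column lies in [0, 1], which is what puts the features on the same scale. A minimal check on made-up data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy features on very different scales
X = pd.DataFrame({'age': [18, 35, 70], 'income': [1000, 5000, 9000]})
X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
print(X_norm)  # every column now ranges from 0.0 to 1.0
```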

We need to separate the data again to have the train and test sets with the scaled data:

```
from sklearn.model_selection import train_test_split

X_norm_train, X_norm_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42)
```

The Support Vector Machines model has the following hyperparameters:

```
from sklearn.svm import SVC
sv = SVC()
sv.get_params()
```

```
{'C': 1.0,
'break_ties': False,
'cache_size': 200,
'class_weight': None,
'coef0': 0.0,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 'scale',
'kernel': 'rbf',
'max_iter': -1,
'probability': False,
'random_state': None,
'shrinking': True,
'tol': 0.001,
'verbose': False}
```

From which we want to try the following combinations:

```
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}
cv_sv = GridSearchCV(estimator=sv, param_grid=param_grid, verbose=2)
cv_sv.fit(X_norm_train, y_train)
```

```
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
...
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
verbose=2)
```

We should notice that some fits take almost 5 seconds, which becomes very expensive if we want to try thousands of combinations (as professionals do). Therefore, we should understand how the model's algorithm works inside so we can choose a hyperparameter grid that doesn't waste time; otherwise, we make the company spend a lot of money on computing power.
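One common way to cap that cost, shown here as a sketch rather than as part of this tutorial, is `RandomizedSearchCV`, which samples only `n_iter` configurations instead of trying them all:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # a toy dataset, not the tutorial's

# 5 * 2 = 10 possible combinations, but we only sample 4 of them
param_distributions = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=4, cv=5, random_state=0)
search.fit(X, y)
print(len(search.cv_results_['params']))  # 4 candidates -> 20 fits instead of 50
```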

This tutorial dissects how the Support Vector Machines algorithm works inside.

The best hyperparameter configuration is:

```
cv_sv.best_estimator_
```

```
SVC(C=10)
```

To achieve an accuracy on the test set of:

```
cv_sv.score(X_norm_test, y_test)
```

```
0.8185858585858586
```

If we'd like to have the results of every configuration:

```
df_cv_sv = pd.DataFrame(cv_sv.cv_results_)
df_cv_sv
```

`KNeighborsClassifier()`

Now we'll compute another classification model: K Nearest Neighbours.

We check for its hyperparameters:

```
from sklearn.neighbors import KNeighborsClassifier
model_kn = KNeighborsClassifier()
model_kn.get_params()
```

```
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 5,
'p': 2,
'weights': 'uniform'}
```

From which we try the following combinations:

```
param_grid = {
    'leaf_size': [10, 20, 30, 50],
    'metric': ['minkowski', 'euclidean', 'manhattan'],
    'n_neighbors': [3, 5, 10, 20]
}
cv_kn = GridSearchCV(estimator=model_kn, param_grid=param_grid, verbose=2)
cv_kn.fit(X_norm_train, y_train)
```

```
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.5s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.3s
...
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s
GridSearchCV(estimator=KNeighborsClassifier(),
param_grid={'leaf_size': [10, 20, 30, 50],
'metric': ['minkowski', 'euclidean', 'manhattan'],
'n_neighbors': [3, 5, 10, 20]},
verbose=2)
```

The best hyperparameter configuration is:

```
cv_kn.best_estimator_
```

```
KNeighborsClassifier(leaf_size=10, n_neighbors=20)
```

To achieve an accuracy on the test set of:

```
cv_kn.score(X_norm_test, y_test)
```

```
0.8185858585858586
```

If we'd like to have the results of every configuration:

```
df_cv_kn = pd.DataFrame(cv_kn.cv_results_)
df_cv_kn
```

Comparing the three models at their best, the Decision Tree Classifier gets the highest test score:

```
dic_results = {
    'model': [
        cv_dt.best_estimator_,
        cv_sv.best_estimator_,
        cv_kn.best_estimator_
    ],
    'hyperparameters': [
        cv_dt.best_params_,
        cv_sv.best_params_,
        cv_kn.best_params_
    ],
    'score': [
        cv_dt.score(X_test, y_test),
        cv_sv.score(X_norm_test, y_test),
        cv_kn.score(X_norm_test, y_test)
    ]
}
df_cv_comp = pd.DataFrame(dic_results)
df_cv_comp.style.background_gradient()
```


Machine Learning models learn a mathematical equation from historical data.

Not all Machine Learning models predict the same way; some models are better than others.

We measure how good a model is by calculating its score (accuracy).

So far, we have calculated the model's score using the same data to fit (train) the mathematical equation. That's cheating. That's overfitting.

This tutorial compares 3 different models:

- Decision Tree
- Logistic Regression
- Support Vector Machines

We validate the models in 2 different ways:

- Using the same data during training
- Using 30% of the data; not used during training

This demonstrates how the selection of the best model changes when we validate it with data not used during training.

For example, the image below shows that the best model, when using the same data for validation, is the Decision Tree (accuracy of 0.86). Nevertheless, everything changes when the models are evaluated with data not used during training: the best model becomes the Logistic Regression (accuracy of 0.85), whereas the Decision Tree only gets up to 0.80.

If we were a bank that loses 1M USD for every 0.01 drop in accuracy, choosing the wrong model would have cost us 5M USD. This is something that happens in real life.

In short, banks are interested in good models to predict new potential customers, not the historical customers who have already gotten a loan and whom the bank already knows were good payers or not.

This tutorial shows you how to implement the `train_test_split` technique to reduce overfitting, with a practical use case where we want to classify whether a person uses the Internet or not.

Load the dataset from CIS, executing the following lines of code:

```
import pandas as pd
df_internet = pd.read_excel('https://github.com/jsulopzs/data/blob/main/internet_usage_spain.xlsx?raw=true', sheet_name=1, index_col=0)
df_internet
```

- The goal of this dataset is to predict the `internet_usage` of **people** (rows)
- Based on their **socio-demographical characteristics** (columns)

We should already know from the previous chapter that the data might be preprocessed before passing it to the function that computes the mathematical equation.

The function `.fit()` needs all the cells in the DataFrame to contain a value; `NaN` means "Not a Number" (i.e., a cell for which we don't have any information). Otherwise, it won't know how to process the row and compare it to others.

For example, if John's age is missing, you cannot place John in the space to compare him with other people because his point might be anywhere.

```
df_internet.isna().sum()
```

```
internet_usage 0
sex 0
age 0
education 0
dtype: int64
```

The function `.fit()` also needs the values to be numeric. Otherwise, Python won't know at which position of the axes to allocate the point. For example, if you have *Male* and *Female*, at which distance do you separate them, and why? You cannot make an objective assessment unless you separate each category.

Therefore, the categories of the categorical columns will be transformed into new columns (one new column per category) containing 1s and 0s depending on whether the person belongs to the category.

```
df_internet = pd.get_dummies(df_internet, drop_first=True)
df_internet
```
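On a tiny illustrative DataFrame (not the tutorial's), `pd.get_dummies` with `drop_first=True` turns a two-category column into a single 0/1 column:

```python
import pandas as pd

# a tiny illustrative DataFrame with one categorical column
df = pd.DataFrame({'sex': ['Male', 'Female', 'Male'], 'age': [30, 25, 41]})
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())  # ['age', 'sex_Male']
```

`drop_first=True` drops the first category alphabetically (here *Female*), since it is fully implied by the remaining column.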

Once we have preprocessed the data, we select the column we want to predict (target) and the columns we will use to explain the prediction (features/explanatory).

```
target = df_internet.internet_usage
features = df_internet.drop(columns='internet_usage')
```

We should already know that the Machine Learning procedure is the same all the time:

- Computing a mathematical equation: **fit**
- To calculate predictions: **predict**
- And compare them to reality: **score**

The only element that changes is the `Class()` that contains the lines of code of a specific algorithm (DecisionTreeClassifier, SVC, LogisticRegression).
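The three-step pattern is identical for any scikit-learn classifier; here it is sketched on a toy dataset (scikit-learn's iris flowers, not the tutorial's internet-usage data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # a toy dataset, not the tutorial's

model = DecisionTreeClassifier()  # the Class() with the algorithm
model.fit(X=X, y=y)               # fit: compute the mathematical equation
preds = model.predict(X=X)        # predict: calculate predictions
acc = model.score(X=X, y=y)       # score: compare predictions to reality
print(acc)
```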

`DecisionTreeClassifier()` Model in Python

```
from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

```
0.859877800407332
```

`SVC()` Model in Python

```
from sklearn.svm import SVC
model_svc = SVC(probability=True)
model_svc.fit(X=features, y=target)
model_svc.score(X=features, y=target)
```

```
0.7837067209775967
```

`LogisticRegression()` Model in Python

```
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X=features, y=target)
model_lr.score(X=features, y=target)
```

```
0.8334012219959267
```

- We repeated the same code all the time:

```
model.fit()
model.score()
```

- Why not turn those lines into a `function()` to **automate the process**?

```
calculate_accuracy(model_dt)
calculate_accuracy(model_sv)
calculate_accuracy(model_lr)
```

- To calculate the `accuracy` of the `DecisionTreeClassifier()`:

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

```
0.859877800407332
```

`function()` **Code Thinking**

- Think of the function's `result`
- Store that `object` in a variable
- `return` the `result` at the end
- **Indent the body** of the function to the right
- `def`ine the `function():`
- Think of what's going to change when you execute the function with `different models`
- Locate the `variable` that you will change
- Turn it into the `parameter` of the `function()`

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

```
0.859877800407332
```

Take the `result` you want and put it into a variable:

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target) #new
```

Use `return` to tell the function the object you want in the end:

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)
return result #new
```

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)
return result
```

```
def calculate_accuracy(): #new
    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result
```

```
def calculate_accuracy(model_dt): #modified
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result
```

```
def calculate_accuracy(model): #modified
    model.fit(X=features, y=target) #modified
    result = model.score(X=features, y=target)
    return result
```

```
def calculate_accuracy(model):
    """
    This function calculates the accuracy for a given model as a parameter #modified
    """
    model.fit(X=features, y=target)
    result = model.score(X=features, y=target)
    return result
```

```
calculate_accuracy(model_dt)
```

```
0.859877800407332
```

`DecisionTreeClassifier()` Accuracy

```
calculate_accuracy(model_dt)
```

```
0.859877800407332
```

We shall create an empty dictionary that keeps track of every model's score to choose the best one later.

```
dic_accuracy = {}
dic_accuracy['Decision Tree'] = calculate_accuracy(model_dt)
```

`SVC()` Accuracy

```
dic_accuracy['Support Vector Machines'] = calculate_accuracy(model_svc)
dic_accuracy
```

```
{'Decision Tree': 0.859877800407332,
'Support Vector Machines': 0.7837067209775967}
```

`LogisticRegression()` Accuracy

```
dic_accuracy['Logistic Regression'] = calculate_accuracy(model_lr)
dic_accuracy
```

```
{'Decision Tree': 0.859877800407332,
'Support Vector Machines': 0.7837067209775967,
'Logistic Regression': 0.8334012219959267}
```

The Decision Tree is the best model, with a score of 86%:

```
sr_accuracy = pd.Series(dic_accuracy).sort_values(ascending=False)
sr_accuracy
```

```
Decision Tree 0.859878
Logistic Regression 0.833401
Support Vector Machines 0.783707
dtype: float64
```

Let's suppose for a moment we are a bank, to understand the importance of this chapter. A bank's business is, among other things, to give loans to people who can afford them.

The bank can make mistakes, though: giving loans to people who cannot afford them, or denying loans to people who can.

Let's imagine the bank loses $1M for each 1% of misclassification. As we chose the Decision Tree, whose score is about 86%, the bank would lose around $14M. Nevertheless, can we trust that score?

No, because we are cheating in the model's evaluation; we evaluated the models with the same data used for training. In other words, the bank is not interested in evaluating the model on historical customers; it wants to know how good the model is for new customers.

The bank cannot create new customers. What can it do then?

It separates the data into a train set (70% of customers) used to `.fit()` the mathematical equation and a test set (30% of customers) used to evaluate it.

You can understand the problem better with the following analogy:

Let's **imagine**:

- You have a `math exam` on Saturday
- Today is Monday
- You want to **calibrate your level in case you need to study more** for the math exam
- How do you calibrate your `math level`?
- Well, you've got **100 questions** from past years' exams `X` with 100 solutions `y`
- You may study the 100 questions with their 100 solutions: `fit(100questions, 100solutions)`
- Then, you may do a `mock exam` with the 100 questions: `predict(100questions)`
- And compare `your_100solutions` with the `real_100solutions`
- You've got **90/100 correct answers** (`accuracy`) in the mock exam
- You think you are **prepared for the maths exam**
- And when you do **the real exam on Saturday, the mark is 40/100**
- Why? How could we have prevented this?

**Solution**: separate the 100 questions into `70 for train` to study & `30 for test` for the mock exam:

- fit(70questions, 70answers)
- your_30solutions = predict(30questions)
- your_30solutions ?= 30solutions
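The analogy maps directly onto `train_test_split`; a sketch with 100 hypothetical questions and toy labels:

```python
from sklearn.model_selection import train_test_split

# 100 hypothetical past-exam questions (X) and their solutions (y)
questions = list(range(100))
solutions = [q % 2 for q in questions]

q_train, q_test, s_train, s_test = train_test_split(
    questions, solutions, test_size=0.30, random_state=42)

print(len(q_train), len(q_test))  # 70 to study with, 30 for the mock exam
```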

`train_test_split()` the Data

- The documentation of the function contains a typical example.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=42)
```

From all the data:

- 2455 rows
- 8 columns

```
df_internet
```

- 1718 rows (70% of all data) to fit the model
- 7 columns (X: features variables)

```
X_train
```

- 737 rows (30% of all data) to evaluate the model
- 7 columns (X: features variables)

```
X_test
```

- 1718 rows (70% of all data) to fit the model
- 1 column (y: target variable)

```
y_train
```

```
name
Eileen 0
Lucinda 1
..
Corey 0
Robert 1
Name: internet_usage, Length: 1718, dtype: int64
```

- 737 rows (30% of all data) to evaluate the model
- 1 column (y: target variable)

```
y_test
```

```
name
Thomas 0
Pedro 1
..
William 1
Charles 1
Name: internet_usage, Length: 737, dtype: int64
```

`fit()` the model with Train Data

```
model_dt.fit(X_train, y_train)
```

```
DecisionTreeClassifier()
```

```
model_dt.score(X_test, y_test)
```

```
0.8046132971506106
```

`DecisionTreeClassifier()`

```
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
model_dt.score(X_test, y_test)
```

```
0.8032564450474898
```

`function()` **Code Thinking**

- Think of the function's `result`
- Store that `object` in a variable
- `return` the `result` at the end
- **Indent the body** of the function to the right
- `def`ine the `function():`
- Think of what's going to change when you execute the function with `different models`
- Locate the `variable` that you will change
- Turn it into the `parameter` of the `function()`

```
def calculate_accuracy_test(model):
    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)
    return result
```

`DecisionTreeClassifier()` Accuracy

```
dic_accuracy_test = {}
dic_accuracy_test['Decision Tree'] = calculate_accuracy_test(model_dt)
dic_accuracy_test
```

```
{'Decision Tree': 0.8032564450474898}
```

`SVC()` Accuracy

```
dic_accuracy_test['Support Vector Machines'] = calculate_accuracy_test(model_svc)
dic_accuracy_test
```

```
{'Decision Tree': 0.8032564450474898,
'Support Vector Machines': 0.7788331071913162}
```

`LogisticRegression()` Accuracy

```
dic_accuracy_test['Logistic Regression'] = calculate_accuracy_test(model_lr)
dic_accuracy_test
```

```
{'Decision Tree': 0.8032564450474898,
'Support Vector Machines': 0.7788331071913162,
'Logistic Regression': 0.8548168249660787}
```

What changed after `train_test_split()`?

The picture changes quite a lot: the bank would lose $20M with the model we chose before, the Decision Tree, because its score on data that hasn't been seen during training (i.e., new customers) is 80%.

We should have chosen the Logistic Regression because it's the best model (85%) to predict new data and new customers.

In short, we lose $15M if we choose the Logistic Regression, which is better than the Decision Tree's loss of $20M. Those $5M can make a difference in my life 👀

```
sr_accuracy_test = pd.Series(dic_accuracy_test).sort_values(ascending=False)
sr_accuracy_test
```

```
Logistic Regression 0.854817
Decision Tree 0.803256
Support Vector Machines 0.778833
dtype: float64
```

```
df_accuracy = pd.DataFrame({
'Same Data': sr_accuracy,
'Test Data': sr_accuracy_test
})
df_accuracy.style.format('{:.2f}').background_gradient()
```

Look at the following example as an aspiration of what you can achieve if you fully understand and replicate this whole tutorial with your own data.

Let's load a dataset that contains information from countries (rows) considering socio-demographic and economic variables (columns).

```
import plotly.express as px
df_countries = px.data.gapminder()
df_countries
```

Python contains 3 main libraries for Data Visualization:

- **Matplotlib** (Mathematical Plotting)
- **Seaborn** (High-Level, based on Matplotlib)
- **Plotly** (Animated Plots)

I love `plotly` because its visualizations are interactive; you may hover the mouse over the points to get information from them:

```
df_countries_2007 = df_countries.query('year == 2007')
px.scatter(data_frame=df_countries_2007, x='gdpPercap', y='lifeExp',
color='continent', hover_name='country', size='pop')
```

You can even animate the plots with a simple parameter. Click on play

PS: The following example is taken from the official plotly library website:

```
px.scatter(df_countries, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
size="pop", color="continent", hover_name="country",
log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])
```

In this article, we'll dig into the details of Data Visualization in Python to build up the required knowledge and develop awesome visualizations like the ones we saw before.

Matplotlib is a library used for Data Visualization.

We use the **sublibrary** (module) `pyplot` from the `matplotlib` library to access the plotting functions.

```
import matplotlib.pyplot as plt
```

Let's make a bar plot:

```
plt.bar(x=['Real Madrid', 'Barcelona', 'Bayern Munich'],
height=[14, 5, 6]);
```

We could have also done a point plot:

```
plt.scatter(x=['Real Madrid', 'Barcelona', 'Bayern Munich'],
y=[14, 5, 6]);
```

But it doesn't make sense with the data we have represented.

Let's create a DataFrame:

```
teams = ['Real Madrid', 'Barcelona', 'Bayern Munich']
uefa_champions = [14, 5, 6]
import pandas as pd
df_champions = pd.DataFrame(data={'Team': teams,
'UEFA Champions': uefa_champions})
df_champions
```

And visualize it using:

```
plt.bar(x=df_champions['Team'],
height=df_champions['UEFA Champions']);
```

```
df_champions.plot.bar(x='Team', y='UEFA Champions');
```

Let's read another dataset: the Football Premier League classification for 2021/2022.

```
df_premier = pd.read_excel(io='../data/premier_league.xlsx')
df_premier
```

We will visualize a point plot, from now on called a **scatter plot**, to check if there is a relationship between the number of goals scored `F` and the points `Pts`.

```
import seaborn as sns
sns.scatterplot(x='F', y='Pts', data=df_premier);
```

Can we do the same plot with the matplotlib `plt` library?

```
plt.scatter(x='F', y='Pts', data=df_premier);
```

What are the differences between them?

- The points: `matplotlib` points are bigger than `seaborn` ones
- The axis labels: `matplotlib` axis labels are non-existent, whereas `seaborn` places the names of the columns

From which library do the previous functions return the objects?

```
seaborn_plot = sns.scatterplot(x='F', y='Pts', data=df_premier);
```

```
matplotlib_plot = plt.scatter(x='F', y='Pts', data=df_premier);
```

```
type(seaborn_plot)
```

```
matplotlib.axes._subplots.AxesSubplot
```

```
type(matplotlib_plot)
```

```
matplotlib.collections.PathCollection
```

Why does `seaborn` return a `matplotlib` object?

Quoted from the seaborn official website:

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level* interface for drawing attractive and informative statistical graphics.

*High-level means the communication between humans and the computer is easier to understand than low-level communication, which goes through 0s and 1s.

Could you place the names of the teams in the points?

```
plt.scatter(x='F', y='Pts', data=df_premier)
for idx, data in df_premier.iterrows():
    plt.text(x=data['F'], y=data['Pts'], s=data['Team'])
```

It isn't straightforward.

Is there an easier way?

Yes, you may use an interactive plot with the `plotly` library and display the name of the team as you hover the mouse over a point.

We use the `express` module within the `plotly` library to access the plotting functions:

```
import plotly.express as px
px.scatter(data_frame=df_premier, x='F', y='Pts', hover_name='Team')
```

Let's read another dataset: the sociological data of clients in a restaurant.

```
df_tips = sns.load_dataset(name='tips')
df_tips
```

```
df_tips.sex
```

```
0 Female
1 Male
...
242 Male
243 Female
Name: sex, Length: 244, dtype: category
Categories (2, object): ['Male', 'Female']
```

We need to summarise the data first; we count how many `Female` and `Male` people are in the dataset.

```
df_tips.sex.value_counts()
```

```
Male 157
Female 87
Name: sex, dtype: int64
```

```
sr_sex = df_tips.sex.value_counts()
```

Let's place bars equal to the number of people from each gender:

```
px.bar(x=sr_sex.index, y=sr_sex.values)
```

We can also colour the bars based on the category:

```
px.bar(x=sr_sex.index, y=sr_sex.values, color=sr_sex.index)
```

Let's put the same data into a pie plot:

```
px.pie(names=sr_sex.index, values=sr_sex.values, color=sr_sex.index)
```

```
df_tips.total_bill
```

```
0 16.99
1 10.34
...
242 17.82
243 18.78
Name: total_bill, Length: 244, dtype: float64
```

Instead of observing the numbers, we can visualize the distribution of the bills in a **histogram**.

We can observe that most people pay between 10 and 20 dollars, whereas a few pay between 40 and 50.

```
px.histogram(x=df_tips.total_bill)
```

We can also create a **boxplot** where the limits of the boxes indicate the 1st and 3rd quartiles.

The 1st quartile is 13.325, and the 3rd quartile is 24.175. Therefore, 50% of people were billed an amount between these limits.

```
px.box(x=df_tips.total_bill)
```
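The quartile limits the boxplot draws can be computed directly with pandas. Shown here on a small synthetic series standing in for `df_tips.total_bill`, since the exact quartile values depend on the interpolation method the plotting library uses:

```python
import pandas as pd

# a small synthetic series of bills, standing in for df_tips.total_bill
bills = pd.Series([10.0, 12.5, 15.0, 20.0, 25.0, 30.0, 45.0])
q1, q3 = bills.quantile([0.25, 0.75])
print(q1, q3)  # 50% of the bills fall between these two limits
```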

```
df_tips[['total_bill', 'tip']]
```

We use a scatter plot to see if a relationship exists between two numerical variables.

Do the points go up as you move the eyes from left to right?

As you may observe in the following plot: the higher the amount of the bill, the higher the tip the clients leave for the staff.

```
px.scatter(x='total_bill', y='tip', data_frame=df_tips)
```

Another type of visualization for 2 continuous variables:

```
px.density_contour(x='total_bill', y='tip', data_frame=df_tips)
```

```
df_tips[['day', 'total_bill']]
```

We can summarise the data by how much revenue was generated on each day of the week.

```
df_tips.groupby('day').total_bill.sum()
```

```
day
Thur 1096.33
Fri 325.88
Sat 1778.40
Sun 1627.16
Name: total_bill, dtype: float64
```

```
sr_days = df_tips.groupby('day').total_bill.sum()
```

We can observe that Saturday is the most profitable day as people have spent more money.

```
px.bar(x=sr_days.index, y=sr_days.values)
```

```
px.bar(x=sr_days.index, y=sr_days.values, color=sr_days.index)
```

```
df_tips[['day', 'size']]
```

Which combination of day-size is the most frequent table you can observe in the restaurant?

The following plot shows that Saturdays with 2 people at the table is the most common phenomenon at the restaurant.

They could create an advertisement that targets couples to have dinner on Saturdays and make more money.

```
px.density_heatmap(x='day', y='size', data_frame=df_tips)
```

The following examples are taken directly from plotly.

```
df_gapminder = px.data.gapminder()
px.scatter_geo(df_gapminder, locations="iso_alpha", color="continent",
               hover_name="country", size="pop",
               animation_frame="year",
               projection="natural earth")
```

```
import plotly.express as px
df = px.data.election()
geojson = px.data.election_geojson()
fig = px.choropleth_mapbox(df, geojson=geojson, color="Bergeron",
locations="district", featureidkey="properties.district",
center={"lat": 45.5517, "lon": -73.7073},
mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
```

```
import plotly.express as px
df = px.data.election()
geojson = px.data.election_geojson()
fig = px.choropleth_mapbox(df, geojson=geojson, color="winner",
locations="district", featureidkey="properties.district",
center={"lat": 45.5517, "lon": -73.7073},
mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
```

Machine Learning is all about calculating the best numbers of a mathematical equation by minimising the distance between real data and predictions.

The form of a Linear Regression mathematical equation is as follows:

$$ y = (a) + (b) \cdot x $$

As we see in the following plot, **not any mathematical equation is valid**; the red line doesn't fit the real data (blue points), whereas the green one is the best.
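Numerically, fitting means choosing the `(a)` and `(b)` that minimise the distance to the blue points; a sketch with made-up coefficients shows how predictions come out of the equation:

```python
# hypothetical coefficients: intercept (a) and slope (b)
a, b = 2.0, 3.0

def predict(x):
    # the Linear Regression equation: y = a + b * x
    return a + b * x

print(predict(0))  # 2.0, the intercept
print(predict(4))  # 14.0
```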

How do we understand the development of Machine Learning models in Python **to predict what may happen in the future**?

This tutorial covers the topics described below using **USA Car Crashes data** to predict the accidents based on alcohol.

- Step-by-step procedure to compute a Linear Regression:
    - `.fit()` the numbers of the mathematical equation
    - `.predict()` the future with the mathematical equation
    - `.score()` how good the mathematical equation is
- How to **visualise** the Linear Regression model?
- How to **evaluate** Regression models step by step?
    - Residuals Sum of Squares
    - Total Sum of Squares
    - R Squared Ratio $R^2$
- How to **interpret** the coefficients of the Linear Regression?
- Compare the Linear Regression to other Machine Learning models such as:
    - Random Forest
    - Support Vector Machines
- Why **we don't need to know the maths** behind every model to apply Machine Learning in Python?

- This dataset contains **statistics about Car Accidents** (columns)
- In each one of the **USA States** (rows)

Visit this website if you want to know the measures of the columns.

```
import seaborn as sns
df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'total']]
df_crashes.rename({'total': 'accidents'}, axis=1, inplace=True)
df_crashes
```

- As always, we need to use a function

Where is the function?

- It should be in a library

Which is the Python library for Machine Learning?

- Sci-Kit Learn, see website

How can we access the function to compute a Linear Regression model?

- We need to import the `LinearRegression` class from the `linear_model` module:

```
from sklearn.linear_model import LinearRegression
```

- Now, we create an instance `model_lr` of the class `LinearRegression`:

```
model_lr = LinearRegression()
```

Which function applies the Linear Regression **algorithm** in which the **Residual Sum of Squares is minimised**?

```
model_lr.fit()
```

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [186], in <cell line: 1>()
----> 1 model_lr.fit()
TypeError: fit() missing 2 required positional arguments: 'X' and 'y'
```

Why is it asking for two parameters, `X` and `y`?

The algorithm must distinguish between the variable we want to predict (`y`) and the variables used to explain (`X`) the prediction.

- `y`: target ~ dependent ~ label ~ class variable
- `X`: features ~ independent ~ explanatory variables

```
target = df_crashes['accidents']
features = df_crashes[['alcohol']]
```

```
model_lr.fit(X=features, y=target)
```

```
LinearRegression()
```

Take the historical data:

```
features
```

To calculate predictions through the Model's Mathematical Equation:

```
model_lr.predict(X=features)
```

```
array([17.32111171, 15.05486718, 16.44306899, 17.69509287, 12.68699734,
13.59756016, 13.76016066, 15.73575679, 9.0955587 , 16.40851638,
13.78455074, 20.44100889, 14.87600663, 14.70324359, 14.40446516,
13.8353634 , 14.54064309, 15.86177218, 19.6076813 , 15.06502971,
13.98780137, 11.69106925, 13.88211104, 11.5162737 , 16.94713055,
16.98371566, 24.99585551, 16.45729653, 15.41868581, 12.93089809,
12.23171592, 15.95526747, 13.10772614, 16.44306899, 26.26007443,
15.60161138, 17.58737003, 12.62195713, 17.32517672, 14.43088774,
25.77430543, 18.86988151, 17.3515993 , 20.84141263, 9.53254755,
14.15040187, 12.82724027, 12.96748321, 19.40239816, 15.11380986,
17.17477126])
```

Can you see the difference between reality and prediction?

- Model predictions aren't perfect; they don't reproduce the real data exactly. Nevertheless, they make a fair approximation that allows decision-makers to understand the future better.

```
df_crashes['pred_lr'] = model_lr.predict(X=features)
df_crashes
```

The orange dots reference the predictions lined up in a line because the Linear Regression model calculates the best coefficients (numbers) for a line's mathematical equation based on historical data.

```
import matplotlib.pyplot as plt
```

```
sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes);
```

We have orange dots for the alcohol values represented in our `DataFrame`. Were we to make estimations for all possible alcohol values, we'd get a **sequence of consecutive points**, which represents a line. Let's draw it with the `.lineplot()` function:

```
sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes);
sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange');
```

To measure the quality of the model, we use the `.score()` function, which calculates the difference between the model's predictions and reality.

```
model_lr.score(X=features, y=target)
```

```
0.7269492966665405
```

The step-by-step procedure of the previous calculation starts with the difference between reality and predictions:

```
df_crashes['accidents'] - df_crashes['pred_lr']
```

```
abbrev
AL 1.478888
AK 3.045133
...
WI -1.313810
WY 0.225229
Length: 51, dtype: float64
```

This difference is usually called **residuals**:

```
df_crashes['residuals'] = df_crashes['accidents'] - df_crashes['pred_lr']
df_crashes
```

We cannot use all the residuals to tell how good our model is. Therefore, we need to add them up:

```
df_crashes.residuals.sum()
```

```
1.4033219031261979e-13
```

Let's round to two decimal points to suppress the scientific notation:

```
df_crashes.residuals.sum().round(2)
```

```
0.0
```

But we get ZERO. Why?

The residuals contain positive and negative numbers; some points are above the line, and others are below the line.
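This cancellation is a general property of least-squares lines fitted with an intercept, not a quirk of this dataset. A quick self-contained sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Fit a straight line by least squares and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# The positive and negative residuals cancel out (up to floating-point noise)
total = residuals.sum()
```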

To turn negative values into positive values, we square the residuals:

```
df_crashes['residuals^2'] = df_crashes.residuals**2
df_crashes
```

And finally, add the residuals up to calculate the **Residual Sum of Squares (RSS)**:

```
df_crashes['residuals^2'].sum()
```

```
231.96888653310063
```

```
RSS = df_crashes['residuals^2'].sum()
```

$$ RSS = \sum(y_i - \hat{y})^2 $$

where

- $y_i$ is the real number of accidents
- $\hat y$ is the predicted number of accidents
- RSS: Residual Sum of Squares
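The formula maps one-to-one onto the pandas code above. A tiny self-contained check with made-up numbers:

```python
import pandas as pd

real = pd.Series([18.8, 18.1, 13.8])  # y_i: the real values
pred = pd.Series([17.3, 15.1, 14.0])  # y-hat: the predictions

# RSS = sum of squared differences: 1.5**2 + 3.0**2 + (-0.2)**2
rss = ((real - pred) ** 2).sum()
```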

The model was made to predict the number of accidents.

We should ask: how does the variation of the model's predictions compare to the variation of the real data (the real number of accidents)?

We have already calculated the variation of the model's predictions. Now we calculate the variation of the real data by comparing each accident value to the average:

```
df_crashes.accidents
```

```
abbrev
AL 18.8
AK 18.1
...
WI 13.8
WY 17.4
Name: accidents, Length: 51, dtype: float64
```

```
df_crashes.accidents.mean()
```

```
15.79019607843137
```

$$ y_i - \bar y $$

where $y_i$ is the number of accidents and $\bar y$ is the average number of accidents

```
df_crashes.accidents - df_crashes.accidents.mean()
```

```
abbrev
AL 3.009804
AK 2.309804
...
WI -1.990196
WY 1.609804
Name: accidents, Length: 51, dtype: float64
```

```
df_crashes['real_residuals'] = df_crashes.accidents - df_crashes.accidents.mean()
df_crashes
```

We square the residuals for the same reason as before (to convert negative values into positive ones):

```
df_crashes['real_residuals^2'] = df_crashes.real_residuals**2
```

$$ TSS = \sum(y_i - \bar y)^2 $$

where

- $y_i$ is the number of accidents
- $\bar y$ is the average number of accidents
- TSS: Total Sum of Squares

And we add up the values to get the **Total Sum of Squares (TSS)**:

```
df_crashes['real_residuals^2'].sum()
```

```
849.5450980392156
```

```
TSS = df_crashes['real_residuals^2'].sum()
```

The ratio between RSS and TSS represents how much our model fails relative to the variation of the real data.

```
RSS/TSS
```

```
0.2730507033334595
```

0.27 represents the **badness** of the model, as **RSS** contains the **residuals** (errors) of the model.

To calculate the **goodness** of the model, we subtract the ratio RSS/TSS from 1:

$$ R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(y_i - \hat{y})^2}{\sum(y_i - \bar y)^2} $$

```
1 - RSS/TSS
```

```
0.7269492966665405
```

The model can explain 72.69% of the variability in the total number of accidents.
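Putting both sums together, the hand-made R² reproduces the formula exactly. A compact self-contained sketch with toy values:

```python
import numpy as np

y = np.array([18.8, 18.1, 13.8, 17.4])      # real values
y_hat = np.array([17.3, 15.1, 14.4, 17.6])  # model predictions

rss = ((y - y_hat) ** 2).sum()     # residual sum of squares (model errors)
tss = ((y - y.mean()) ** 2).sum()  # total sum of squares (real-data variation)
r2 = 1 - rss / tss                 # goodness of the model
```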

The following image describes how we calculate the goodness of the model.

How do we get the numbers of the mathematical equation of the Linear Regression?

- We need to look inside the object `model_lr` and show its attributes with `.__dict__` (the numbers were computed by the `.fit()` function):

```
model_lr.__dict__
```

```
{'fit_intercept': True,
'normalize': 'deprecated',
'copy_X': True,
'n_jobs': None,
'positive': False,
'feature_names_in_': array(['alcohol'], dtype=object),
'n_features_in_': 1,
'coef_': array([2.0325063]),
'_residues': 231.9688865331006,
'rank_': 1,
'singular_': array([12.22681605]),
'intercept_': 5.857776154826299}
```

- `intercept_` is the (a) number of the mathematical equation
- `coef_` is the (b) number of the mathematical equation

$$ accidents = (a) + (b) \cdot alcohol \\ accidents = (intercept\_) + (coef\_) \cdot alcohol \\ accidents = 5.857 + 2.032 \cdot alcohol $$

For every unit of alcohol increased, the number of accidents will increase by 2.032 units.

```
import pandas as pd
df_to_pred = pd.DataFrame({'alcohol': [1,2,3,4,5]})
df_to_pred['pred_lr'] = 5.857 + 2.032 * df_to_pred.alcohol
df_to_pred['diff'] = df_to_pred.pred_lr.diff()
df_to_pred
```

Could we make a better model that improves the current Linear Regression Score?

```
model_lr.score(X=features, y=target)
```

```
0.7269492966665405
```

- Let's try a Random Forest and a Support Vector Machine.

Do we need to know the maths behind these models to implement them in Python?

No. As we explain in this tutorial, all you need to do is:

- `.fit()`
- `.predict()`
- `.score()`
- Repeat

`RandomForestRegressor()` in Python:

```
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor()
model_rf.fit(X=features, y=target)
```

```
RandomForestRegressor()
```

```
model_rf.predict(X=features)
```

```
array([18.644 , 16.831 , 17.54634286, 21.512 , 12.182 ,
13.15 , 12.391 , 17.439 , 7.775 , 17.74664286,
14.407 , 18.365 , 15.101 , 14.132 , 13.553 ,
15.097 , 15.949 , 19.857 , 21.114 , 15.53 ,
13.241 , 8.98 , 14.363 , 9.54 , 17.208 ,
16.593 , 22.087 , 16.24144286, 14.478 , 11.51 ,
11.59 , 18.537 , 11.77 , 17.54634286, 23.487 ,
14.907 , 20.462 , 12.59 , 18.38 , 12.449 ,
23.487 , 20.311 , 19.004 , 19.22 , 9.719 ,
13.476 , 12.333 , 11.08 , 22.368 , 14.67 ,
17.966 ])
```

```
df_crashes['pred_rf'] = model_rf.predict(X=features)
```

```
model_rf.score(X=features, y=target)
```

```
0.9549469198566546
```

Let's create a dictionary that stores the Score of each model:

```
dic_scores = {}
dic_scores['lr'] = model_lr.score(X=features, y=target)
dic_scores['rf'] = model_rf.score(X=features, y=target)
```

`SVR()` in Python:

```
from sklearn.svm import SVR
model_sv = SVR()
model_sv.fit(X=features, y=target)
```

```
SVR()
```

```
model_sv.predict(X=features)
```

```
array([18.29570777, 15.18462721, 17.2224187 , 18.6633175 , 12.12434781,
13.10691581, 13.31612684, 16.21131216, 12.66062465, 17.17537208,
13.34820949, 19.38920329, 14.91415215, 14.65467023, 14.2131504 ,
13.41560202, 14.41299448, 16.39752499, 19.4896662 , 15.20002787,
13.62200798, 11.5390483 , 13.47824339, 11.49818909, 17.87053595,
17.9144274 , 19.60736085, 17.24170425, 15.73585463, 12.35136579,
11.784815 , 16.53431108, 12.53373232, 17.2224187 , 19.4773929 ,
16.01115736, 18.56379706, 12.06891287, 18.30002795, 14.25171609,
19.59597679, 19.37950461, 18.32794218, 19.29994413, 12.26345665,
13.84847453, 12.25128025, 12.38791686, 19.48212198, 15.27397732,
18.1357253 ])
```

```
df_crashes['pred_sv'] = model_sv.predict(X=features)
```

```
model_sv.score(X=features, y=target)
```

```
0.7083438012012769
```

```
dic_scores['sv'] = model_sv.score(X=features, y=target)
```

The best model is the Random Forest with a Score of 0.95:

```
pd.Series(dic_scores).sort_values(ascending=False)
```

```
rf 0.954947
lr 0.726949
sv 0.708344
dtype: float64
```

Let's put the following data:

```
df_crashes[['accidents', 'pred_lr', 'pred_rf', 'pred_sv']]
```

Into a plot:

```
sns.scatterplot(x='alcohol', y='accidents', data=df_crashes, label='Real Data')
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes, label='Linear Regression')
sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange')
sns.scatterplot(x='alcohol', y='pred_rf', data=df_crashes, label='Random Forest')
sns.scatterplot(x='alcohol', y='pred_sv', data=df_crashes, label='Support Vector Machines');
```

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Ask the author any questions on **Twitter** or **LinkedIn**

Look at the following example as an aspiration of what you can achieve if you fully understand and replicate this whole tutorial with your own data.

Let's load a dataset that contains information from transactions in tables (rows) at a restaurant considering socio-demographic and economic variables (columns).

```
import seaborn as sns
df_tips = sns.load_dataset('tips')
df_tips
```

Grouping data to summarise the information helps you reach conclusions. For example, the summary below shows that **Dinners on Sundays** bring the best customers because they:

- Spend more on average (\$21.41)
- Tip more on average (\$3.25)
- Bring more people to the table on average (2.84 people)

```
df_tips.groupby(by=['day', 'time'])\
.mean(numeric_only=True)\
.fillna(0)\
.style.format('{:.2f}').background_gradient(axis=0)
```

```
df_tips.groupby(by=['day', 'time'])\
.mean(numeric_only=True)\
.fillna(0)\
.style.format('{:.2f}').bar(axis=0, width=50, align='zero')
```

Let's dig into the details of the `.groupby()` function from the basics in the following sections.

We use the `.groupby()` function to generate an object that contains as many `DataFrames` as there are categories in the column.

```
df_tips.groupby('sex')
```

As we have two groups in sex (Female and Male), the length of the `DataFrameGroupBy` object returned by the `.groupby()` function is 2:

```
len(df_tips.groupby('sex'))
```
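A minimal self-contained sketch of the same idea (hypothetical rows, not the real tips dataset):

```python
import pandas as pd

# Toy data with two categories in the 'sex' column
df_demo = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female'],
    'total_bill': [16.99, 10.34, 21.01, 24.59],
})

grouped = df_demo.groupby('sex')
n_groups = len(grouped)  # one DataFrame per category: Female and Male
```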

How can we work with the `DataFrameGroupBy` object?

We use the `.mean()` function to get the average of the numerical columns for the two groups:

```
df_tips.groupby('sex').mean(numeric_only=True)
```

A pretty and simple syntax to summarise the information, right?

- But what's going on inside the `DataFrameGroupBy` object?

```
df_tips.groupby('sex')
```

```
df_grouped = df_tips.groupby('sex')
```

The `DataFrameGroupBy` object contains 2 `DataFrames`. To see one of them, you need to use the `.get_group()` function and pass the group whose `DataFrame` you'd like to return:

```
df_grouped.get_group('Male')
```

```
df_grouped.get_group('Female')
```

As the `DataFrameGroupBy` distinguishes the categories, the moment we apply an aggregation function, we get the mathematical operations for those groups separately:

```
df_grouped.mean(numeric_only=True)
```

We could apply the function to each `DataFrame` separately, although *that is not the point of the `.groupby()` function*:

```
df_grouped.get_group('Male').mean(numeric_only=True)
```

```
df_grouped.get_group('Female').mean(numeric_only=True)
```

To get the results for just 1 column of interest, we access the column:

```
df_grouped.total_bill
```

And apply the aggregation function we wish, `.sum()` in this case:

```
df_grouped.total_bill.sum()
```

We get the result for just one column (total_bill) because the `DataFrames` generated by `.groupby()` are accessed as if they were regular `DataFrames`:

```
df_grouped.get_group('Female')
```

```
df_grouped.get_group('Female').total_bill
```

```
df_grouped.get_group('Female').total_bill.sum()
```

```
df_grouped.get_group('Male').total_bill.sum()
```

```
df_grouped.total_bill.sum()
```
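The equivalence above can be replayed on a minimal self-contained sketch (toy rows): summing one column of a group by hand gives the same number as the per-group aggregation.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female'],
    'total_bill': [16.99, 10.34, 21.01, 24.59],
})
grouped = df_demo.groupby('sex')

by_group = grouped.total_bill.sum()                            # both groups at once
female_by_hand = grouped.get_group('Female').total_bill.sum()  # one group manually
```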

So far, we have summarised the data based on the categories of just one column. But, what if we'd like to summarise the data **based on the combinations** of the categories within different categorical columns?

```
df_tips.groupby(by=['day', 'smoker']).sum()
```

We could have also used another function, `.pivot_table()`, to get the same numbers:

```
df_tips.pivot_table(index='day', columns='smoker', aggfunc='sum')
```
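A self-contained sketch with toy rows confirms the two routes agree; only the layout differs (a long MultiIndex Series versus a wide grid):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'day':    ['Thur', 'Thur', 'Fri', 'Fri'],
    'smoker': ['No',   'Yes',  'No',  'Yes'],
    'tip':    [2.0,     3.0,    1.5,   4.0],
})

# Same aggregation, two layouts
via_groupby = df_demo.groupby(by=['day', 'smoker']).tip.sum()
via_pivot = df_demo.pivot_table(index='day', columns='smoker',
                                values='tip', aggfunc='sum')
```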

Which one is best?

- I leave it up to your choice; I'd prefer to use `.pivot_table()` because the syntax makes it more accessible.

It doesn't stop here; we can even compute different aggregation functions at the same time:

```
df_tips.groupby(by=['day', 'smoker'])\
.total_bill\
.agg(func=['sum', 'mean'])
```

```
df_tips.pivot_table(index='day', columns='smoker',
values='total_bill', aggfunc=['sum', 'mean'])
```

```
dfres = df_tips.pivot_table(index='day', columns='smoker',
values='total_bill', aggfunc=['sum', 'mean'])
```

You could even style the output `DataFrame`:

```
dfres.style.background_gradient()
```

For me, it's nicer than styling the DataFrame returned by `.groupby()`.

As we say in Spain:

Pa' gustos los colores! ("To each their own!")

```
df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])
```

```
dfres = df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])
```

```
dfres.style.background_gradient()
```

We can compute more than one mathematical operation:

```
df_tips.pivot_table(index='sex', columns='time',
aggfunc=['sum', 'mean'], values='total_bill')
```

And use more than one column in each of the parameters:

```
df_tips.pivot_table(index='sex', columns='time',
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
```

```
df_tips.pivot_table(index=['day', 'smoker'], columns='time',
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
```

```
df_tips.pivot_table(index=['day', 'smoker'], columns=['time', 'sex'],
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
```

The `.size()` function counts the number of rows (observations) in each of the `DataFrames` generated by `.groupby()`:

```
df_grouped.size()
```

```
df_tips.groupby(by=['sex', 'time']).size()
```
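A minimal sketch (hypothetical rows) of what `.size()` returns:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'sex':  ['Female', 'Male', 'Male', 'Female', 'Male'],
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
})

# One row count per (sex, time) combination
counts = df_demo.groupby(by=['sex', 'time']).size()
```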

We can also use `.pivot_table()` to represent the data more clearly:

```
df_tips.pivot_table(index='sex', columns='time', aggfunc='size')
```

```
df_tips.pivot_table(index='smoker', columns=['day', 'sex'],aggfunc='size')
```

```
dfres = df_tips.pivot_table(index='smoker', columns=['day', 'sex'], aggfunc='size')
```

```
dfres.style.background_gradient()
```

```
df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')
```

```
dfres = df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')
```

```
dfres.style.background_gradient()
```

We can even choose the way we'd like to gradient-colour the cells:

- `axis=1`: highlights the largest value among the columns of the same row
- `axis=0`: highlights the largest value among the rows of the same column

```
dfres.style.background_gradient(axis=1)
```


The following image is pretty self-explanatory to understand how APIs work:

- The API is the waiter who
- Takes the requests from the clients
- Brings them to the kitchen
- And later serves the "cooked" response back to the clients

The URL is an address we use to locate files on the Internet:

- Documents: pdf, ppt, docx,...
- Multimedia: mp4, mp3, mov, png, jpeg,...
- Data Files: csv, json, db,...

Check out the following gif where we inspect the resources we download when locating https://economist.com.


An Application Programming Interface (API) is a communication tool between the client and the server to exchange information through a URL.

The API defines the rules by which the URL will work. Like Python, the API contains:

- Functions
- Parameters
- Accepted Values

The only extra knowledge we need to consider is the use of **tokens**.

A token is a code you use in the request to validate your identity, as most platforms charge money to use their API.

```
token = 'PASTE_YOUR_TOKEN_HERE'
```

You can find it in the website documentation.

```
'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'
```

Every time you make a **call to an API** requesting some information, you later receive a **response**.

Check this JSON, a type of file that stores structured data returned by the API.

If you want to know more about the JSON format, see this article.

- Base API:
`https://www.alphavantage.co/query?`

- Parameters:
`function=TIME_SERIES_INTRADAY`

`symbol=IBM`

`interval=5min`

`apikey=demo`
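The same URL can be assembled from its parts with the standard library's `urllib.parse.urlencode`, which keeps the parameters readable (a sketch; no request is sent):

```python
from urllib.parse import urlencode

base = 'https://www.alphavantage.co/query'
params = {
    'function': 'TIME_SERIES_INTRADAY',
    'symbol': 'IBM',
    'interval': '5min',
    'apikey': 'demo',
}

# urlencode joins the key=value pairs with '&', in insertion order
api_call = f'{base}?{urlencode(params)}'
```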

```
import requests
api_call = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'
requests.get(url=api_call)
```

```
>>> <Response [200]>
```

```
res = requests.get(url=api_call)
```

The function returns an object containing all the information related to the **API request and response**.

```
res.apparent_encoding
```

```
>>> 'ascii'
```

```
res.headers
```

```
>>> {'Date': 'Mon, 18 Jul 2022 18:01:19 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Cookie', 'X-Frame-Options': 'SAMEORIGIN', 'Allow': 'GET, HEAD, OPTIONS', 'Via': '1.1 vegur', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '72cd1f3959323851-MAD', 'Content-Encoding': 'gzip'}
```

```
res.history
```

```
>>> []
```

To place the response into a Python-interpretable object, we use the `.json()` function to get a dictionary with the data:

```
res.json()
```

```
>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
'2. Symbol': 'IBM',
'3. Last Refreshed': '2022-06-29 19:25:00',
'4. Interval': '5min',
'5. Output Size': 'Compact',
'6. Time Zone': 'US/Eastern'},
'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100',
'2. high': '140.7100',
'3. low': '140.7100',
'4. close': '140.7100',
'5. volume': '531'},
...
'2022-06-28 17:25:00': {'1. open': '142.1500',
'2. high': '142.1500',
'3. low': '142.1500',
'4. close': '142.1500',
'5. volume': '100'}}}
```

```
data = res.json()
```

The data in the dictionary represents the symbol **IBM** in intervals of **5min** for the **TIME_SERIES_INTRADAY**.

Check the dictionary above to confirm.

```
res.request.path_url
```

```
>>> '/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'
```

We need to change the value of the `symbol` parameter within the URL we use to call the API:

```
stock = 'AAPL'
api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey=demo'
res = requests.get(url=api_call)
res.json()
```

```
>>> {'Information': 'The **demo** API key is for demo purposes only. Please claim your free API key at (https://www.alphavantage.co/support/#api-key) to explore our full API offerings. It takes fewer than 20 seconds.'}
```

The API returns a JSON which implicitly says we previously used a ***demo** API key* to retrieve data for the symbol IBM. Nevertheless, we cannot use the same demo API key to retrieve the AAPL stock data.

We should include our token in the API call:

```
token
```

```
>>> 'YOUR_PASTED_TOKEN_ABOVE'
```

```
api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'
res = requests.get(url=api_call)
data = res.json()
data
```

```
>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
'2. Symbol': 'AAPL',
'3. Last Refreshed': '2022-07-15 20:00:00',
'4. Interval': '5min',
'5. Output Size': 'Compact',
'6. Time Zone': 'US/Eastern'},
'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100',
'2. high': '140.7100',
'3. low': '140.7100',
'4. close': '140.7100',
'5. volume': '531'},
...
'2022-06-28 17:25:00': {'1. open': '142.1500',
'2. high': '142.1500',
'3. low': '142.1500',
'4. close': '142.1500',
'5. volume': '100'}}}
```

Can we apply powerful functions directly to `data`? No. Why? Because `data` contains a dictionary, which is a very simple Python object:

```
data.sum()
```

```
>>>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 data.sum()
AttributeError: 'dict' object has no attribute 'sum'
```

We need to create a `DataFrame` out of this dictionary to have a powerful object to which we can apply many functions.


```
import pandas as pd
pd.DataFrame(data=data)
```

We'd like to have the open, high, close,... variables as the columns, not `Meta Data` and `Time Series (5min)`. Why is this happening?

- `Meta Data` and `Time Series (5min)` are the `keys` of the dictionary `data`.
- The value of the `Time Series (5min)` key is the information we want in the DataFrame.

```
data['Time Series (5min)']
```

```
>>> {'2022-07-15 20:00:00': {'1. open': '150.0300',
'2. high': '150.0700',
'3. low': '150.0300',
'4. close': '150.0300',
'5. volume': '4752'},
...
'2022-06-28 17:25:00': {'1. open': '142.1500',
'2. high': '142.1500',
'3. low': '142.1500',
'4. close': '142.1500',
'5. volume': '100'}
```

```
pd.DataFrame(data['Time Series (5min)'])
```

```
df_apple = pd.DataFrame(data['Time Series (5min)'])
```

The `DataFrame` is not represented as we'd like because the dates are in the columns and the variables are in the index. Which function can we use to transpose the `DataFrame`?

```
df_apple.transpose()
```

```
df_apple = df_apple.transpose()
```
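The two steps (nested dict → `DataFrame` → transpose) can be replayed on a miniature, self-contained stand-in for the API response:

```python
import pandas as pd

# Miniature stand-in for data['Time Series (5min)']
mini = {
    '2022-06-29 19:25:00': {'1. open': '140.71', '4. close': '140.71'},
    '2022-06-29 19:20:00': {'1. open': '140.70', '4. close': '140.68'},
}

# Outer keys (timestamps) start as columns; transposing puts them in the index
df_mini = pd.DataFrame(mini).transpose()
```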

Let's get the average value from the close price:

```
df_apple['4. close']
```

```
>>> 2022-07-15 20:00:00 150.0300
2022-07-15 19:55:00 150.0700
...
2022-07-15 11:45:00 149.1500
2022-07-15 11:40:00 149.1100
Name: 4. close, Length: 100, dtype: object
```

```
df_apple['4. close'].mean()
```

```
>>>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: could not convert string to float: '150.0300150.0700150.0400...149.1100'
During handling of the above exception, another exception occurred:
...
TypeError: Could not convert 150.0300150.0700150.0400...149.1100 to numeric
```

Why are we getting this ugly error?

- The values of the `Series` aren't numerical objects.

```
df_apple.dtypes
```

```
>>> 1. open object
2. high object
3. low object
4. close object
5. volume object
dtype: object
```

Can you change the type of the values into numerical objects?

```
df_apple = df_apple.apply(pd.to_numeric)
```
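What `pd.to_numeric` does can be seen on a tiny standalone `Series` of price strings:

```python
import pandas as pd

prices = pd.Series(['150.03', '150.07', '149.15'])  # dtype: object (strings)
prices_num = pd.to_numeric(prices)                  # dtype: float64

avg = prices_num.mean()  # now arithmetic works as expected
```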

Now that we have the `Series` values as numerical objects:

```
df_apple.dtypes
```

```
>>> 1. open float64
2. high float64
3. low float64
4. close float64
5. volume int64
dtype: object
```

We should be able to get the average close price:

```
df_apple['4. close'].mean()
```

```
>>> 149.551566
```

What else could we do?

```
df_apple.hist();
```

```
df_apple.hist(layout=(2,3), figsize=(15,8));
```

```
token = 'PASTE_YOUR_TOKEN_HERE'
stock = 'AAPL'
api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'
res = requests.get(url=api_call)
data = res.json()
df_apple = pd.DataFrame(data=data['Time Series (5min)'])
df_apple = df_apple.transpose()
df_apple = df_apple.apply(pd.to_numeric)
df_apple.hist(layout=(2,3), figsize=(15,8));
```

```
size='full'
info_type = 'TIME_SERIES_DAILY'
api_call = f'https://www.alphavantage.co/query?function={info_type}&symbol={stock}&outputsize={size}&apikey={token}'
res = requests.get(url=api_call)
data = res.json()
df_apple_daily = pd.DataFrame(data['Time Series (Daily)'])
df_apple_daily = df_apple_daily.transpose()
df_apple_daily = df_apple_daily.apply(pd.to_numeric)
df_apple_daily.index = pd.to_datetime(df_apple_daily.index)
df_apple_daily.plot.line(layout=(2,3), figsize=(15,8), subplots=True);
```


Programming is all about working with data.

We can work with many types of data structures. Nevertheless, the pandas DataFrame is the most useful because it contains functions that automate a lot of work with a single line of code.

This tutorial will teach you how to work with the `pandas.DataFrame` object.

But first, we will demonstrate why working with simple arrays (what most people do) makes your life more difficult than it should be.

An array is any object that can store **more than one object**. For example, the `list`:

```
[100, 134, 87, 99]
```

Let's say we are talking about the revenue our e-commerce has had over the last 4 months:

```
list_revenue = [100, 134, 87, 99]
```

We want to calculate the total revenue (i.e., we sum up the objects within the list):

```
list_revenue.sum()
```

```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 list_revenue.sum()
AttributeError: 'list' object has no attribute 'sum'
```

The list is a *poor* object that doesn't contain powerful functions.

What can we do then?

We convert the list to a powerful object such as the `Series`, which comes from the `pandas` library.

```
import pandas
pandas.Series(list_revenue)
```

```
>>>
0 100
1 134
2 87
3 99
dtype: int64
```

```
series_revenue = pandas.Series(list_revenue)
```

Now we have a powerful object that can perform the `.sum()` function:

```
series_revenue.sum()
```

```
>>> 420
```

Within the Series, we can find more objects.

```
series_revenue
```

```
>>>
0 100
1 134
2 87
3 99
dtype: int64
```

```
series_revenue.index
```

```
>>> RangeIndex(start=0, stop=4, step=1)
```

Let's change the elements of the index:

```
series_revenue.index = ['1st Month', '2nd Month', '3rd Month', '4th Month']
```

```
series_revenue
```

```
>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
dtype: int64
```

```
series_revenue.values
```

```
>>> array([100, 134, 87, 99])
```

```
series_revenue.name
```

The `Series` doesn't contain a name. Let's define it:

```
series_revenue.name = 'Revenue'
```

```
series_revenue
```

```
>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
Name: Revenue, dtype: int64
```

The values of the Series (right-hand side) are determined by their **data type** (alias `dtype`):

```
series_revenue.dtype
```

```
>>> dtype('int64')
```

Let's change the values' dtype to `float` (decimal numbers):

```
series_revenue.astype(float)
```

```
>>>
1st Month 100.0
2nd Month 134.0
3rd Month 87.0
4th Month 99.0
Name: Revenue, dtype: float64
```

```
series_revenue = series_revenue.astype(float)
```

What else could we do with the Series object?

```
series_revenue.describe()
```

```
>>>
count 4.000000
mean 105.000000
std 20.215506
min 87.000000
25% 96.000000
50% 99.500000
75% 108.500000
max 134.000000
Name: Revenue, dtype: float64
```

```
series_revenue.plot.bar();
```

```
series_revenue.plot.barh();
```

```
series_revenue.plot.pie();
```

The `DataFrame` is a set of Series.

We will create another Series, `series_expenses`, and later put them together into a DataFrame.

```
pandas.Series(
    data=[20, 23, 21, 18],
    index=['1st Month', '2nd Month', '3rd Month', '4th Month'],
    name='Expenses'
)
```

```
>>>
1st Month 20
2nd Month 23
3rd Month 21
4th Month 18
Name: Expenses, dtype: int64
```

```
series_expenses = pandas.Series(
    data=[20, 23, 21, 18],
    index=['1st Month', '2nd Month', '3rd Month', '4th Month'],
    name='Expenses'
)
```

```
pandas.DataFrame(data=[series_revenue, series_expenses])
```

```
df_shop = pandas.DataFrame(data=[series_revenue, series_expenses])
```

Let's transpose the DataFrame to have the variables in columns:

```
df_shop.transpose()
```

```
df_shop = df_shop.transpose()
```

```
df_shop.index
```

```
>>> Index(['1st Month', '2nd Month', '3rd Month', '4th Month'], dtype='object')
```

```
df_shop.columns
```

```
>>> Index(['Revenue', 'Expenses'], dtype='object')
```

```
df_shop.values
```

```
>>>
array([[100., 20.],
[134., 23.],
[ 87., 21.],
[ 99., 18.]])
```

```
df_shop.shape
```

```
>>> (4, 2)
```
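Because each column of `df_shop` is a Series, arithmetic between columns is vectorized. For instance, a hypothetical `Profit` column (not part of the original walkthrough; the DataFrame is rebuilt here so the snippet is self-contained) can be derived in one line:

```python
import pandas

# rebuild the shop DataFrame from the values used above
df_shop = pandas.DataFrame({
    'Revenue': [100.0, 134.0, 87.0, 99.0],
    'Expenses': [20.0, 23.0, 21.0, 18.0],
}, index=['1st Month', '2nd Month', '3rd Month', '4th Month'])

# element-wise subtraction between two Series
df_shop['Profit'] = df_shop['Revenue'] - df_shop['Expenses']
print(df_shop['Profit'].sum())  # → 338.0
```

No loops needed: `pandas` aligns the two Series on their index and subtracts row by row.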

What else could we do with the DataFrame object?

```
df_shop.describe()
```

```
df_shop.plot.bar();
```

```
df_shop.plot.pie(subplots=True);
```

```
df_shop.plot.line();
```

```
df_shop.plot.area();
```

We could also export the DataFrame to formatted data files:

```
df_shop.to_excel('data.xlsx')
```

```
df_shop.to_csv('data.csv')
```

```
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/football_players_stats.json'
pandas.read_json(url, orient='index')
```

```
df_football = pandas.read_json(url, orient='index')
```

```
df_football.Goals.plot.pie();
```

```
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/best_tennis_players_stats.json'
pandas.read_json(path_or_buf=url, orient='index')
```

```
df_tennis = pandas.read_json(path_or_buf=url, orient='index')
```

```
df_tennis.style.background_gradient()
```

```
df_tennis.plot.pie(subplots=True, layout=(2,3), figsize=(10,6));
```

```
pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]
```

```
df_laliga = pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]
```

```
df_laliga.Pts.plot.barh();
```

```
df_laliga.Pts.sort_values().plot.barh();
```

```
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/internet_usage_spain.csv'
pandas.read_csv(filepath_or_buffer=url)
```

```
df_internet = pandas.read_csv(filepath_or_buffer=url)
```

```
df_internet.hist();
```

```
df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')
```

```
dfres = df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')
```

```
dfres.style.background_gradient('Greens', axis=1)
```

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Machine Learning is a field that focuses on **getting a mathematical equation** to make predictions, although not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

- **Classifiers** to predict **Categorical Variables**
- **Regressors** to predict **Numerical Variables**

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

The Machine wants to get the best numbers of a mathematical equation such that **the difference between reality and predictions is minimum**:

- A **Classifier** evaluates the model based on the **prediction success rate**: $$ y \stackrel{?}{=} \hat y $$
- A **Regressor** evaluates the model based on the **distance between real data and predictions** (the residuals): $$ y - \hat y $$

There are many Machine Learning Models of each type.

You don't need to know the process behind each model because they all work the same way (see article). In the end, you will choose the one that makes the best predictions.

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

**Table of Important Content**

- How to preprocess/clean the data to fit a Machine Learning model?
    - Dummy Variables
    - Missing Data
- How to **visualize** a Decision Tree model in Python step by step?
- How to **interpret** the nodes and leaf values of a Decision Tree plot?
- How to **evaluate** Classification models?
    - Accuracy
    - Confusion Matrix
    - Sensitivity
    - Specificity
    - ROC Curve
- How to compare Classification models to choose the best one?

- This dataset represents **people** (rows) aboard the Titanic
- And their **sociological characteristics** (columns)

```
import seaborn as sns #!
import pandas as pd
df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]
df_titanic
```

We should know from the previous chapter that we need a function accessible from a class in the `sklearn` library.

```
from sklearn.tree import DecisionTreeClassifier
```

We instantiate the class to create our own copy of its blueprint, so we don't "modify" the source code:

```
model_dt = DecisionTreeClassifier()
```

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

```
model_dt.fit()
```

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in <module>
----> 1 model_dt.fit()
TypeError: fit() missing 2 required positional arguments: 'X' and 'y'
```

Why is it asking for two parameters, `X` and `y`?

- `y`: target ~ dependent ~ label ~ class variable
- `X`: explanatory ~ independent ~ feature variables

```
target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')
```

```
model_dt.fit(X=explanatory, y=target)
```

```
---------------------------------------------------------------------------
ValueError: could not convert string to float: 'male'
```

Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

The error says:

```
ValueError: could not convert string to float: 'male'
```

From this, we can interpret that the function `.fit()` does **not accept string values** like the ones in the `sex` column:

```
df_titanic
```

Therefore, we need to convert the categorical columns to **dummies** (0s & 1s):

```
pd.get_dummies(df_titanic, drop_first=True)
```

```
df_titanic = pd.get_dummies(df_titanic, drop_first=True)
```
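To see what `drop_first=True` does, here is a minimal sketch on a made-up `colour` column (not from the Titanic data): each category becomes a 0/1 column, and the first category is dropped because it's implied by the others.

```python
import pandas as pd

# a made-up categorical column, just for illustration
df_mini = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green']})

# one 0/1 column per category; the first category ('blue') is dropped
dummies = pd.get_dummies(df_mini, drop_first=True).astype(int)
print(dummies.columns.tolist())  # → ['colour_green', 'colour_red']
# a row with 0 in both columns is implicitly 'blue'
```

Dropping the first category avoids a redundant column: knowing `colour_green` and `colour_red` already determines `colour_blue`.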

We separate the variables again to take into account the latest modification:

```
explanatory = df_titanic.drop(columns='survived')
target = df_titanic[['survived']]
```

Now we should be able to fit the model:

```
model_dt.fit(X=explanatory, y=target)
```

```
---------------------------------------------------------------------------
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
```

The data passed to the function contains **missing data** (`NaN`): precisely 177 people for whom we don't have the age:

```
df_titanic.isna()
```

```
df_titanic.isna().sum()
```

```
survived 0
age 177
sex_male 0
embarked_Q 0
embarked_S 0
class_Second 0
class_Third 0
dtype: int64
```

Who are the people who lack the information?

```
mask_na = df_titanic.isna().sum(axis=1) > 0
```

```
df_titanic[mask_na]
```

What could we do with them?

- Drop the people (rows) who miss the age from the dataset.
- Fill the age by the average age of other combinations (like males who survived)
- Apply an algorithm to fill them.

We'll choose **option 1 to simplify the tutorial**.
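For the curious, option 2 could look like the sketch below (not used in the rest of the tutorial; the toy values are made up): fill each missing age with the mean age of the matching sex/survival group.

```python
import pandas as pd
import numpy as np

# toy data standing in for the Titanic columns (values are made up)
df = pd.DataFrame({
    'survived': [0, 0, 1, 1],
    'sex_male': [1, 1, 0, 0],
    'age':      [20.0, np.nan, 30.0, np.nan],
})

# group-wise mean imputation: each NaN takes the mean age of its group
df['age'] = df.groupby(['survived', 'sex_male'])['age'] \
              .transform(lambda s: s.fillna(s.mean()))
print(df['age'].tolist())  # → [20.0, 20.0, 30.0, 30.0]
```

This keeps all the rows at the cost of making an assumption about the missing values.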

Therefore, we go from 891 people:

```
df_titanic
```

To 714 people:

```
df_titanic.dropna()
```

```
df_titanic = df_titanic.dropna()
```

We separate the variables again to take into account the latest modification:

```
explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']
```

Now we shouldn't have any more trouble with the data to fit the model.

We don't get any errors because we correctly preprocessed the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

```
model_dt.fit(X=explanatory, y=target)
model_dt.__dict__
```

```
{'criterion': 'gini',
'splitter': 'best',
'max_depth': None,
'min_samples_split': 2,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.0,
'max_features': None,
'max_leaf_nodes': None,
'random_state': None,
'min_impurity_decrease': 0.0,
'class_weight': None,
'ccp_alpha': 0.0,
'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second',
'class_Third'], dtype=object),
'n_features_in_': 6,
'n_outputs_': 1,
'classes_': array([0, 1]),
'n_classes_': 2,
'max_features_': 6,
'tree_': <sklearn.tree._tree.Tree at 0x16612cce0>}
```

We have a fitted `DecisionTreeClassifier`. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

```
model_dt.predict_proba(X=explanatory)[:5]
```

```
array([[0.82051282, 0.17948718],
[0.05660377, 0.94339623],
[0.53921569, 0.46078431],
[0.05660377, 0.94339623],
[0.82051282, 0.17948718]])
```

Let's create a new `DataFrame` to keep the target and the predictions together, to understand the topic better:

```
df_pred = df_titanic[['survived']].copy()
```

And add the predictions:

```
df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1] #keep the survival-probability column
df_pred
```

How have we calculated those predictions?

The **Decision Tree** model doesn't have a mathematical equation per se; instead, it has a set of conditions represented in a tree:

```
from sklearn.tree import plot_tree
plot_tree(decision_tree=model_dt);
```

There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

```
model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)
plot_tree(decision_tree=model_dt);
```

Let's make the image bigger:

```
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt);
```

The conditions look like `X[2] <= 0.5`, where `X[2]` means the 3rd variable of the explanatory ones (Python starts counting at 0). If we'd like to see the names of the columns, we need to add the `feature_names` parameter:

```
explanatory.columns
```

```
Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second',
'class_Third'],
dtype='object')
```

```
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);
```

Let's add some colours to see how the predictions will go based on the fulfilled conditions:

```
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
```

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who took the Titanic in 3rd class but didn't survive:

```
df_titanic[:1]
```

To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

```
explanatory[:1]
```

To the function `.predict_proba()`, and get a survival probability of 17.94%:

```
model_dt.predict_proba(X=explanatory[:1])
```

```
array([[0.82051282, 0.17948718]])
```

But wait, how did we get to the probability of survival of 17.94%?

Let's explain it step-by-step with the Decision Tree visualization:

```
plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
```

Based on the tree, the conditions are:

- sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

- age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

The final node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

```
77/429
```

```
0.1794871794871795
```

We get the same probability; John had a 17.94% chance of surviving the Titanic accident.

As always, we should have a function to calculate the goodness of the model:

```
model_dt.score(X=explanatory, y=target)
```

```
0.8025210084033614
```

The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

As we saw earlier, the classification model calculates the probability of an event occurring. The function `.predict_proba()` gives us two probabilities in the columns: people who didn't survive (0) and people who survived (1).

```
model_dt.predict_proba(X=explanatory)[:5]
```

```
array([[0.82051282, 0.17948718],
[0.05660377, 0.94339623],
[0.53921569, 0.46078431],
[0.05660377, 0.94339623],
[0.82051282, 0.17948718]])
```

We take the positive probabilities in the second column:

```
df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]
```

To compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and into 0 otherwise.

```
import numpy as np
df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred
```

The simple idea of the accuracy is to get the success rate on the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

```
comp = df_pred.survived == df_pred.pred_dt
comp
```

```
0 True
1 True
...
889 False
890 True
Length: 714, dtype: bool
```

If we sum the boolean Series, Python will count True as 1 and False as 0 to compute the number of correct classifications:

```
comp.sum()
```

```
573
```

We get the score by dividing the successes by all possibilities (the total number of people):

```
comp.sum()/len(comp)
```

```
0.8025210084033614
```

It is also correct to do the mean on the comparisons because it's the sum divided by the total. Observe how you get the same number:

```
comp.mean()
```

```
0.8025210084033614
```

But it's simpler to calculate this metric with the function `.score()`:

```
model_dt.score(X=explanatory, y=target)
```

```
0.8025210084033614
```

Can we say that our model is 80.25% good and be happy with it?

- We should not because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:

```
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)
CM = ConfusionMatrixDisplay(cm)
CM.plot();
```

- Looking at the first number of the confusion matrix, we have 407 people who didn't survive the Titanic, in reality and in the predictions.
- That is not the case with the number 17: our model classified 17 people as survivors when they weren't.
- The success rate of the negative class (people who didn't survive) is called the **specificity**: $407/(407+17)$.
- The success rate of the positive class (people who did survive) is called the **sensitivity**: $166/(166+124)$.
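Both rates can also be cross-checked with `sklearn.metrics.recall_score` (a quick sketch on made-up toy labels, not the Titanic data; specificity is simply the recall of the negative class, obtained with `pos_label=0`):

```python
from sklearn.metrics import confusion_matrix, recall_score

# toy labels: 3 real negatives (one misclassified), 3 real positives (one missed)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
specificity = cm[0, 0] / cm[0, :].sum()  # 2/3
sensitivity = cm[1, 1] / cm[1, :].sum()  # 2/3

# recall_score reproduces the same numbers
assert specificity == recall_score(y_true, y_pred, pos_label=0)
assert sensitivity == recall_score(y_true, y_pred, pos_label=1)
```

The manual row-wise division and the library call agree, which is a handy sanity check.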

```
cm[0,0]
```

```
407
```

```
cm[0,:]
```

```
array([407, 17])
```

```
cm[0,0]/cm[0,:].sum()
```

```
0.9599056603773585
```

```
specificity = cm[0,0]/cm[0,:].sum()
```

```
cm[1,1]
```

```
166
```

```
cm[1,:]
```

```
array([124, 166])
```

```
cm[1,1]/cm[1,:].sum()
```

```
0.5724137931034483
```

```
sensitivity = cm[1,1]/cm[1,:].sum()
```

We could have gotten the same metrics using the function `classification_report()`. Look at the recall column of rows 0 and 1: the specificity and the sensitivity, respectively:

```
from sklearn.metrics import classification_report

report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)
print(report)
```

```
              precision    recall  f1-score   support

           0       0.77      0.96      0.85       424
           1       0.91      0.57      0.70       290

    accuracy                           0.80       714
   macro avg       0.84      0.77      0.78       714
weighted avg       0.82      0.80      0.79       714
```

We can also create a nice `DataFrame` to later use the data for simulations:

```
report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt,
    output_dict=True
)
pd.DataFrame(report)
```

Our model is not as good as we thought: when predicting the people who survived, we only get 57.24% of the survivors right.

How can we then assess a reasonable rate for our model?

Watch the following video to understand how the Area Under the Curve (AUC) is a good metric because it sort of combines accuracy, specificity and sensitivity:

We compute this metric in Python as follows:

```
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
estimator_name='example estimator')
display.plot()
plt.show()
```

```
roc_auc
```

```
0.8205066688353937
```
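As a side note, `sklearn.metrics.roc_auc_score` computes the same number in one call, which can be verified on a toy example (the labels and scores below are made up):

```python
from sklearn import metrics

y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# long way: build the curve, then integrate it
fpr, tpr, _ = metrics.roc_curve(y, scores)
auc_from_curve = metrics.auc(fpr, tpr)

# short way: one call
auc_direct = metrics.roc_auc_score(y, scores)

assert auc_from_curve == auc_direct
print(auc_direct)  # → 0.75
```

The curve-based route is still useful when you also want the plot; the one-liner is handy when you only need the metric.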

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

`RandomForestClassifier()` in Python

```
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)
```

```
RandomForestClassifier()
```

```
df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred
```

```
model_rf.score(X=explanatory, y=target)
```

```
0.9117647058823529
```

`SVC()` in Python

```
from sklearn.svm import SVC
model_sv = SVC()
model_sv.fit(X=explanatory, y=target)
```

```
SVC()
```

```
df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred
```

```
model_sv.score(X=explanatory, y=target)
```

```
0.6190476190476191
```

To simplify the explanation, we use accuracy as the metric to compare the models. The Random Forest is the best model, with an accuracy of 91.17%.
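A caveat worth flagging (my addition, not from the original walkthrough): all of these accuracies are computed on the same data the models were fitted on, so flexible models like the Random Forest look better than they would on unseen people. A common sanity check is a train/test split, sketched here on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data (made up): 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # optimistic
test_acc = model.score(X_test, y_test)     # closer to real-world performance
print(train_acc, test_acc)
```

The gap between the two scores hints at how much the model is memorizing rather than generalizing.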

```
model_dt.score(X=explanatory, y=target)
```

```
0.8025210084033614
```

```
model_rf.score(X=explanatory, y=target)
```

```
0.9117647058823529
```

```
model_sv.score(X=explanatory, y=target)
```

```
0.6190476190476191
```

```
df_pred.head(10)
```


What do I need to start a `Django` project?

- It is recommended that you create a new environment
- And that you have `Anaconda` installed. If not, click here to download & install it.

You need to install the library in your `terminal` **(use Anaconda Prompt for Windows Users)**:

`conda create -n django_env django`

`conda activate django_env`

Ok, you got it. What's next?

- **Open a Code Editor application** to start working more comfortably with the project
- I use Visual Studio Code (aka VSCode); you may download & install it here

What should I do within VSCode?

- You will use the `Django CLI`, already installed with the `Django` package
- To **create the standard folders and files** you need for the application

Type the following line within the `terminal`:

`django-admin startproject shop`

What should I see on my computer after this?

- If you open your `user folder`, you will see that a folder `shop` has been created
- `drag & drop` it to VSCode
- Now check the folder structure and familiarize yourself with the files & folders
- The folder structure should look like this

```
- shop/
    - manage.py
    - shop/
        - __init__.py
        - settings.py
        - urls.py
        - asgi.py
        - wsgi.py
```

Do I need to study all of them?

- No, just go with the flow, and you'll get to understand everything at the end

Ok, what's the next step?

- You'll probably want to see your Django App up and running, right?
- Then, go over to the `terminal` and write the following

```
cd shop
python manage.py runserver
```

- A local server has opened at http://127.0.0.1:8000/; open it in a `web browser`
- It references the `localhost`, and you should see something like this

What if I try another `URL`, like http://127.0.0.1:8000/products?

- You will receive an **error**, because you didn't tell Django what to do when you go to http://127.0.0.1:8000/products

How can I tell that to Django?

Create a `products` App within the `shop` Django Project with the following line of code:

`python manage.py startapp products`

- Create an URL within the file
`shop > urls.py`

```
from django.contrib import admin
from django.urls import path, include # modified
urlpatterns = [
    path('products/', include('products.urls')), # added
    path('admin/', admin.site.urls),
]
```

Create a `View` (HTML code) to be rendered when you go to the `URL` http://127.0.0.1:8000/products.

Within the file `shop > products > views.py`:

```
from django.http import HttpResponse

def view_for_products(request):
    return HttpResponse("This function will render `HTML` code that makes you see this <p style='color: red'>text in red</p>.")
```

See this tutorial if you want to know a bit more about `HTML`.

Call the function `view_for_products` when you click on http://127.0.0.1:8000/products.

You need to create the file `urls.py` within products: `shop > products > urls.py`

```
from django.urls import path
from . import views
urlpatterns = [
    path('', views.view_for_products, name='index'),
]
```

Why do we reference the `URLs` in two files, one in `shop/urls.py` and the other in `products/urls.py`?

- It is a best practice to have a `Django` project separated into different `Apps`
- In this case, we created the `products` App
- In the file `shop/urls.py`, you reference the `products` URLs here:

```
urlpatterns = [
    path('products/', include('products.urls')), #here
    path('admin/', admin.site.urls),
]
```

- So that when you navigate to `https://127.0.0.1:8000/products`
- You will have access to the URLs defined in `shop/products/urls.py`
- For example, let's create another View in `shop/products/views.py`

```
def new_view(request):
    return HttpResponse('This is the <strong>new view</strong>')
```

- And reference it in the file `shop/products/urls.py`

```
from django.urls import path
from . import views
urlpatterns = [
    path('', views.view_for_products, name='index'),
    path('pepa', views.new_view, name='pepa'), # new url
]
```

- We don't need to reference the View in `shop/urls.py`
- We can access all the URLs in `shop/products/urls.py` because we wrote `include('products.urls')` in the file `shop/urls.py`
- Try to go to https://127.0.0.1:8000/products/pepa

So, each time I want to render different `HTML`, do I need to create a View?

Yes, that's how the Model View Template (MVT) pattern works:

- You introduce a `URL`
- The `URL` activates a `View`
- And `HTML` code gets rendered on the website

Why don't you mention anything about the `model`?

- Well, that's something to cover in the following article. COMING SOON!

Any doubts?

Let me know in the comments; I'd be happy to help!


We used just two variables out of the seven we had in the whole DataFrame.

We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been **harder to plot seven variables with seven axes in a graph**.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

- Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for **dimensionality reduction** techniques. Specifically, the Principal Component Analysis (PCA).

Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce **car accidents**.

You won't create 51 TV campaigns, one for each of the **States of the USA** (rows). Instead, you will see which States behave similarly to cluster them into 3 groups based on the variation across their features (columns).

```
import seaborn as sns #!
df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes
```

Check this website to understand the measures of the following data.

From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.

For example, it is not the same to increase 1kg of weight than 1m of height.

We will use the `StandardScaler()` algorithm:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]
```

```
array([[ 0.73744574, 1.1681476 , 0.43993758, 1.00230055, 0.27769155,
-0.58008306, 0.4305138 ],
[ 0.56593556, 1.2126951 , -0.21131068, 0.60853209, 0.80725756,
0.94325764, -0.02289992],
[ 0.68844283, 0.75670887, 0.18761539, 0.45935701, 1.03314134,
0.0708756 , -0.98177845],
[ 1.61949811, -0.48361373, 0.54740815, 1.67605228, 1.95169961,
-0.33770122, 0.32112519],
[-0.92865317, -0.39952407, -0.8917629 , -0.594276 , -0.89196792,
-0.04841772, 1.26617765]])
```
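Under the hood, `StandardScaler` just subtracts each column's mean and divides by its standard deviation. A quick sketch verifying that equivalence on toy numbers (made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaled = StandardScaler().fit_transform(X)

# same thing by hand: (x - mean) / std, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(scaled, manual)
print(scaled[:, 0])  # → [-1.22474487  0.          1.22474487]
```

After scaling, every column has mean 0 and standard deviation 1, which is what makes the features comparable.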

Let's turn the array into a DataFrame for better understanding:

```
import pandas as pd
df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled
```

Now we see all the variables having the same scale (i.e., around the same limits):

```
df_scaled.agg(['min', 'max'])
```

We follow the usual Scikit-Learn procedure to develop Machine Learning models.

```
from sklearn.cluster import KMeans
```

```
model_km = KMeans(n_clusters=3)
```

```
model_km.fit(X=df_scaled)
```

```
KMeans(n_clusters=3)
```

```
model_km.predict(X=df_scaled)
```

```
array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1], dtype=int32)
```

```
df_pred = df_scaled.copy()
```

```
df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred
```

Now let's visualize the clusters with a 2-axis plot:

```
sns.scatterplot(x='total', y='speeding', hue='pred',
data=df_pred, palette='Set1');
```

Does the visualization make sense?

- No, because the clusters should separate their points from others. Nevertheless, we see some green points in the middle of the blue cluster.

Why is this happening?

- We are **representing just 2 variables**, whereas the model was **fitted with 7 variables**. We can't see the points well separated because we are missing 5 variables in the plot.

Why don't we add 5 variables to the plot then?

- We could, but it'd be way too hard to interpret.

Then, what could we do?

- We can apply PCA, a dimensionality reduction technique. Take a look at the following video to understand this concept:

`PCA()` is another technique used to transform data.

How has the data been manipulated so far?

- Original Data: `df_crashes`

```
df_crashes
```

- Normalized Data: `df_scaled`

```
df_scaled
```

- Principal Components Data: `df_pca` (computed now)

```
from sklearn.decomposition import PCA
pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]
```

```
array([[ 1.60367129, 0.13344927, 0.31788093, -0.79529296, -0.57971878,
0.04622256, 0.21018495],
[ 1.14421188, 0.85823399, 0.73662642, 0.31898763, -0.22870123,
-1.00262531, 0.00896585],
[ 1.43217197, -0.42050562, 0.3381364 , 0.55251314, 0.16871805,
-0.80452278, -0.07610742],
[ 2.49158352, 0.34896812, -1.78874742, 0.26406388, -0.37238226,
-0.48184939, -0.14763646],
[-1.75063825, 0.63362517, -0.1361758 , -0.97491605, -0.31581147,
0.17850962, -0.06895829]])
```

```
df_pca = pd.DataFrame(data_pca)
df_pca
```

```
cols_pca = [f'PC{i}' for i in range(1, pca.n_components_+1)]
cols_pca
```

```
['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']
```

```
df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca
```

Let's visualize a **scatterplot** with `PC1` & `PC2`, colouring the points by cluster:

```
import plotly.express as px
px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)
```

Are they **mixed** now?

- No, they aren't.

That's because PC1 and PC2 together represent almost 80% of the variability of the original seven variables.

You can see the following array, where every element represents the amount of variability explained by every component:

```
pca.explained_variance_ratio_
```

```
array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011 ,
0.02837999, 0.00392491])
```

And the accumulated variability (79.88% until PC2):

```
pca.explained_variance_ratio_.cumsum()
```

```
array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 ,
0.99607509, 1. ])
```
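A handy trick (my addition, not in the original) is to let NumPy pick the smallest number of components reaching a variance threshold from this cumulative array:

```python
import numpy as np

# cumulative explained-variance ratios from the fitted PCA above
cumsum = np.array([0.57342168, 0.7988521, 0.87750953, 0.9275851,
                   0.9676951, 0.99607509, 1.0])

# first component whose cumulative ratio reaches 80% (argmax finds the
# first True in the boolean array; +1 converts the index to a count)
n_components = np.argmax(cumsum >= 0.80) + 1
print(n_components)  # → 3
```

scikit-learn can also do this selection for you: passing a float to the constructor, as in `PCA(n_components=0.80)`, keeps just enough components to explain that fraction of the variance.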

Which variables represent these two components?

The Principal Components are produced by a **mathematical equation** (once again), which is composed of the following weights:

```
df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights
```

We can observe that:

- Socio-demographical features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (higher influence) in PC1.
- Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2.

Principal Component Analysis is a technique that gathers the maximum variability of a set of features (variables) into Components.

Therefore, the first two Principal Components capture a good amount of the common variability, because we see two sets of variables that are correlated with each other:

```
df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);
```

I hope that everything is making sense so far.

To complete the explanation, you can see below how the `df_pca` values are computed:

For example, we can multiply the weights of PC1 with the original variables for **AL**abama:

```
(df_weights['PC1']*df_scaled.loc['AL']).sum()
```

```
1.6036712920638672
```

To get the transformed value of the Principal Component 1 for **AL**abama State:

```
df_pca.head()
```

The same operation applies to any value of `df_pca`.
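In matrix form, that per-row multiplication collapses to a single product: the centered data times the transposed component weights. A sketch checking this equivalence on synthetic random data (not the crashes dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in data, 20 rows x 4 features
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))

pca = PCA()
data_pca = pca.fit_transform(X)

# fit_transform is equivalent to centering, then projecting
# onto the component weights
manual = (X - pca.mean_) @ pca.components_.T

assert np.allclose(data_pca, manual)
```

So every transformed value is just a weighted sum of the (centered) original features, exactly as the Alabama example showed.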

Now, let's go back to the PCA plot:

```
px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))
```

How can we interpret the clusters with the components?

Let's add information to the points thanks to the interactive plots of the `plotly` library:

```
hover = '''
<b>%{customdata[0]}</b><br><br>
PC1: %{x}<br>
Total: %{customdata[1]}<br>
Alcohol: %{customdata[2]}<br><br>
PC2: %{y}<br>
Ins Losses: %{customdata[3]}<br>
Ins Premium: %{customdata[4]}
'''

fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
                 color=df_pred.pred.astype(str),
                 hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,
                             df_crashes.ins_losses, df_crashes.ins_premium])
fig.update_traces(hovertemplate=hover)
```

If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the `min` and `max` values across the socio-demographic features:

```
df_crashes.agg(['min', 'max'])
```

```
df_crashes.loc[['DC', 'SC'],:]
```

Apply the same reasoning to the two most extreme points along the y-axis. You will see the same for the *insurance* variables because they determine the positioning along PC2 (the y-axis).

```
df_crashes.agg(['min', 'max'])
```

```
df_crashes.loc[['ID', 'LA'],:]
```

Is there a way to represent the weights of the original data for the Principal Components and the points?

That's called a Biplot, shown below.

We can observe how the points are positioned along the loading vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.

```
import numpy as np

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
evr = (pca.explained_variance_ratio_ * 100).round(2)

fig = px.scatter(df_pca, x='PC1', y='PC2',
                 color=model_km.labels_.astype(str),
                 hover_name=df_pca.index,
                 labels={'PC1': f'PC1 ~ {evr[0]}%',
                         'PC2': f'PC2 ~ {evr[1]}%'})

# draw one loading vector (arrow + label) per original variable
for i, feature in enumerate(df_scaled.columns):
    fig.add_shape(type='line',
                  x0=0, y0=0,
                  x1=loadings[i, 0], y1=loadings[i, 1],
                  line=dict(color='red', width=3))
    fig.add_annotation(x=loadings[i, 0], y=loadings[i, 1],
                       ax=0, ay=0,
                       xanchor='center', yanchor='bottom',
                       text=feature)
fig.show()
```

Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great for grouping variables that behave similarly and then visualising many variables in just one component.

In short, you are simplifying the information in the data. In this example, we went from plotting seven dimensions down to only two. This doesn't come for free, though: the two components retain only around 80% of the data's original variability.
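If you want to check how much variability a given number of components keeps, a minimal sketch (using random stand-in data with seven features, as in `df_scaled`) could look like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with seven features, standing in for df_scaled
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 7))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components keeping at least 80% of the variability
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(cumulative.round(2))
```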

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Machine Learning Models are deployed to, for example:

- Predict objects within an image (**Tesla**) so that the car can take actions.
- Recommend songs (**Spotify**) so that you'd fall in love with the service.
- Rank the posts you are most likely to interact with (**Facebook or Twitter**) so that you will spend more time on the app.

If you just care about getting the code to make this happen, you can skip the storytelling and go straight to those lines on GitHub.

If you want to follow the tutorial and understand the topic in depth, let's get started.

Let's say that **we are a car sales company** and we want to make things easier for clients when they decide which car to buy.

They usually don't want a car that **consumes lots of fuel** (i.e., has a low `mpg`).

Nevertheless, *they won't know this until they use the car*.

Is there a way to **predict the consumption** based on other characteristics of the car?

- Yes, with a mathematical formula, for example:

```
consumption = 2 + 3 * acceleration + 2.1 * horsepower
```

We have **historical data** from all the car models we have sold over the past few years.

We could use this **data to calculate the BEST mathematical formula**.

And `deploy it to a website` with a form, so clients can solve the consumption question by themselves.

To make this happen, we will follow the structure:

- Create ML Model Object in Python
- Create an HTML Form
- Create Flask App
- Deploy to Heroku
- Visit Website and Make a Prediction

- This dataset contains information about **car models** (rows)
- For which we have some **characteristics** (columns)

```
import seaborn as sns
df = sns.load_dataset(name='mpg', index_col='name')[['acceleration', 'weight', 'mpg']]
df.sample(5)
```

| name | acceleration | weight | mpg |
| --- | --- | --- | --- |
| subaru | 17.8 | 2065 | 32.3 |
| bmw 2002 | 12.5 | 2234 | 26.0 |
| audi 5000 | 15.9 | 2830 | 20.3 |
| toyota corolla 1200 | 21.0 | 1836 | 32.0 |
| ford gran torino (sw) | 16.0 | 4638 | 14.0 |

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=df[['acceleration', 'weight']], y=df['mpg'])
model.__dict__
```

```
{'fit_intercept': True,
'normalize': False,
'copy_X': True,
'n_jobs': None,
'positive': False,
'n_features_in_': 2,
'coef_': array([ 0.25081589, -0.00733564]),
'_residues': 7317.984100916719,
'rank_': 2,
'singular_': array([16873.21840634, 49.92970477]),
'intercept_': 41.39982830200016}
```

And the BEST mathematical formula is:

```
consumption = 41.39 + 0.25 * acceleration - 0.0073 * weight
```
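As a sanity check, here is a sketch on synthetic stand-in data (made-up numbers, not the mpg dataset) showing that `intercept_` and `coef_` reproduce exactly what `model.predict()` returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical acceleration/weight data following a similar formula
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(10, 25, 100),       # acceleration
                     rng.uniform(1500, 5000, 100)])  # weight
y = 41.4 + 0.25 * X[:, 0] - 0.0073 * X[:, 1] + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)

# the fitted formula is: intercept_ + coef_[0]*acceleration + coef_[1]*weight
x_new = np.array([[15.0, 3000.0]])
manual = model.intercept_ + model.coef_ @ x_new[0]
print(np.isclose(manual, model.predict(x_new)[0]))  # → True
```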

Now we need to export the `LinearRegression()` object into a file:

- The object `LinearRegression()` contains the mathematical formula.
- We will use it on the website to make the `prediction`.

```
import pickle

with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

Now a file called `linear_regression_model.pkl` should appear in the **same folder as your script**.
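A minimal round-trip sketch, using synthetic stand-in data, shows that the unpickled object makes the same predictions as the original model:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small model on synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 2))
y = 2.0 + 3.0 * X[:, 0] - 0.5 * X[:, 1]
model = LinearRegression().fit(X, y)

# Serialize to disk, then load it back the way the web app will
with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('linear_regression_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

# The loaded object carries the same fitted formula
print(np.allclose(loaded.predict(X), model.predict(X)))  # → True
```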

All websites that you see online are displayed through an HTML file.

Therefore, we need to create an HTML file that contains a `form`

for the user to **input the data**.

And **calculate the prediction for the fuel consumption**.

Website example here

Let's head over to a code editor (VSCode in my case) and create a new file called `index.html`.

You may download Visual Studio Code (VSCode) here.

It should contain the following lines:

```
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form>
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input type="number" id="acceleration" name="acceleration" value="34" /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" value="12" /><br /><br />
      <input type="submit" value="Submit" />
    </form>
  </body>
</html>
```

If you open the file `index.html` in a browser, you will see the form and the `submit` button that is supposed to calculate the prediction. Nevertheless, if you click it, nothing will happen.

That is because we still need to develop the `Flask` application **to send the user input to the mathematical formula, calculate the prediction**, and return it to the website.

As we are going to develop a whole application to a web server (Heroku), we need to create a **dedicated environment** with just the necessary packages.

- Let's head over to the terminal and type the following commands:

```
python -m venv car_consumption_prediction
source car_consumption_prediction/bin/activate
```

- Now let's install the required packages:

```
pip install flask
pip install scikit-learn
```

Now you should open the folder `car_consumption_prediction` in a code editor and create a new folder `app` with two other folders inside:

```
- app
  - model
  - templates
```

- Then move the files we created before to their corresponding folders:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
```

Now that we have the project structure, let's continue with the core functionality.

We will build a **Python script that handles the user input** and makes the prediction for fuel consumption.

So, create a new file within the `app` folder called `app.py`.

PS: This is the most important file in a `Flask` app because it manages everything.

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
```

- And add the following lines of code:

```
import flask
import pickle

# load the fitted model we exported earlier
with open('model/linear_regression_model.pkl', 'rb') as f:
    model = pickle.load(f)

app = flask.Flask(__name__, template_folder='templates')

@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        return flask.render_template('index.html')
    elif flask.request.method == 'POST':
        # form values arrive as strings, so cast them to numbers
        acceleration = float(flask.request.form['acceleration'])
        weight = float(flask.request.form['weight'])
        input_variables = [[acceleration, weight]]
        prediction = model.predict(input_variables)[0]
        return flask.render_template('index.html',
                                     original_input={'Acceleration': acceleration,
                                                     'Weight': weight},
                                     result=prediction)

if __name__ == '__main__':
    app.run()
```
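One detail worth stressing: HTML form values arrive as strings, so they should be cast to numbers before being fed to the model. A tiny sketch, with a plain dict standing in for `flask.request.form`:

```python
# Hypothetical submitted values, standing in for flask.request.form
form = {'acceleration': '15.3', 'weight': '3100'}

# cast the string inputs to floats before calling model.predict()
input_variables = [[float(form['acceleration']), float(form['weight'])]]
print(input_variables)  # → [[15.3, 3100.0]]
```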

We need to pay attention to what's going on in the last `return ...`: the function `render_template()` is passing the objects from the parameters `original_input` and `result` to `index.html`.

Then, how can we use these variables in the file `index.html`? Copy-paste the following lines of code into `index.html`:

```
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form action="{{ url_for('main') }}" method="POST">
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input type="number" id="acceleration" name="acceleration" required /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" required /><br /><br />
      <input type="submit" value="Submit" />
    </form>
    <br />
    {% if result %}
    <p>
      The calculated fuel consumption is
      <span style="color: orange">{{result}}</span>
    </p>
    {% endif %}
  </body>
</html>
```

We made two changes to the file:

- Specify the action to take when the `form` is submitted: `<form action="{{ url_for('main') }}" method="POST">`
- Show the prediction below the form: `{% if result %} <p> The calculated fuel consumption is <span style="color: orange">{{result}}</span> </p> {% endif %}`

In this case, we had to use the conditional `if` to display `result` only if it exists, as `result` won't exist until the form is submitted and the `server` computes the prediction in `app.py`.

I did some research about an error in which Heroku wasn't working the way I expected, and found that I needed to add a `Procfile`.

Create a file in the `app` folder called `Procfile` (capital P, no extension), write the following line and save the file:

`web: gunicorn app:app`

The folder structure will now be:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
  - Procfile
```

Install the `gunicorn` package in the virtual environment. In the terminal:

`pip install gunicorn`

Now it's time to upload the application to Heroku so that anyone can get a prediction of fuel consumption given a car's `acceleration` and `weight`.

- Create an Account in Heroku.
- Download Heroku CLI
- Create the Heroku App within the Terminal:

```
heroku create ml-model-deployment-car-mpg
```

This will be turned into a website called https://ml-model-deployment-car-mpg.herokuapp.com/

PS: You should use a different name instead of `ml-model-deployment-car-mpg`; Heroku will turn your repository into a `url`.

**Commit the app files to your Heroku hosting.**

Run `git init` within the `car_consumption_prediction` folder.

Create a `requirements.txt` file with the list of required packages. You can create it automatically with:

`pip freeze > requirements.txt`

The folder structure will now be:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
  - Procfile
  - requirements.txt
```

Add the files for commit:

`git add .`

Commit the files and push them to the remote:

```
git commit -m 'some random message'
git push heroku master
```

That's all for the technical part.

Now if some user would like to use the app...

- Visit https://ml-model-deployment-car-mpg.herokuapp.com/
- Introduce some numbers in the form
- Submit and watch the prediction

- https://blog.cambridgespark.com/deploying-a-machine-learning-model-to-the-web-725688b851c7

It's when you start looking for the most prestigious master's degrees related to today's most in-demand jobs, and you find some related to Artificial Intelligence and Python, and others about Big Data or Machine Learning. With so much on offer, your curiosity is piqued and you weigh up the different opportunities. **You discover an interest in programming** that was hidden before, since computing was never your strong suit, and you choose a master's suited to your **new interests**. Of course, this master's will make you stand out in the job market, easing the way towards your desired position.

Once you have chosen the course, you apply for enrolment and get accepted, which brings you great joy and extra motivation to start your studies. At first you keep up with everything, but at a certain point things get **harder than you expected** and you find yourself truly lost. At this point, you look for **help on Google** and in the bibliography your teachers continually recommend, and even so you don't overcome the problems.

The master's is worth it and you have to finish it, so you decide to look for a private tutor to help you with the assignments and get through. At first you feel capable of passing the assignments while learning the solutions; however, as the days go by, you realise that **you don't have enough time** to do both things: the work piles up. Here you get nervous and start to lose the sense of learning, the reason you started this course. **The degree (credentialism) takes first place in your head** and all you want at this stage is to access the job prospects it offers, so, on top of the tuition fees, you spend a large amount of money on the tutor who helps you do the assignments. Meanwhile, you convince yourself that in the future you will study the content you are now leaving in other hands for lack of time, but your mantra becomes a **vicious circle you cannot escape**, since every job interview demands preliminary exercises that prove impossible for you to complete.

Your hopes and dreams fade away little by little.

After serious reflection, Pepa doesn't understand how she could have reached this situation; how she could have lost her confidence in so little time. What she doesn't know is that her case is very common, and that she is not the only person who has ended up like this. We, however, do know, because it is the story of many of our clients; it is the problem of several people who have contacted us. For this, we count on Jesús López, one of the best-rated private tutors in Spain. Jesús can help you better understand the Python programming language with a direct and dynamic application to Data Science. After teaching more than 300 people, with more than 3000 hours of teaching over the last two years, he has developed a methodology that connects all the topics of Data Science and makes sure you understand them so you can write code by yourself. The programme lasts around 25 hours, combining explanatory sessions, tasks for the student, and correction sessions.

Click here to see his students' reviews.

On our website you can see the programme's content to become a creative Data Scientist.

- A visual representation of the data

Which data? How is it usually structured?

- In a table. For example:

```
import seaborn as sns
df = sns.load_dataset('mpg', index_col='name')
df.head()
```

How can you visualize this `DataFrame`?

- We could make a point for every car based on:
  - weight
  - mpg

```
sns.scatterplot(x='weight', y='mpg', data=df);
```

Which conclusions can you make out of this plot?

Well, you may observe that the points descend as we move to the right.

This means that a higher `weight` of the car may lead to a lower capacity to cover miles per gallon (`mpg`).

How can you measure this relationship?

- Linear Regression

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=df[['weight']], y=df.mpg)
model.__dict__
```

- Resulting in

```
{'fit_intercept': True,
'normalize': False,
'copy_X': True,
'n_jobs': None,
'n_features_in_': 1,
'coef_': array([-0.00767661]),
'_residues': 7474.8140143821,
'rank_': 1,
'singular_': array([16873.20281508]),
'intercept_': 46.31736442026565}
```

Which is the mathematical formula for this relationship?

$$mpg = 46.31 - 0.00767 \cdot weight$$

- This equation means that `mpg` gets 0.00767 units lower for **every unit** that `weight` **increases**.
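To verify this interpretation, here is a sketch on hypothetical data following the same pattern, showing that raising `weight` by one unit changes the prediction by exactly `coef_[0]`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weight/mpg data following a similar relationship
rng = np.random.default_rng(3)
weight = rng.uniform(1500, 5000, size=(100, 1))
mpg = 46.3 - 0.00767 * weight[:, 0] + rng.normal(0, 1, 100)

model = LinearRegression().fit(weight, mpg)

# one extra unit of weight changes the prediction by exactly coef_[0]
delta = model.predict([[3001.0]])[0] - model.predict([[3000.0]])[0]
print(np.isclose(delta, model.coef_[0]))  # → True
```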

Could you visualise this equation in a plot?

- Absolutely, we could make the predictions from the original data and plot them.

```
y_pred = model.predict(X=df[['weight']])
dfsel = df[['weight', 'mpg']].copy()
dfsel['prediction'] = y_pred
dfsel.head()
```

| name | weight | mpg | prediction |
| --- | --- | --- | --- |
| chevrolet chevelle malibu | 3504 | 18.0 | 19.418523 |
| buick skylark 320 | 3693 | 15.0 | 17.967643 |
| plymouth satellite | 3436 | 18.0 | 19.940532 |
| amc rebel sst | 3433 | 16.0 | 19.963562 |
| ford torino | 3449 | 17.0 | 19.840736 |

From this table, you can observe that the predictions don't exactly match reality, but they approximate it.

For example, Ford Torino's `mpg` is 17.0, but our model predicts 19.84.

```
sns.scatterplot(x='weight', y='mpg', data=dfsel)
sns.scatterplot(x='weight', y='prediction', data=dfsel);
```

- The blue points represent the actual data.
- The orange points represent the predictions of the model.

I teach Python, R, Statistics & Data Science. I like to produce content that helps people to understand these topics better.

Feel free to give me feedback, as I would like to make my tutorials clearer and produce content that interests you.

You can see my Tutor Profile here if you need Private Tutoring lessons.

It's tough to find things that always work the same way in programming.

The steps of a Machine Learning (ML) model can be an exception.

Each time we want to compute a model *(mathematical equation)* and make predictions with it, we would always make the following steps:

- `model.fit()` to **compute the numbers** of the mathematical equation.
- `model.predict()` to **calculate predictions** through the mathematical equation.
- `model.score()` to measure **how good the model's predictions are**.
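The three steps can be sketched on a tiny, made-up dataset before we move to the real one:

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset: two features, binary target
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]]
y = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0)
model.fit(X=X, y=y)               # 1. compute the numbers of the equation
preds = model.predict(X=X)        # 2. calculate predictions
accuracy = model.score(X=X, y=y)  # 3. measure how good the predictions are
print(accuracy)  # → 1.0 (a tree can memorise such a tiny dataset)
```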

And I am going to show you this with 3 different ML models:

- `DecisionTreeClassifier()`
- `SVC()`
- `KNeighborsClassifier()`

But first, let's load a dataset from CIS executing the lines of code below:

- The goal of this dataset is to predict the `internet_usage` of people (rows)
- Based on their socio-demographic characteristics (columns)

```
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jsulopz/data/main/internet_usage_spain.csv')
df.head()
```

| | internet_usage | sex | age | education |
| --- | --- | --- | --- | --- |
| 0 | 0 | Female | 66 | Elementary |
| 1 | 1 | Male | 72 | Elementary |
| 2 | 1 | Male | 48 | University |
| 3 | 0 | Male | 59 | PhD |
| 4 | 1 | Female | 44 | PhD |

We need to transform the categorical variables to **dummy variables** before computing the models:

```
df = pd.get_dummies(df, drop_first=True)
df.head()
```
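Here is a minimal sketch of what `get_dummies(..., drop_first=True)` does, on a tiny made-up frame mirroring the dataset's categorical columns (one category per column is dropped to avoid redundant information):

```python
import pandas as pd

# A tiny hypothetical frame mirroring the dataset's categorical columns
df = pd.DataFrame({'sex': ['Female', 'Male', 'Male'],
                   'education': ['Elementary', 'University', 'PhD']})

# drop_first=True drops the first (alphabetical) category of each column
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())
# → ['sex_Male', 'education_PhD', 'education_University']
```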

Now we separate the variables on their respective role within the model:

```
target = df.internet_usage
explanatory = df.drop(columns='internet_usage')
```

```
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X=explanatory, y=target)
pred_dt = model.predict(X=explanatory)
accuracy_dt = model.score(X=explanatory, y=target)
```

```
from sklearn.svm import SVC
model = SVC()
model.fit(X=explanatory, y=target)
pred_sv = model.predict(X=explanatory)
accuracy_sv = model.score(X=explanatory, y=target)
```

```
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X=explanatory, y=target)
pred_kn = model.predict(X=explanatory)
accuracy_kn = model.score(X=explanatory, y=target)
```

The only thing that changes is the result of the predictions. The models are different. But they all follow the **same steps** described at the beginning:

- `model.fit()` to compute the mathematical formula of the model
- `model.predict()` to calculate predictions through the mathematical formula
- `model.score()` to get the success ratio of the model

You may observe in the following table how the *different models make different predictions*, which often don't coincide with reality (misclassification).

For example, `model_svm` doesn't correctly predict row 214: it predicts that this person *used the internet* (`pred_svm=1`), but they didn't (`internet_usage` for row 214 is 0 in reality).

```
df_pred = pd.DataFrame({'internet_usage': df.internet_usage,
                        'pred_dt': pred_dt,
                        'pred_svm': pred_sv,
                        'pred_lr': pred_kn})
df_pred.sample(10, random_state=7)
```

| | internet_usage | pred_dt | pred_svm | pred_lr |
| --- | --- | --- | --- | --- |
| 214 | 0 | 0 | 1 | 0 |
| 2142 | 1 | 1 | 1 | 1 |
| 1680 | 1 | 0 | 0 | 0 |
| 1522 | 1 | 1 | 1 | 1 |
| 325 | 1 | 1 | 1 | 1 |
| 2283 | 1 | 1 | 1 | 1 |
| 1263 | 0 | 0 | 0 | 0 |
| 993 | 0 | 0 | 0 | 0 |
| 26 | 1 | 1 | 1 | 1 |
| 2190 | 0 | 0 | 0 | 0 |

Then, we could choose the model with the **highest number of correct predictions** of reality.

```
df_accuracy = pd.DataFrame({'accuracy': [accuracy_dt, accuracy_sv, accuracy_kn]},
                           index=['DecisionTreeClassifier()', 'SVC()', 'KNeighborsClassifier()'])
df_accuracy
```

| | accuracy |
| --- | --- |
| DecisionTreeClassifier() | 0.859878 |
| SVC() | 0.783707 |
| KNeighborsClassifier() | 0.827291 |
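The accuracy that `model.score()` reports is simply the fraction of predictions that match reality, which you can compute by hand. A tiny sketch with made-up values:

```python
import pandas as pd

# Hypothetical actual values and predictions for five people
actual = pd.Series([0, 1, 1, 0, 1])
pred = pd.Series([0, 1, 0, 0, 1])

# accuracy = fraction of predictions that match reality (4 of 5 here)
accuracy = (actual == pred).mean()
print(accuracy)  # → 0.8
```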

Which is the best model here?

- Let me know in the comments below

A basic idea that we don't get at the beginning because we look for that perfect solution.

It doesn't exist.

It would be best to start thinking about choosing "one" option, not "the" option.

Let's say that we are facing the following problem: visualise two variables with a scatterplot.

```
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df.head()
```

In Python, you've got 3 libraries that can make a `scatterplot`:

- `matplotlib`
- `seaborn`
- `plotly`

Let's observe the differences:

```
import matplotlib.pyplot as plt
plt.scatter(x='total_bill', y='tip', data=df)
```

```
import seaborn as sns
sns.scatterplot(x='total_bill', y='tip', data=df)
```

```
import plotly.express as px
px.scatter(data_frame=df, x='total_bill', y='tip')
```

- `matplotlib` allows you to create custom plots, but you need to write more code.
- `seaborn` automates the plot so that you don't need to write as many lines. For example, `seaborn` added the x & y axis labels by default; `matplotlib` didn't.
- `plotly` allows you to interact with the plot. Give it a try and hover the mouse over the points.

If you are making a plot for an online post, you may like to use `plotly` due to its interactivity. Nevertheless, you wouldn't use it if you were writing a paper.


Don't make it any harder for yourself!

Start with Data Visualization.

It's easier to understand programming with visual changes than abstract coding ("make a program that prints even numbers").

Get on Jupyter, a code editor. Here is the link to [download the program](https://www.anaconda.com/products/individual).

Your first lines of code should be as follows:

```
import seaborn as sns
df = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', y='tip', data=df)
```

You would get a plot that should look like this one

To configure the behaviour of the function, you adjust the code as follows:

```
sns.scatterplot(x='total_bill', y='tip', data=df, color='red')
```

This simple change helps you to understand a couple of core concepts in programming:

- **Functions** (`sns.scatterplot()`) are used to make things happen in programming (a plot in this case).
- You use **parameters** (`color='red'`) to configure the function's behaviour.
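A minimal sketch of these two concepts, with a made-up function:

```python
# A function makes something happen; parameters configure how it happens
def greet(name, punctuation='!'):
    return 'Hello, ' + name + punctuation

print(greet('Ana'))                   # → Hello, Ana!
print(greet('Ana', punctuation='?'))  # → Hello, Ana?
```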

Feel free to ask me anything in the comments below; it will be my pleasure to help you out.

Looking for `shortcuts`? This is your tutorial!

Here are a couple of tutorials to install the tool I use to work with Python, and the most recommended one: Jupyter Lab.

If you click on the video, it will take you to a playlist with two videos: one to install on macOS (Apple) and another for Windows.

`tab`

Often we don't know the exact letters of a function's name. The upper and lower case letters can vary, or we swallow an `s` because they are English words.

These hesitations are over with the following trick:

`shift + tab`

We often turn to Google for help because we don't know what to put inside a function.

Well, if you press `shift + tab` with the cursor on any letter of the function, you will see a help panel.

This panel is an instruction manual on how to use the function.
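Incidentally, the text shown in that help panel comes from the function's docstring, which you can also read directly in Python (sketched here with a standard-library function):

```python
# The help panel's content is the function's docstring,
# also available via help() or the __doc__ attribute
import statistics

doc = statistics.mean.__doc__
print(doc.splitlines()[0])
```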

Curiously, we carry out routine actions the same way we learned them the first time.

It will cost us more to undo bad habits than to acquire good ones.

The two tips I've given will make the machine work for you, because you will know what can and cannot be done. You will avoid getting lost when you use functions or want to import an object or a library.

So it's worth using them every time you get the chance, to adopt them as a habit.

**Remember**:

- `tab` for suggestions.
- `shift + tab` for help.

Use them the next time you have the chance.

For a more detailed and dynamic explanation, I recommend you watch this video.