This morning, I read the Economist Espresso on India's pollution season, and I thought it was a good idea to start the series of challenges with this topic.

After browsing many websites, such as India's Central Pollution Control Board and the WHO, I found a website with Air Quality Data that lets us download measurements from many places worldwide.

I chose Delhi to be the city we will analyze in this challenge.

Executing the following lines of code will produce the DataFrame we'll work with:

```python
import pandas as pd

df = pd.read_csv('anand-vihar, delhi-air-quality.csv',
                 parse_dates=['date'], index_col=0)
df
```

I needed to process the data to deliver a workable dataset in the following way:

```python
# remove whitespace in column names
df.columns = df.columns.str.strip()

# extract the numbers (some cells were whitespace);
# cast to float since the extracted values are strings
series = df['pm25'].str.extract(r'(\w+)')[0].astype(float)

# 30-day rolling average to smooth the daily data
series_monthly = series.rolling(30).mean()

# drop the missing values introduced by the rolling window
series_monthly = series_monthly.dropna()

# fill remaining gaps by linear interpolation
series_monthly = series_monthly.interpolate(method='linear')

# sort the index to later make a reasonable plot
series_monthly = series_monthly.sort_index()

# aggregate the information by month
series_monthly = series_monthly.to_period('M').groupby(level='date').mean()

# back to timestamps to avoid errors with statsmodels' functions
series_monthly = series_monthly.to_timestamp()

# set the frequency to avoid errors with statsmodels' functions
series_monthly = series_monthly.asfreq('MS').interpolate()

# name the pandas.Series
series_monthly.name = 'air pollution pm25'
```

As we don't know the coding skills of each Study Circle member, we'll start with simple ARIMA models. From this point, we will iterate on the procedure and improve the dynamics.

To take on the challenge and maybe receive some feedback, you should fork this repository to your GitHub account. Otherwise, you can download this script.

The end goal is to develop an ARIMA model and plot the predictions against the actual data, resulting in a plot like the following.

Nevertheless, you can develop this challenge in any way you find attractive. The essential point of this Study Circle is the interactivity between the members to generate value and knowledge.

From your feedback, we could later work on different use cases. For example, we could later create a geospatial map in Python with the predictions.

So, let's get on and good luck!

You start with the following object:

Check out the following materials to learn how you could develop the challenge:

- Video Tutorial: How to develop ARIMA models to predict Stock Price

`series_monthly`

```
date
2014-01-01    286.023457
2014-02-01    281.428205
                 ...
2022-08-01    115.487097
2022-09-01    143.713333
Freq: MS, Name: air pollution pm25, Length: 105, dtype: float64
```

Observing the data as numbers is not the same as seeing it in a chart:

`series_monthly.plot();`

We aim to compute a mathematical equation that we will later use to calculate predictions, as we can see in the following chart:

There are many types of mathematical equations; the one we will use is `ARIMA`. Don't worry about the maths: we have a Python function that does it all for us.

`from statsmodels.tsa.arima.model import ARIMA`

The parameters of this class ask for two objects:

- `endog`: the data
- `order`: the tuple `(p, d, q)`, where
    - `p` is the first significant lag in the Partial Autocorrelation Plot,
    - `d` is the number of differences needed to make our data stationary,
    - `q` is the first significant lag in the Autocorrelation Plot.

### `d` | Differencing to get stationary data

The first thing we need to check about our data is stationarity. We use the Augmented Dickey-Fuller test, intending to reject the null hypothesis that the data is non-stationary. If we can't reject it, we need to difference the time series and set `d=1` in the parameter `order=(p, d, q)`.

```python
from statsmodels.tsa.stattools import adfuller

result = adfuller(series_monthly)
```

The p-value is the second element of the tuple that `adfuller` returns:

`result[1]`

`-> 0.4244071993737921`

The p-value is greater than 0.05. Therefore, we can't reject the null hypothesis.

Are we done here?

- No, we can difference the Time Series once and test again:

```python
series_monthly_diff_1 = series_monthly.diff().dropna()
result = adfuller(series_monthly_diff_1)
result[1]
```

`-> 2.4066471086483724e-24`

We can reject the null hypothesis and say that our data is stationary after one difference. Therefore, we need to set `d=1` in the `order` parameter of the `ARIMA()` class.

### `q` | Autocorrelation Plot

Now we need to determine `q` based on the first significant lag of the autocorrelation plot:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(series_monthly_diff_1, lags=50)
plt.xlabel('Lag');
```

The first significant lag is at 2, which means that our differenced (monthly) data is correlated every two months. We set `q=2`.

### `p` | Partial Autocorrelation Plot

We follow the same procedure to choose a number for `p`, but this time we use another type of plot: the Partial Autocorrelation.

```python
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(series_monthly_diff_1, lags=50, method='ywm')
plt.xlabel('Lag');
```

We see the first significant lag at 2. Therefore, we set `p=2`.

We now know the numbers to set in the `order` parameter: `order=(2, 1, 2)`. So, let's fit the mathematical equation of the model.

```python
model = ARIMA(series_monthly, order=(2, 1, 2))
result = model.fit()
result.summary()
```

And calculate the predictions:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
series_monthly.plot(label='Actual Data')
result.predict().plot(label='Predicted Data')
plt.legend()
plt.xticks(rotation=45);
```

Don't use a `for` loop if you want to summarise your daily Time Series by years. Instead, use the function `resample()` from pandas.

Let me explain it with an example.

We start by loading a DataFrame from a CSV file that contains information on the TSLA stock from 2017-2022.

```python
import pandas as pd

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tsla = pd.read_csv(filepath_or_buffer=url)
df_tsla
```

cc: @elonmusk

You're welcome for the promotion 😉

You must ensure that the `Date` column's dtype is `datetime64`. It must not be `object` (often interpreted as a string), as in the picture.

`df_tsla.dtypes.to_frame(name='dtype')`

We need to convert the `Date` column into a `datetime` dtype. To do so, we can use the function `pd.to_datetime()`:

```python
df_tsla.Date = pd.to_datetime(df_tsla.Date)
df_tsla.dtypes.to_frame(name='dtype')
```

Before getting into the `resample()` function, we need to **set the column Date as the index** of the DataFrame:

```python
df_tsla.set_index('Date', inplace=True)
df_tsla
```

Now let the magic happen; we'll get the maximum value of each column by each year with this simple line of code:

`df_tsla.resample(rule='Y').max()`

We can do many other things:

- Summarise by Quarter.
- Calculate the average and the standard deviation (volatility).

`df_tsla.resample(rule='Q').agg(['mean', 'std'])`

To finish, I always like to add a `background_gradient()` to the DataFrame:

`df_tsla.resample(rule='Y').max().style.background_gradient('Greens')`

If you enjoyed this, I'd appreciate it if you could support my work by spreading the word 😊

Sometimes, we want to select specific parts of the DataFrame to highlight some data points.

In this case, we refer to the topic as locating & filtering.

For example, let's load the dataset of cars:

```python
import seaborn as sns

df_mpg = sns.load_dataset('mpg', index_col='name')
df_mpg = df_mpg.drop(columns=['cylinders', 'model_year', 'origin'])
df_mpg
```

Let's filter the best cars in each statistic/column.

First, we calculate the maximum values in each column:

`df_mpg.max()`

```
mpg               46.6
displacement     455.0
horsepower       230.0
weight          5140.0
acceleration      24.8
dtype: float64
```

Then, we create a mask (array with True/False) to capture the rows where we have the cars with maximum values:

```python
mask_max = (df_mpg == df_mpg.max()).sum(axis=1) > 0
mask_max
```

```
name
chevrolet chevelle malibu    False
buick skylark 320            False
                             ...
ford ranger                  False
chevy s-10                   False
Length: 398, dtype: bool
```

Select the rows where the mask is True:

```python
df_mpg_max = df_mpg[mask_max].copy()
df_mpg_max
```

And add some styling:

`df_mpg_max.style.format('{:.0f}').background_gradient()`

To understand the reasoning behind the previous example, read the rest of the article, where we explain the logic from the most basic example to locating data based on the index.

By now, we should know the difference between the brackets `[]` and the parentheses `()`.

We use brackets to select parts of an object. For example, let's create a list of days:

`list_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']`

And select the second element:

`list_days[1]`

'Tuesday'

Or the last element:

`list_days[-1]`

'Sunday'

Until the third element (included):

`list_days[:3]`

['Monday', 'Tuesday', 'Wednesday']

Nevertheless, the `list` is a basic Python object. To get more functionality, we use the `Series` object from the `pandas` library.

Let's create a `Series` to store the **Apple Stock Return on Investment (ROI)** by quarters:

```python
import pandas as pd

sr_apple = pd.Series(
    data=[59.02, 63.57, 66.93, 69.05],
    index=['1Q', '2Q', '3Q', '4Q'])
sr_apple
```

```
1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

### `iloc` (integer-location) property

We use `.iloc[]` to select parts of the object based on the integer position of the element.

For example, let's select the first quarter ROI:

`sr_apple.iloc[0]`

59.02

Now, let's select the first and third quarters:

To select more than one element, we need to use double brackets `[[]]`:

`sr_apple.iloc[[0,2,3]]`

```
1Q    59.02
3Q    66.93
4Q    69.05
dtype: float64
```

Could we have accessed it with the label `1Q`?

`sr_apple.iloc['1Q']`

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot index by location index with a non-integer key
```

The `iloc` property only works with **integers** (the positions of the subelements we want).

To select the elements by their **label/name**, we need to use the `loc` property.

### `loc` (location) property

We select parts of an object with `.loc[]` based on the **label/name** of the `index`:

`sr_apple.loc['1Q']`

59.02

`sr_apple.loc[['1Q', '3Q', '4Q']]`

```
1Q    59.02
3Q    66.93
4Q    69.05
dtype: float64
```

If we tried to access by the position, we'd get an error:

`sr_apple.loc[0]`

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 0
```

It results in a `KeyError` because we don't have any key in the `index` equal to `0`:

`sr_apple`

```
1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

We have:

`sr_apple.keys()`

Index(['1Q', '2Q', '3Q', '4Q'], dtype='object')

The `loc` property only works **with the labels, not the position**.

Now we'd like to select parts based on a condition. For example, let's show the quarters we had a Return on Investment (ROI) above 60.

First, we create a boolean object based on the stated condition:

`sr_apple`

```
1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

`sr_apple > 60`

```
1Q    False
2Q     True
3Q     True
4Q     True
dtype: bool
```

`mask_60 = sr_apple > 60`

Now we pass the previous object to the `.loc` property:

`sr_apple.loc[mask_60]`

```
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

And here, we have the data for which the ROI is higher than 60.

### `[]`

`sr_apple`

```
1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

We could also access the data using only the brackets, without the `.loc` or `.iloc` properties:

`sr_apple['1Q']`

59.02

And also by the position:

`sr_apple[0]`

59.02

And the mask:

`sr_apple[mask_60]`

```
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64
```

So far, we have played with **1-Dimensional** objects. Now it's time to level up and play with **2-Dimensional** objects, like the `DataFrame`.

Let's play with a dataset of cars:

```python
import seaborn as sns

df_mpg = sns.load_dataset(name='mpg', index_col='name')
df_mpg
```

### `iloc` (integer-location) property

We can select the third row:

`df_mpg.iloc[2]`

```
mpg               18.0
cylinders            8
displacement     318.0
horsepower       150.0
weight            3436
acceleration      11.0
model_year          70
origin             usa
Name: plymouth satellite, dtype: object
```

And keep the `DataFrame` style if we use double brackets `[[]]`:

`df_mpg.iloc[[2]]`

We can also **slice** (a term also used for filtering) consecutive elements of the DataFrame with the **colon** `:`.

For example, let's select the first 4 rows:

`df_mpg.iloc[:4]`

Instead of:

`df_mpg.iloc[[0,1,2,3]]`

We can also select the columns we want.

For example, let's select the first 3 columns:

`df_mpg.iloc[:4, :3]`


Or the remaining columns after the first three:

`df_mpg.iloc[:4, 3:]`

Or the last 3 columns by using negative positions with `-`:

`df_mpg.iloc[:4, -3:]`

### `loc` (location) property

We can also select parts of the DataFrame based on the **index and column labels** (2 dimensions):

`df_mpg.loc[['ford torino', 'fiat 124 sport coupe'], ['origin', 'model_year', 'cylinders']]`

`df_mpg.loc[:'fiat 124 sport coupe', :'cylinders']`

Out of all the cars:

`df_mpg.index`

```
Index(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite',
       'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala',
       ...
       'ford mustang gl', 'vw pickup', 'dodge rampage', 'ford ranger',
       'chevy s-10'],
      dtype='object', name='name', length=398)
```

We could select all the **fiat** cars if we had a boolean array based on this condition:

```python
mask_fiat = df_mpg.index.str.contains('fiat')
mask_fiat
```

```
array([False, False, False, ..., False, False, False])
```

We can observe a few `True`s where we find some **Fiats**.

Let's filter them and show all the columns with the `:`:

`df_mpg.loc[mask_fiat, :]`

Although we could have omitted the `:`:

`df_mpg.loc[mask_fiat]`

### `&`

Just the Fiats whose horsepower is above 80:

```python
mask_hp = df_mpg.horsepower > 80
mask_hp
```

```
name
chevrolet chevelle malibu     True
buick skylark 320             True
                              ...
ford ranger                  False
chevy s-10                    True
Name: horsepower, Length: 398, dtype: bool
```

`df_mpg.loc[mask_hp & mask_fiat, :]`

### `|`

We could also select all Fiats **OR** cars whose horsepower is above 80:

`df_mpg.loc[mask_hp | mask_fiat, :]`

### `[]`

We can select the columns by their labels:

`df_mpg['acceleration']`

```
name
chevrolet chevelle malibu    12.0
buick skylark 320            11.5
                             ...
ford ranger                  18.6
chevy s-10                   19.4
Name: acceleration, Length: 398, dtype: float64
```

`df_mpg[['acceleration', 'origin', 'model_year']]`

But we can't select the rows by the index labels:

`df_mpg['amc rebel sst']`

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'amc rebel sst'
```

Unless we use the colon `:`:

`df_mpg[:'amc rebel sst']`

`df_mpg['buick skylark 320':'amc rebel sst']`

We can also select the rows by position:

`df_mpg[:4]`

But we can't select both rows and columns (2-Dimensions):

`df_mpg[:4,:3]`

```
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
...
InvalidIndexError: (slice(None, 4, None), slice(None, 3, None))
```

Unless we specify the columns we want in extra brackets:

`df_mpg[:4]['acceleration']`

```
name
chevrolet chevelle malibu    12.0
buick skylark 320            11.5
plymouth satellite           11.0
amc rebel sst                12.0
Name: acceleration, dtype: float64
```

`df_mpg[:4][['acceleration']]`

`df_mpg[:4][['acceleration', 'origin']]`

We can also select the rows given *boolean-arrays* (a.k.a. **masks**):

`df_mpg[mask_fiat]`

`df_mpg[mask_fiat | mask_hp]`

`df_mpg[mask_fiat & mask_hp]`

That doesn't mean we can't later select the columns we want; we just need to chain the selections:

`df_mpg[mask_fiat & mask_hp]['mpg']`

```
name
fiat 124 sport coupe    26.0
fiat 131                28.0
Name: mpg, dtype: float64
```

`df_mpg[mask_fiat & mask_hp][['mpg', 'origin', 'model_year']]`

Everything may be a bit confusing, but we hope you get the main idea behind `locating` and `masking`:

- We select parts of an object with brackets `[]`.
- We can access the data through:
    - the label/name: `loc`
    - the integer position: `iloc`
    - masks: *boolean arrays* based on conditions
    - just the brackets `[]`*
- Depending on the object's dimensions:
    - 1 dimension: `object[:]`
    - 2 dimensions: `object[:, :]`

*Carefully, because it has many variations of use, as we observed above.
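The recap above, condensed into runnable form with the ROI `Series` from earlier:

```python
import pandas as pd

sr_apple = pd.Series(data=[59.02, 63.57, 66.93, 69.05],
                     index=['1Q', '2Q', '3Q', '4Q'])

by_label = sr_apple.loc['2Q']        # label/name -> loc
by_position = sr_apple.iloc[1]       # integer position -> iloc
by_mask = sr_apple[sr_apple > 60]    # boolean mask -> just the brackets

# the first two routes agree on the second quarter's ROI,
# and the mask keeps every quarter above 60
```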

Let's load a dataset with various categorical columns since we summarise data based on categories, not numbers.

```python
df_tips = sns.load_dataset(name='tips')
df_tips
```

Let's make a pivot table to summarise the information to obtain a Hierarchical* DataFrame.

```python
dfres = df_tips.pivot_table(index=['smoker', 'time'],
                            columns='sex', aggfunc='size')
dfres
```

*A Hierarchical DataFrame (`MultiIndex`) contains two "columns" as an index, as we may observe below:

`dfres.index`

```
MultiIndex([('Yes',  'Lunch'),
            ('Yes', 'Dinner'),
            ( 'No',  'Lunch'),
            ( 'No', 'Dinner')],
           names=['smoker', 'time'])
```

Let's locate some parts of the Hierarchical DataFrame:

`dfres`

By using the `.loc` property:

`dfres.loc['Yes', :]`

`dfres.loc['No', :]`

As we have multiple index levels `[index1, index2, columns]`, we can select a part of the second index:

`dfres.loc[:, 'Lunch', :]`

`dfres.loc[:, 'Dinner', :]`

Let's now play with a DataFrame that has both a `MultiIndex` and `MultiColumns`:

```python
dfres = df_tips.pivot_table(index=['smoker', 'time'],
                            columns=['sex', 'day'], aggfunc='size')
dfres
```

We may observe two levels in the columns above.

### `loc` (location) property

We apply the same reasoning we used in the previous sections: `[index1, index2, column1, column2]`.

`dfres.loc['No', :, :, :]`

Although we can make it shorter:

`dfres.loc['No', :]`

The same applies to the second index:

`dfres.loc[:,'Dinner', :, :]`

`dfres.loc[:,'Dinner', :]`

Let's try to get Dinners on Sundays:

`dfres.loc[:, 'Dinner', :, 'Sun']`

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
...
IndexError: list index out of range
```

To make it work, this time we need to create an intermediate object to separate rows and columns:

```python
idx = pd.IndexSlice
dfres.loc[idx[:, 'Dinner'], idx[:, 'Sun']]
```

`dfres.loc[idx[:, 'Dinner'], idx['Male', :]]`

We can also use the `slice()` function:

`dfres.loc[('Yes', slice(None)), (slice(None), 'Sun')]`

`dfres.loc['Yes', ('Female', slice(None))]`

`dfres.loc[(slice(None), 'Lunch'), 'Female']`

`dfres.loc[(slice(None), 'Lunch'), ('Female', slice(None))]`

`dfres.loc[idx[:, 'Dinner'], idx['Female', :]]`

### `iloc` (integer-location) property

`dfres`

As always, we can select by the position of the values with the `iloc` property:

`dfres.iloc[:2, :2]`

`dfres.iloc[:2, 2:]`

Now, we will use a DataFrame that has a `DateTimeIndex`:

```python
df_tsla = pd.read_excel('tsla_stock.xlsx', index_col=0)
df_tsla
```

### `loc` (location) property

We can select parts of the DataFrame based on just one part of the `DateTimeIndex`. For example, we can select everything from the year 2020 onwards:

`df_tsla.loc['2020':]`

Until the last day of 2020:

`df_tsla.loc[:'2020']`

Between two years:

`df_tsla.loc['2019':'2020']`

One complete year:

`df_tsla.loc['2019']`

We can even select a specific `year-month`:

`df_tsla.loc['2019-06']`

### `iloc` (integer-location) property

Of course, we can also select parts of the DataFrame based on the position of the values with `iloc`:

`df_tsla.iloc[:4, :3]`

`df_tsla.iloc[-4:, :3]`

Take a look at this article to understand why Python is the programming language of the present and the future.

Okay, you've already got a reason to learn Python: you will have more chances because Python-related job offers will grow more and more over the coming years.

As the previous article states, you may end up in a job earning £65k a year; then you may not worry so much about money when making decisions as you move forward in life!

IT Jobs Watch, a website that specialises in collating salary data across the IT industry, states that the median annual salary in the UK for a role requiring Python skills is £65,000.

Before getting there, keep your feet on the ground, because you need to master Python first.

Don't think you need a Computer Science degree or a Data Science master's to master Python before getting the job.

The best motivation to learn anything is that they pay you for it. Therefore, it would help to prioritise getting a job where you increase your Python skills.

Companies care about getting shit done. Therefore, you must show them what you know about programming and how you program.

- Complete online courses to show what you know with certificates.
- Solve their assignments.
- Get your own data and experiment with the concepts you've learnt.
- Showcase your knowledge on GitHub to show how you program (you may look at Edo's profile to see his portfolio).

He followed our advice and got a job in two months. He applied to around a hundred job offers on LinkedIn where recruiters could see his certifications.

Make it easy at the beginning with easy-to-understand Python code.

Some people use scripts to code. It'd help if you turned to the notebook format because you can see the output of every line right away. Follow this tutorial to install Jupyter Lab, the best program for working with notebooks and writing your first lines of Python code.

I have found Data Visualization to be the best starting topic because you immediately see how the output changes as you change the code. It gives you a massive dose of energy.

Follow this Data Visualization tutorial to get a complete overview of Data Visualization development in Python. Then, play around with the lines of code: add more data points to the plots or change the colour of the figures.

Once you are motivated and comfortable using Python, it is time to follow a proper learning path.

You can follow any roadmaps, but please make sure you don't overestimate your skills and start developing Neural Networks if you don't know how to create a simple Linear Regression.

You can follow Edo's roadmap by looking at his certifications:

You can also read the following thread, where I placed links to practical exercises you can use in your portfolio.

The time has come to add another layer to the hierarchy of Machine Learning models.

Do we have the variable we want to predict in the dataset?

- YES: **Supervised Learning**
    - Predicting a Numerical Variable → Regression
    - Predicting a Categorical Variable → Classification
- NO: **Unsupervised Learning**
    - Grouping Data Points based on Explanatory Variables → Cluster Analysis

We may have, for example, all football players, and we want to group them based on their performance. But we don't know the groups beforehand. So what do we do then?

We apply Unsupervised Machine Learning models to group the players based on their position in the space determined by the explanatory variables: the closer the players are in that space, the more likely they'll be assigned to the same group.
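That "position in space" idea is just distance between points. A toy sketch with made-up numbers (not real player stats):

```python
import numpy as np

# Toy 2D "performance" points: two tight groups of hypothetical players
players = np.array([
    [1.0, 1.2],   # player A
    [1.1, 0.9],   # player B (close to A)
    [8.0, 7.9],   # player C
    [7.8, 8.2],   # player D (close to C)
])

# Pairwise Euclidean distances between all players
dist = np.linalg.norm(players[:, None, :] - players[None, :, :], axis=-1)

# A and B are far closer to each other than to C or D, so a clustering
# algorithm would put {A, B} and {C, D} in separate groups
```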

Another typical example comes from e-commerce companies that don't know whether their customers like clothing or tech, but do know how they interact with the website. Therefore, they group the customers to send promotional emails that match each group's tastes.

In short, we close the circle with the different types of Machine Learning models by adding this new type.

Let's now develop the Python code.

Imagine for a second that you are the President of the United States of America, and you are considering creating campaigns to reduce **car accidents due to alcohol** consumption while accounting for **insurance companies' losses** (columns).

You won't create 51 TV campaigns, one for each **US state** (rows). Instead, you will see which states behave similarly and cluster them into three groups.

```python
import seaborn as sns
import pandas as pd

df_crashes = sns.load_dataset(name='car_crashes',
                              index_col='abbrev')[['alcohol', 'ins_losses']]
df_crashes
```

We don't have any missing data in any of the columns:

`df_crashes.isna().sum()`

```
alcohol       0
ins_losses    0
dtype: int64
```

Nor do we need to convert categorical columns into *dummy variables*, because the two columns we are considering are numerical.

`df_crashes`

We should know from previous chapters that we need a function accessible from a Class in the `sklearn` library.

`from sklearn.cluster import KMeans`

Create an instance of the Class (a copy of the original code blueprint) so that we don't "modify" the source code.

`model_km = KMeans()`

The theoretical action we'd like to perform is the one we executed in previous chapters. Therefore, the function to compute the Machine Learning model should be called the same way:

`model_km.fit()`

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 model_km.fit()

TypeError: fit() missing 1 required positional argument: 'X'
```

The previous types of models asked for two parameters:

- `y`: target ~ dependent ~ label ~ class variable
- `X`: explanatory ~ independent ~ feature variables

Why is it asking for just one parameter now, `X`?

As we said before, this type of model (unsupervised learning) doesn't know the groups beforehand; they are known only after we compute the Machine Learning model. Therefore, it doesn't need to see a target variable `y`.

We don't need to separate the variables because we only have explanatory ones.

`model_km.fit(X=df_crashes)`

KMeans()

We have a fitted `KMeans`. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

`model_km.predict(X=df_crashes)`

array([7, 3, 0, 7, 6, 7, 6, 1, 3, 7, 7, 5, 4, 7, 0, 5, 3, 3, 2, 4, 2, 3, 1, 3, 1, 7, 4, 5, 7, 5, 1, 5, 1, 3, 0, 3, 6, 0, 1, 1, 5, 4, 1, 1, 0, 0, 1, 0, 1, 0, 5], dtype=int32)

We wanted to calculate three groups, but Python is calculating eight groups (the default). Let's modify this hyperparameter of the `KMeans` model:

```python
model_km = KMeans(n_clusters=3)
model_km.fit(X=df_crashes)
model_km.predict(X=df_crashes)
```

array([0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 2, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1], dtype=int32)

Let's create a new `DataFrame` to keep the original dataset untouched:

`df_pred = df_crashes.copy()`

And add the predictions:

```python
df_pred['pred_km'] = model_km.predict(X=df_crashes)
df_pred
```

How can we see the groups in the plot?

Can you observe that the k-Means only considers the variable `ins_losses` to determine the group a point belongs to? Why?

`sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km', palette='Set1', data=df_pred);`

The model measures the distance between the points. They seem to be spread around the plot, but they aren't; the plot doesn't show the points in perspective (it's lying to us).


Take a look at the following video to understand how the `KMeans` algorithm computes the Mathematical Equation by **calculating distances**:

The model understands the data as follows:

```python
import matplotlib.pyplot as plt

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km', palette='Set1', data=df_pred)
plt.xlim(0, 200)
plt.ylim(0, 200);
```

Now it's evident why the model only took `ins_losses` into account: it barely sees significant distances within `alcohol` compared to `ins_losses`.

In other words, with a metaphor: an increase of one kilogram of weight is not the same as an increase of one metre of height.
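We can quantify how one variable dominates the distance with a small sketch; the two points below use hypothetical values on the rough scales of `alcohol` (units) and `ins_losses` (hundreds).

```python
import math

# Two hypothetical states: (alcohol, ins_losses)
a = (2.0, 120.0)
b = (6.0, 160.0)

# Squared contribution of each variable to the Euclidean distance.
alcohol_sq = (a[0] - b[0]) ** 2   # 16
losses_sq = (a[1] - b[1]) ** 2    # 1600

dist = math.sqrt(alcohol_sq + losses_sq)
share = losses_sq / (alcohol_sq + losses_sq)
print(round(dist, 1))    # 40.2
print(round(share, 2))   # 0.99 -> ins_losses drives ~99% of the squared distance
```

With numbers like these, `alcohol` is practically invisible to any distance-based model, which is exactly what the plot above showed.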

Then, how can we create a `KMeans` model that compares the two variables equally?

- We need to scale the data (i.e., transform the values into the same range: from 0 to 1) with the `MinMaxScaler`.
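Before using the library, it helps to see the formula behind min-max scaling by hand: each value x maps to (x - min) / (max - min), so every column ends up in [0, 1].

```python
# Min-max scaling by hand on a toy list of values.
values = [6.0, 10.0, 14.0]
lo, hi = min(values), max(values)

scaled = [(x - lo) / (hi - lo) for x in values]
print(scaled)  # [0.0, 0.5, 1.0]
```

`MinMaxScaler` applies exactly this per-column transformation, remembering `lo` and `hi` from `fit()` so it can reuse them on new data.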

`MinMaxScaler()` the data

As with any other algorithm within the `sklearn` library, we need to:

1. Import the `Class`
2. Create the `instance`
3. `fit()` the numbers of the mathematical equation
4. `predict/transform` the data with the mathematical equation

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_crashes)
data = scaler.transform(df_crashes)
data[:5]
```

```
array([[0.47921847, 0.55636883],
       [0.34718769, 0.45684192],
       [0.42806394, 0.24636258],
       [0.50100651, 0.5323574 ],
       [0.20923623, 0.73980184]])
```

To better understand the information, let's convert the `array` into a `DataFrame`:

```python
df_scaled = pd.DataFrame(data, columns=df_crashes.columns, index=df_crashes.index)
df_scaled
```

`model_km.fit(X=df_scaled)`

KMeans(n_clusters=3)

We have a fitted `KMeans` again. Therefore, we should be able to apply the mathematical equation to the scaled data to get the predictions:

`model_km.predict(X=df_scaled)`

array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 0, 1, 0, 0], dtype=int32)

```python
df_pred['pred_km_scaled'] = model_km.predict(X=df_scaled)
df_pred
```

We can observe now that both `alcohol` and `ins_losses` are taken into account by the model to calculate the cluster a point belongs to.

`sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled', palette='Set1', data=df_pred);`

From now on, we should understand that every time a model calculates distances between variables of different numerical ranges, we need to scale the data to compare them properly.

The following figure gives an overview of everything that has happened so far:

```python
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 7))
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km', data=df_pred, palette='Set1', ax=ax1)
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled, data=df_scaled, palette='Set1', ax=ax2)
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km', data=df_pred, palette='Set1', ax=ax3)
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled, data=df_scaled, palette='Set1', ax=ax4)
ax3.set_xlim(0, 200)
ax3.set_ylim(0, 200)
ax4.set_xlim(0, 200)
ax4.set_ylim(0, 200)
ax1.set_title('KMeans w/ Original Data & Liar Plot')
ax2.set_title('KMeans w/ Scaled Data & Perspective Plot')
ax3.set_title('KMeans w/ Original Data & Perspective Plot')
ax4.set_title('KMeans w/ Scaled Data & Perspective Plot')
plt.tight_layout()
```

`Clustering` Models in Python

Visit the `sklearn` website to see how many different clustering methods there are and how they differ from each other.

Let's **pick two new models** and compute them:

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

```python
from sklearn.cluster import AgglomerativeClustering

model_ac = AgglomerativeClustering(n_clusters=3)
model_ac.fit(df_scaled)
```

AgglomerativeClustering(n_clusters=3)

`model_ac.fit_predict(X=df_scaled)`

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1])

`df_pred['pred_ac'] = model_ac.fit_predict(X=df_scaled)df_pred`

We can observe how the second group contains three points with the Agglomerative Clustering, while the KMeans gathers five points in its second group.

As they are different algorithms, they are expected to produce different results. If you'd like to understand which model you should use, you need to know how each algorithm works. We don't explain that in this series because we want to keep it simple.

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled', data=df_pred, palette='Set1', ax=ax1)
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1', ax=ax2)
ax1.set_title('KMeans')
ax2.set_title('Agglomerative Clustering');
```

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

```python
from sklearn.cluster import SpectralClustering

model_sc = SpectralClustering(n_clusters=3)
model_sc.fit(df_scaled)
```

SpectralClustering(n_clusters=3)

`model_sc.fit_predict(X=df_scaled)`

array([0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 0, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 2], dtype=int32)

```python
df_pred['pred_sc'] = model_sc.fit_predict(X=df_scaled)
df_pred
```

Let's visualize all the models together and appreciate the minor differences in how they cluster the groups.

```python
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.set_title('KMeans')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled', data=df_pred, palette='Set1', ax=ax1)
ax2.set_title('Agglomerative Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1', ax=ax2)
ax3.set_title('Spectral Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_sc', data=df_pred, palette='Set1', ax=ax3);
```

Once again, you don't need to know the maths behind every Machine Learning model to build them. However, I hope you are getting a sense of the patterns behind the Scikit-Learn library with this series of tutorials.

Let's arbitrarily choose the Agglomerative Clustering as our model and get back to you being the President of the USA. How would you describe the groups?

1. Higher `ins_losses` and lower `alcohol`
2. Lower `ins_losses` and lower `alcohol`
3. Lower `ins_losses` and higher `alcohol`

`sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1');`

You would create different messages in the TV campaigns for the three groups separately and avoid deploying many more resources to develop fifty-one different TV campaigns (one for each State), which doesn't make sense because many of them are similar.


Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset containing daily information (rows) on Tesla Stock transactions (columns) in the Stock Market.

```python
import pandas as pd

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tesla = pd.read_csv(url, index_col=0, parse_dates=['Date'])
df_tesla
```

You may calculate the `.mean()` of each column by the last Business day of each Month (`BM`):

`df_tesla.resample('BM').mean()`

Or the Weekly Average:

`df_tesla.resample('W-FRI').mean()`

And many more; see the full list here.

Pretty straightforward compared to other libraries and programming languages.

It's no coincidence that people call Python the language of the future: its libraries simplify many operations where most people believe they would have needed a `for` loop.

Let's apply other pandas techniques to the DateTime object:

```python
df_tesla['year'] = df_tesla.index.year
df_tesla['month'] = df_tesla.index.month
```

The following values represent the average Close price by each month-year combination:

`df_tesla.pivot_table(index='year', columns='month', values='Close', aggfunc='mean').round(2)`

We could even style it to get a better insight by colouring the cells:

```python
df_stl = df_tesla.pivot_table(
    index='year', columns='month', values='Close',
    aggfunc='mean', fill_value=0
).style.format('{:.2f}').background_gradient(axis=1)
df_stl
```

And the following represents the volatility using the standard deviation:

```python
df_stl = df_tesla.pivot_table(
    index='year', columns='month', values='Close',
    aggfunc='std', fill_value=0
).style.format('{:.2f}').background_gradient(axis=1)
df_stl
```

In this article, we'll dig into the details of pandas' DateTime-related objects in Python to understand the knowledge required to come up with calculations like the ones we saw above.

First, let's reload the dataset to start from the basics.

```python
df_tesla = pd.read_csv(url, parse_dates=['Date'])
df_tesla
```

An essential part of learning something is practising it and understanding counterexamples where we can study the errors.

Let's go with some basic thinking to understand the importance of the DateTime object and how to work with it. Out of all the columns in the DataFrame, we'll now focus on `Date`:

`df_tesla.Date`

```
0      2017-01-03
1      2017-01-04
          ...    
1378   2022-06-24
1379   2022-06-27
Name: Date, Length: 1380, dtype: datetime64[ns]
```

What information could we get from a `DateTime` object?

- We may think we can get the month, but it turns out we can't in the following manner:

`df_tesla.Date.month`

```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [53], in <cell line: 1>()
----> 1 df_tesla.Date.month

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'month'
```

Programming exists to simplify our lives, not make them harder.

If you think there must be a simpler way to perform a certain operation, someone has probably developed a simpler functionality. Therefore, don't limit programming applications to complex ideas and rush towards a `for` loop, for example; proceed through trial and error without losing hope.

In short, we need to go through the `dt` accessor to reach the `DateTime` functions:

`df_tesla.Date.dt`

`<pandas.core.indexes.accessors.DatetimeProperties object at 0x16230a2e0>`

`df_tesla.Date.dt.month`

```
0       1
1       1
       ..
1378    6
1379    6
Name: Date, Length: 1380, dtype: int64
```

We can use more elements than just `.month`:

`df_tesla.Date.dt.month_name()`

```
0       January
1       January
         ...   
1378       June
1379       June
Name: Date, Length: 1380, dtype: object
```

`df_tesla.Date.dt.isocalendar()`

`df_tesla.Date.dt.quarter`

```
0       1
1       1
       ..
1378    2
1379    2
Name: Date, Length: 1380, dtype: int64
```

`df_tesla.Date.dt.to_period('M')`

```
0       2017-01
1       2017-01
         ...   
1378    2022-06
1379    2022-06
Name: Date, Length: 1380, dtype: period[M]
```

`df_tesla.Date.dt.to_period('W-FRI')`

```
0       2016-12-31/2017-01-06
1       2016-12-31/2017-01-06
                ...          
1378    2022-06-18/2022-06-24
1379    2022-06-25/2022-07-01
Name: Date, Length: 1380, dtype: period[W-FRI]
```

Pandas contains functionality that allows us to attach Time Zones to the objects, easing the work with data from different countries and regions.

Before getting deeper into Time Zones, we need to set the `Date` as the `index` (rows) of the `DataFrame`:

```python
df_tesla.set_index('Date', inplace=True)
df_tesla
```

We can tell Python the `DateTimeIndex` of the `DataFrame` comes from Madrid:

```python
df_tesla.index = df_tesla.index.tz_localize('Europe/Madrid')
df_tesla
```

And **change** it to another Time Zone, like **Moscow**:

`df_tesla.index.tz_convert('Europe/Moscow')`

```
DatetimeIndex(['2017-01-03 02:00:00+03:00', '2017-01-04 02:00:00+03:00',
               '2017-01-05 02:00:00+03:00', '2017-01-06 02:00:00+03:00',
               ...
               '2022-06-22 01:00:00+03:00', '2022-06-23 01:00:00+03:00',
               '2022-06-24 01:00:00+03:00', '2022-06-27 01:00:00+03:00'],
              dtype='datetime64[ns, Europe/Moscow]', name='Date', length=1380, freq=None)
```

We could have applied the transformation in the `DataFrame` object itself:

`df_tesla.tz_convert('Europe/Moscow')`

We can observe the hour has changed accordingly.

The **Pandas Time Zone** functionality is useful for combining timed data from different regions around the globe.
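A minimal sketch of why this matters, using two hypothetical timestamps recorded in different offices: once localized, converting both to UTC makes them directly comparable.

```python
import pandas as pd

# Hypothetical wall-clock times recorded in two offices.
madrid = pd.Timestamp('2022-06-27 10:00').tz_localize('Europe/Madrid')  # CEST, UTC+2
moscow = pd.Timestamp('2022-06-27 11:00').tz_localize('Europe/Moscow')  # MSK, UTC+3

# Both correspond to the same instant once expressed in UTC.
print(madrid.tz_convert('UTC'))  # 2022-06-27 08:00:00+00:00
print(moscow.tz_convert('UTC'))  # 2022-06-27 08:00:00+00:00
print(madrid == moscow)          # True
```

Tz-aware timestamps compare by instant, not by wall-clock time, which is what lets pandas align data from different regions correctly.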

To summarise, for example, the information of daily operations into months, we can apply different functions, each with its own unique ability (it's up to you to select the one that suits your needs):

- `.groupby()`
- `.resample()`
- `.pivot_table()`

Let's show some examples:

`df_tesla.groupby(by=df_tesla.index.year).Volume.sum()`

```
Date
2017     7950157000
2018    10808194000
2019    11540242000
2020    19052912400
2021     6902690500
2022     3407576732
Name: Volume, dtype: int64
```

The function `.groupby()` packs the rows of the same year:

`df_tesla.groupby(by=df_tesla.index.year)`

`<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1622eecd0>`

To later summarise the total volume in each pack as we saw before.

An easier way?

`df_tesla.Volume.resample('Y').sum()`

```
Date
2017-12-31 00:00:00+01:00     7950157000
2018-12-31 00:00:00+01:00    10808194000
2019-12-31 00:00:00+01:00    11540242000
2020-12-31 00:00:00+01:00    19052912400
2021-12-31 00:00:00+01:00     6902690500
2022-12-31 00:00:00+01:00     3407576732
Freq: A-DEC, Name: Volume, dtype: int64
```

We first select the column in which we want to apply the operation:

`df_tesla.Volume`

```
Date
2017-01-03 00:00:00+01:00    29616500
2017-01-04 00:00:00+01:00    56067500
                               ...   
2022-06-24 00:00:00+02:00    31866500
2022-06-27 00:00:00+02:00    21237332
Name: Volume, Length: 1380, dtype: int64
```

And apply the `.resample()` function, which takes a Date Offset to aggregate the `DateTimeIndex`. In this example, we aggregate by year `'Y'`:

`df_tesla.Volume.resample('Y')`

`<pandas.core.resample.DatetimeIndexResampler object at 0x16230abe0>`

And apply mathematical operations to the aggregated objects separately as we saw before:

`df_tesla.Volume.resample('Y').sum()`

```
Date
2017-12-31 00:00:00+01:00     7950157000
2018-12-31 00:00:00+01:00    10808194000
2019-12-31 00:00:00+01:00    11540242000
2020-12-31 00:00:00+01:00    19052912400
2021-12-31 00:00:00+01:00     6902690500
2022-12-31 00:00:00+01:00     3407576732
Freq: A-DEC, Name: Volume, dtype: int64
```
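On a small synthetic series we can check that the two routes agree: grouping by `index.year` and resampling with a yearly offset produce the same totals.

```python
import pandas as pd

# Four hypothetical daily volumes straddling a year boundary.
idx = pd.date_range('2021-12-30', periods=4, freq='D')
volume = pd.Series([1, 2, 3, 4], index=idx)

by_group = volume.groupby(volume.index.year).sum()   # 2021 -> 3, 2022 -> 7
by_resample = volume.resample('Y').sum()             # same totals, year-end index
print(by_group.tolist())     # [3, 7]
print(by_resample.tolist())  # [3, 7]
```

The difference is the resulting index: `.groupby()` keys by the bare year, while `.resample()` keeps a `DatetimeIndex` of period-end timestamps.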

We could have also calculated the `.sum()` for all the columns if we hadn't selected just the `Volume`:

`df_tesla.resample('Y').sum()`

As always, we should strive to represent the information in the clearest manner for anyone to understand. Therefore, we could even visualize the aggregated volume by year with two more words of code:

`df_tesla.Volume.resample('Y').sum().plot.bar();`

Let's now try different Date Offsets:

`df_tesla.Volume.resample('M').sum()`

```
Date
2017-01-31 00:00:00+01:00    503398000
2017-02-28 00:00:00+01:00    597700000
                               ...    
2022-05-31 00:00:00+02:00    649407200
2022-06-30 00:00:00+02:00    572380932
Freq: M, Name: Volume, Length: 66, dtype: int64
```

`df_tesla.Volume.resample('M').sum().plot.line();`

`df_tesla.Volume.resample('W').sum()`

```
Date
2017-01-08 00:00:00+01:00    142882000
2017-01-15 00:00:00+01:00    105867500
                               ...    
2022-06-26 00:00:00+02:00    141234200
2022-07-03 00:00:00+02:00     21237332
Freq: W-SUN, Name: Volume, Length: 287, dtype: int64
```

`df_tesla.Volume.resample('W').sum().plot.area();`

`df_tesla.Volume.resample('W-FRI').sum()`

```
Date
2017-01-06 00:00:00+01:00    142882000
2017-01-13 00:00:00+01:00    105867500
                               ...    
2022-06-24 00:00:00+02:00    141234200
2022-07-01 00:00:00+02:00     21237332
Freq: W-FRI, Name: Volume, Length: 287, dtype: int64
```

`df_tesla.Volume.resample('W-FRI').sum().plot.line();`

`df_tesla.Volume.resample('Q').sum()`

```
Date
2017-03-31 00:00:00+02:00    1636274500
2017-06-30 00:00:00+02:00    2254740000
                                ...    
2022-03-31 00:00:00+02:00    1678802000
2022-06-30 00:00:00+02:00    1728774732
Freq: Q-DEC, Name: Volume, Length: 22, dtype: int64
```

`df_tesla.Volume.resample('Q').sum().plot.bar();`

We can also use Pivot Tables to summarise and represent the information more nicely:

```python
df_res = df_tesla.pivot_table(
    index=df_tesla.index.month,
    columns=df_tesla.index.year,
    values='Volume',
    aggfunc='sum'
)
df_res
```

And even apply some style to get more insight into the DataFrame:

```python
df_tesla['Volume_M'] = df_tesla.Volume / 1_000_000
df_res = df_tesla.pivot_table(
    index=df_tesla.index.month,
    columns=df_tesla.index.year,
    values='Volume_M',
    aggfunc='sum'
)
df_stl = df_res.style.format('{:.2f}').background_gradient('Reds', axis=1)
df_stl
```

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

We have already covered:

- Regression Models
- Classification Models
- Train Test Split for Model Selection

In short, we have computed all possible types of models to predict numerical and categorical variables with Regression and Classification models, respectively.

However, it is not enough to compute one model; we need to compare different models to choose the one whose predictions are the closest to reality.

Nevertheless, we cannot evaluate the model on the same data we used to `.fit()` (train) the mathematical equation (model). Therefore, we need to separate the data into train and test sets: the former to train the model, the latter to evaluate it.

We add an extra layer of complexity because we can improve a model (an algorithm) by configuring its parameters. This chapter is about **computing different combinations of a single model's hyperparameters** to get the best.

The goal of this dataset is:

- To predict whether the **bank's customers** (rows) will `default` next month
- Based on their **socio-demographic characteristics** (columns)

```python
import pandas as pd

pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_credit = pd.read_excel(io=url, header=1, index_col=0)
df_credit.sample(10)
```

The function `.fit()` needs all the cells in the DataFrame to contain a value; otherwise, it won't know how to process a row and compare it to the others. NaN means "Not a Number" (i.e., a cell for which we don't have any information).

`df_credit.isna().sum()`

```
LIMIT_BAL                     0
SEX                           0
                             ..
PAY_AMT6                      0
default payment next month    0
Length: 24, dtype: int64
```

`df_credit.isna().sum().sum()`

0

The function `.fit()` needs the values to be numeric; otherwise, Python won't know where on the axes to allocate each point.

Therefore, categories of the categorical columns will be transformed into new columns (one new column per category) and contain 1s and 0s depending on whether the person is or is not in the category.

Nevertheless, **we don't need to create dummy variables** because the data contains numerical variables only.
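For reference, this is what dummy-variable creation would look like on a hypothetical categorical column (`education` is made up for illustration), using `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical categorical column to illustrate dummy variables.
df_example = pd.DataFrame({'education': ['university', 'high_school', 'university']})

# One new 0/1 column per category replaces the original column.
dummies = pd.get_dummies(df_example, columns=['education'])
print(dummies.columns.tolist())
# ['education_high_school', 'education_university']
```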

So far, we have used the naming standard of **target** and **features**. Nevertheless, the most common standards on the Internet are **X** and **y**. Let's get used to it:

```python
y = df_credit.iloc[:, -1]
X = df_credit.iloc[:, :-1]
```

From the previous chapter, we should already know we need to separate the data into train and test if we want to evaluate the model's predictive capability for data we don't know yet.

In our case, we'd like to predict if new credit card customers won't commit default in the next month. As we don't have the data for the next month (it's the future), we need to apply the function `train_test_split()`.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```

`DecisionTreeClassifier()` with Default Hyperparameters

To compute a Machine Learning model with the **default hyperparameters**, we apply the same procedure we have covered in previous chapters:

```python
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier()

We can see the model is almost perfect at predicting the training data (99% accuracy). Nevertheless, it is terrible at predicting the test data (72% accuracy). This phenomenon tells us that the model is **overfitting**.

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.9995024875621891

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.7265656565656565

I'll use the following visualization to explain the concept of overfitting.

```python
from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);
```

The tree is big because we have a lot of people (20,100), and we haven't set any limit on the model.

How many people do you think we have in the deepest leaf?

- Very few, probably one.

Are these people characteristic of the overall data? Or are they infrequent?

- Because they are infrequent and the model is very complex, we are incurring overfitting, and we get a vast difference between train and test accuracies.
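The same effect can be reproduced on a small synthetic dataset (made with `make_classification`, a stand-in for the credit data): a fully grown tree memorises the noisy training set, while a depth-limited tree keeps the train/test gap much smaller.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (30% of labels flipped).
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)          # no limits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

gap_deep = deep.score(X_tr, y_tr) - deep.score(X_te, y_te)
gap_shallow = shallow.score(X_tr, y_tr) - shallow.score(X_te, y_te)
print(gap_deep, gap_shallow)  # the unconstrained tree shows the larger gap
```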

`DecisionTreeClassifier()` with Custom Hyperparameters

Which hyperparameters can we configure for the Decision Tree algorithm?

In the output below, we can see configurable parameters such as `max_depth`, `criterion` and `min_samples_leaf`, among others.

```python
model = DecisionTreeClassifier()
model.get_params()
```

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}

Let's apply different random configurations to see how the model's accuracy changes on the train and test sets.

Please pay attention to how the accuracies become similar when we reduce the model's complexity (we make the tree shorter and more general so that it captures more people in the leaves).

And remember that we should pick up a good configuration based on the test accuracy.

```python
model_dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.8186567164179105

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.8215151515151515

`plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);`


```python
model_dt = DecisionTreeClassifier(max_depth=3)
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier(max_depth=3)

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.8207960199004976

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.8222222222222222

`plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);`

```python
model_dt = DecisionTreeClassifier(max_depth=4)
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier(max_depth=4)

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.8232338308457712

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.8205050505050505

`plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);`

```python
model_dt = DecisionTreeClassifier(min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier(min_samples_leaf=100)

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.8244278606965174

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.8161616161616162

`plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);`

```python
model_dt = DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
model_dt.fit(X_train, y_train)
```

DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)

Accuracy on the `train` data:

`model_dt.score(X_train, y_train)`

0.8237313432835821

Accuracy on the `test` data:

`model_dt.score(X_test, y_test)`

0.8177777777777778

`plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);`

We get similar results; the accuracy stays around 82% on the test set whenever we configure a general model that doesn't have a considerable depth (unlike the first one).

But we should ask ourselves another question: can we do this process of automatically checking multiple combinations of hyperparameters?

- Yes, and that's where **Cross Validation** comes in.

`GridSearchCV()` to find the Best Hyperparameters

The Cross-Validation technique splits the training data into n folds (5 in the image below). Then, it computes each hyperparameter configuration n times, where each fold is taken as the test set once.

Consider that we `.fit()` a model as many times as the number of folds multiplied by the number of combinations we want to try.

Out of the Decision Tree hyperparameters:

```python
model_dt = DecisionTreeClassifier()
model_dt.get_params()
```

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}

We want to try the following combinations of `max_depth` (6), `min_samples_leaf` (7) and `criterion` (2):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy']
}
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=1)
```

That makes up to 420 calls to the function `.fit()`:

`5*6*7*2`

420

To compare 84 different combinations of the Decision Tree hyperparameters:

`6*7*2`

84

`cv_dt.fit(X_train, y_train)`

Fitting 5 folds for each of 84 candidates, totalling 420 fits

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 3, 4, 5, 10], 'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]}, verbose=1)

If we specify `verbose=2`, we will see how many fits we perform in the output:

```python
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=2)
cv_dt.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 84 candidates, totalling 420 fits
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time=   0.2s
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time=   0.2s
...
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time=   0.1s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time=   0.1s
```

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 3, 4, 5, 10], 'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]}, verbose=2)

The model with the best hyperparameter configuration is:

`cv_dt.best_estimator_`

DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=100)

To achieve an accuracy on the test set of:

`cv_dt.score(X_test, y_test)`

0.8186868686868687

If we'd like to have the results of every configuration:

```python
df_cv_dt = pd.DataFrame(cv_dt.cv_results_)
df_cv_dt
```

Now let's try to find the best hyperparameter configuration of other models, which don't have the same hyperparameters as the Decision Tree because their algorithm and mathematical equation are different.

`SVC()`

Before computing the Support Vector Machines model, we need to scale the data because this model compares the distance between the explanatory variables. Therefore, they all need to be on the same scale.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```

We need to separate the data again to have the train and test sets with the scaled data:

```python
X_norm_train, X_norm_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42)
```

The Support Vector Machines contain the following hyperparameters:

```python
from sklearn.svm import SVC

sv = SVC()
sv.get_params()
```

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}

From which we want to try the following combinations:

```python
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}
cv_sv = GridSearchCV(estimator=sv, param_grid=param_grid, verbose=2)
cv_sv.fit(X_norm_train, y_train)
```

```
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ...............................C=0.1, kernel=linear; total time=   3.0s
[CV] END ...............................C=0.1, kernel=linear; total time=   3.0s
...
[CV] END ...................................C=10, kernel=rbf; total time=   5.3s
[CV] END ...................................C=10, kernel=rbf; total time=   5.3s
```

GridSearchCV(estimator=SVC(), param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, verbose=2)

We should notice that some fits take almost 5 seconds, which is very expensive if we want to try thousands of combinations (as professionals do). Therefore, we should know how the model's algorithm works inside to choose a good hyperparameter configuration that doesn't take much time. Otherwise, we make the company spend a lot of money on computing power.

This tutorial dissects how the Support Vector Machines algorithm works inside.

The best hyperparameter configuration is:

`cv_sv.best_params_`

{'C': 10, 'kernel': 'rbf'}

To achieve an accuracy on the test set of:

`cv_sv.score(X_norm_test, y_test)`

0.8185858585858586

If we'd like to have the results of every configuration:

```python
df_cv_sv = pd.DataFrame(cv_sv.cv_results_)
df_cv_sv
```

`KNeighborsClassifier()`

Now we'll compute another classification model: K Nearest Neighbours.

We check for its hyperparameters:

```python
from sklearn.neighbors import KNeighborsClassifier

model_kn = KNeighborsClassifier()
model_kn.get_params()
```

{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}

To choose the following combinations:

```python
param_grid = {
    'leaf_size': [10, 20, 30, 50],
    'metric': ['minkowski', 'euclidean', 'manhattan'],
    'n_neighbors': [3, 5, 10, 20]
}

cv_kn = GridSearchCV(estimator=model_kn, param_grid=param_grid, verbose=2)
cv_kn.fit(X_norm_train, y_train)
```

Fitting 5 folds for each of 48 candidates, totalling 240 fits [CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.5s [CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.3s ... [CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s [CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s

GridSearchCV(estimator=KNeighborsClassifier(), param_grid={'leaf_size': [10, 20, 30, 50], 'metric': ['minkowski', 'euclidean', 'manhattan'], 'n_neighbors': [3, 5, 10, 20]}, verbose=2)

The best hyperparameter configuration is:

`cv_kn.best_params_`

{'leaf_size': 10, 'metric': 'minkowski', 'n_neighbors': 20}

To achieve an accuracy on the test set of:

`cv_kn.score(X_norm_test, y_test)`

0.8185858585858586

If we'd like to have the results of every configuration:

```python
df_cv_kn = pd.DataFrame(cv_kn.cv_results_)
df_cv_kn
```

Comparing the best version of each algorithm, the Decision Tree Classifier comes out on top:

```python
dic_results = {
    'model': [
        cv_dt.best_estimator_,
        cv_sv.best_estimator_,
        cv_kn.best_estimator_
    ],
    'hyperparameters': [
        cv_dt.best_params_,
        cv_sv.best_params_,
        cv_kn.best_params_
    ],
    'score': [
        cv_dt.score(X_test, y_test),
        cv_sv.score(X_norm_test, y_test),
        cv_kn.score(X_norm_test, y_test)
    ]
}

df_cv_comp = pd.DataFrame(dic_results)
df_cv_comp.style.background_gradient()
```


Machine Learning models learn a mathematical equation from historical data.

Not all Machine Learning models predict the same way; some models are better than others.

We measure how good a model is by calculating its score (accuracy).

So far, we have calculated the model's score using the same data to fit (train) the mathematical equation. That's cheating. That's overfitting.

This tutorial compares 3 different models:

- Decision Tree
- Logistic Regression
- Support Vector Machines

We validate the models in 2 different ways:

- Using the same data during training
- Using 30% of the data; not used during training

This demonstrates how the selection of the best model changes when we validate with data not used during training.

For example, the image below shows that the best model, when using the same data for validation, is the Decision Tree (0.86 accuracy). Nevertheless, everything changes when the model is evaluated with data not used during training: the best model becomes the Logistic Regression (0.85 accuracy), whereas the Decision Tree only reaches 0.80.

If we were a bank losing 1M USD for every 0.01 of accuracy lost, we would have lost 5M USD. This is something that happens in real life.

In short, banks are interested in good models to predict new potential customers. Not historical customers who have already gotten a loan and the bank knows if they were good to pay or not.

This tutorial shows you how to implement the `train_test_split` technique to reduce overfitting with a practical use case where we want to classify whether a person used the Internet or not.

Load the dataset from CIS, executing the following lines of code:

```python
import pandas as pd #!

df_internet = pd.read_excel('https://github.com/jsulopzs/data/blob/main/internet_usage_spain.xlsx?raw=true',
                            sheet_name=1, index_col=0)
df_internet
```

The goal of this dataset is to predict the `internet_usage` of **people** (rows) based on their **socio-demographical characteristics** (columns).

We should already know from the previous chapter that the data might be preprocessed before passing it to the function that computes the mathematical equation.

The function `.fit()` needs all the cells in the DataFrame to contain a value. `NaN` means "Not a Number" (i.e., a cell for which we don't have any information). Otherwise, it won't know how to process the row and compare it to others.

For example, if you miss John's age, you cannot place John in the space to compare with other people because the point might be anywhere.

`df_internet.isna().sum()`

```
internet_usage    0
sex               0
age               0
education         0
dtype: int64
```

The function `.fit()` needs the values to be numeric. Otherwise, Python won't know the position of the axes in which to allocate the point. For example, if you have *Male* and *Female*, at which distance do you separate them, and why? You cannot make an objective assessment unless you separate each category.

Therefore, categories of the categorical columns will be transformed into new columns (one new column per category) and contain 1s and 0s depending on whether the person is or is not in the category.
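As a minimal illustration of this encoding (on a made-up toy DataFrame, not the survey data): `pd.get_dummies` turns each category into a 0/1 column, and `drop_first=True` keeps one column fewer, since the dropped category is implied by all-zeros.

```python
import pandas as pd

# Toy data: one categorical column with two categories
df_toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female', 'Male']})

# 'Female' (first category alphabetically) is dropped;
# sex_Male = 1 means Male, sex_Male = 0 means Female
df_dummies = pd.get_dummies(df_toy, drop_first=True)
print(df_dummies)
```

The same call is then applied to the real dataset.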

```python
df_internet = pd.get_dummies(df_internet, drop_first=True)
df_internet
```

Once we have preprocessed the data, we select the column we want to predict (target) and the columns we will use to explain the prediction (features/explanatory).

```python
target = df_internet.internet_usage
features = df_internet.drop(columns='internet_usage')
```

We should already know that the Machine Learning procedure is the same all the time:

- Computing a mathematical equation: **fit**
- To calculate predictions: **predict**
- And compare them to reality: **score**

The only element that changes is the `Class()` that contains the lines of code of a specific algorithm (DecisionTreeClassifier, SVC, LogisticRegression).

`DecisionTreeClassifier()` Model in Python

```python
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

`0.859877800407332`

`SVC()` Model in Python

```python
from sklearn.svm import SVC

model_svc = SVC(probability=True)
model_svc.fit(X=features, y=target)
model_svc.score(X=features, y=target)
```

`0.7837067209775967`

`LogisticRegression()` Model in Python

```python
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X=features, y=target)
model_lr.score(X=features, y=target)
```

`0.8334012219959267`

- We repeated all the time the same code:

```python
model.fit()
model.score()
```

- Why not turn the lines into a `function()` to **automate the process**?

```python
calculate_accuracy(model_dt)
calculate_accuracy(model_sv)
calculate_accuracy(model_lr)
```

To calculate the `accuracy` of a `DecisionTreeClassifier()`:

```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

`0.859877800407332`

`function()` **Code Thinking**

- Think of the function's `result`
- Store that `object` into a variable
- `return` the `result` at the end
- **Indent the body** of the function to the right
- `def`ine the `function():`
- Think of what's gonna change when you execute the function with `different models`
- Locate the `variable` that you will change
- Turn it into the `parameter` of the `function()`

```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)
```

`0.859877800407332`

Think of the `result` you want and put it into a variable:

```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target) #new
```

`return` to tell the function the object you want in the end:

```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)
return result #new
```

```python
    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result
```

```python
def calculate_accuracy(): #new
    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result
```

```python
def calculate_accuracy(model_dt): #modified
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result
```

```python
def calculate_accuracy(model): #modified
    model.fit(X=features, y=target) #modified
    result = model.score(X=features, y=target)
    return result
```

```python
def calculate_accuracy(model):
    """
    This function calculates the accuracy
    for a given model passed as a parameter #modified
    """
    model.fit(X=features, y=target)
    result = model.score(X=features, y=target)
    return result
```

`calculate_accuracy(model_dt)`

`0.859877800407332`

`DecisionTreeClassifier()` Accuracy

```python
calculate_accuracy(model_dt)
```

`0.859877800407332`

We shall create an empty dictionary that keeps track of every model's score to choose the best one later.

```python
dic_accuracy = {}
dic_accuracy['Decision Tree'] = calculate_accuracy(model_dt)
```

`SVC()` Accuracy

```python
dic_accuracy['Support Vector Machines'] = calculate_accuracy(model_svc)
dic_accuracy
```

`{'Decision Tree': 0.859877800407332, 'Support Vector Machines': 0.7837067209775967}`

`LogisticRegression()` Accuracy

```python
dic_accuracy['Logistic Regression'] = calculate_accuracy(model_lr)
dic_accuracy
```

`{'Decision Tree': 0.859877800407332, 'Support Vector Machines': 0.7837067209775967, 'Logistic Regression': 0.8334012219959267}`

The Decision Tree is the best model, with a score of 86%:

```python
sr_accuracy = pd.Series(dic_accuracy).sort_values(ascending=False)
sr_accuracy
```

```
Decision Tree              0.859878
Logistic Regression        0.833401
Support Vector Machines    0.783707
dtype: float64
```

Let's suppose for a moment we are a bank to understand the importance of this chapter. A bank's business is, among other things, to give loans to people who can afford it.

Although the bank may make mistakes: giving loans to people who cannot afford them, or denying loans to people who can.

Let's imagine the bank loses 1M for each 1% of misclassification. As we chose the Decision Tree, the bank would lose around 14M, as the score suggests. Nevertheless, can we trust the score of 86%?

No, because we are cheating the model's evaluation; we evaluated the models with the same data used for training. In other words, the bank is not interested in evaluating the model of the historical customers; they want to know how good the model is for new customers.

They cannot create new customers. What can they do then?

They separate the data into a train set (70% of customers) used to `.fit()` the mathematical equation and a test set (30% of customers) to evaluate the mathematical equation.

You can understand the problem better with the following analogy:

Let's **imagine**:

- You have a `math exam` on Saturday
- Today is Monday
- You want to **calibrate your level in case you need to study more** for the math exam
- How do you calibrate your `math level`?
- Well, you've got **100 questions** from past years' exams (`X`) with 100 solutions (`y`)
- You may study the 100 questions with the 100 solutions: `fit(100questions, 100solutions)`
- Then, you may do a `mock exam` with the 100 questions: `predict(100questions)`
- And compare `your_100solutions` with the `real_100solutions`
- You've got **90/100 correct answers** (`accuracy`) in the mock exam
- You think you are **prepared for the maths exam**
- And when you do **the real exam on Saturday, the mark is 40/100**
- Why? How could we have prevented this?

**Solution**: separate the 100 questions into `70 for train` to study & `30 for test` for the mock exam.

- fit(70questions, 70answers)
- your_30solutions = predict(30questions)
- your_30solutions ?= 30solutions

`train_test_split()` the Data

The documentation of the function contains a typical example.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=42)
```

From all the data:

- 2455 rows
- 8 columns

`df_internet`

- 1718 rows (70% of all data) to fit the model
- 7 columns (X: features variables)

`X_train`

- 737 rows (30% of all data) to evaluate the model
- 7 columns (X: features variables)

`X_test`

- 1718 rows (70% of all data) to fit the model
- 1 column (y: target variable)

`y_train`

```
name
Eileen     0
Lucinda    1
          ..
Corey      0
Robert     1
Name: internet_usage, Length: 1718, dtype: int64
```

- 737 rows (30% of all data) to evaluate the model
- 1 column (y: target variable)

`y_test`

```
name
Thomas     0
Pedro      1
          ..
William    1
Charles    1
Name: internet_usage, Length: 737, dtype: int64
```
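The row counts of the splits follow from simple arithmetic; here is a sketch, assuming (as scikit-learn's implementation does) that the test share is rounded up with `ceil`:

```python
import math

n_rows = 2455      # total people in df_internet
test_size = 0.30   # share held out for evaluation

n_test = math.ceil(n_rows * test_size)   # rows to evaluate the model
n_train = n_rows - n_test                # rows to fit the model
print(n_train, n_test)
```

Checking these numbers against `len(X_train)` and `len(X_test)` is a quick sanity test after any split.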

`fit()` the model with Train Data

```python
model_dt.fit(X_train, y_train)
```

`DecisionTreeClassifier()`

`model_dt.score(X_test, y_test)`

`0.8046132971506106`

`DecisionTreeClassifier()`

```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
model_dt.score(X_test, y_test)
```

`0.8032564450474898`
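Notice the two Decision Tree scores differ slightly (0.8046 vs 0.8033): the tree chooses randomly among equally good splits, so refitting can change the result. A minimal sketch of how to pin this down with `random_state`, using synthetic data rather than the survey dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to demonstrate reproducibility
X, y = make_classification(n_samples=200, random_state=0)

# Fixing random_state makes two fits build the same tree
score_a = DecisionTreeClassifier(random_state=42).fit(X, y).score(X, y)
score_b = DecisionTreeClassifier(random_state=42).fit(X, y).score(X, y)
print(score_a == score_b)
```

Without `random_state`, small score fluctuations between runs are expected and harmless.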

`function()` **Code Thinking**

- Think of the function's `result`
- Store that `object` into a variable
- `return` the `result` at the end
- **Indent the body** of the function to the right
- `def`ine the `function():`
- Think of what's gonna change when you execute the function with `different models`
- Locate the `variable` that you will change
- Turn it into the `parameter` of the `function()`

```python
def calculate_accuracy_test(model):
    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)
    return result
```

`DecisionTreeClassifier()` Accuracy

```python
dic_accuracy_test = {}
dic_accuracy_test['Decision Tree'] = calculate_accuracy_test(model_dt)
dic_accuracy_test
```

`{'Decision Tree': 0.8032564450474898}`

`SVC()` Accuracy

```python
dic_accuracy_test['Support Vector Machines'] = calculate_accuracy_test(model_svc)
dic_accuracy_test
```

`{'Decision Tree': 0.8032564450474898, 'Support Vector Machines': 0.7788331071913162}`

`LogisticRegression()` Accuracy

```python
dic_accuracy_test['Logistic Regression'] = calculate_accuracy_test(model_lr)
dic_accuracy_test
```

`{'Decision Tree': 0.8032564450474898, 'Support Vector Machines': 0.7788331071913162, 'Logistic Regression': 0.8548168249660787}`

What happens after `train_test_split()`?

The picture changes quite a lot: the bank is losing 20M due to the model we chose before, the Decision Tree, whose score on data that hasn't been seen during training (i.e., new customers) is 80%.

We should have chosen the Logistic Regression because it's the best model (85%) to predict new data and new customers.

In short, we lose 15M if we choose the Logistic Regression, which is better than the Decision Tree's loss of 20M. Those 5M can make a difference in my life 👀

```python
sr_accuracy_test = pd.Series(dic_accuracy_test).sort_values(ascending=False)
sr_accuracy_test
```

```
Logistic Regression        0.854817
Decision Tree              0.803256
Support Vector Machines    0.778833
dtype: float64
```
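The loss figures quoted above can be reproduced from the test scores; a sketch assuming the 1M-per-1%-of-misclassification rule used earlier:

```python
# Test-set accuracies from the tutorial
scores = {'Logistic Regression': 0.854817, 'Decision Tree': 0.803256}

# 1M lost for every 1% of misclassified customers
loss_millions = {name: round((1 - acc) * 100) for name, acc in scores.items()}
print(loss_millions)  # Logistic Regression ~15M, Decision Tree ~20M
```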

```python
df_accuracy = pd.DataFrame({
    'Same Data': sr_accuracy,
    'Test Data': sr_accuracy_test
})
df_accuracy.style.format('{:.2f}').background_gradient()
```

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information from countries (rows) considering socio-demographic and economic variables (columns).

```python
import plotly.express as px

df_countries = px.data.gapminder()
df_countries
```

Python contains 3 main libraries for Data Visualization:

- **Matplotlib** (Mathematical Plotting)
- **Seaborn** (High-Level, based on Matplotlib)
- **Plotly** (Animated Plots)

I love `plotly` because the visualizations are interactive; you may hover the mouse over the points to get information from them:

```python
df_countries_2007 = df_countries.query('year == 2007')

px.scatter(data_frame=df_countries_2007, x='gdpPercap', y='lifeExp',
           color='continent', hover_name='country', size='pop')
```

You can even animate the plots with a simple parameter. Click on play

PS: The following example is taken from the official plotly library website:

```python
px.scatter(df_countries, x="gdpPercap", y="lifeExp", animation_frame="year",
           animation_group="country", size="pop", color="continent",
           hover_name="country", log_x=True, size_max=55,
           range_x=[100, 100000], range_y=[25, 90])
```

In this article, we'll dig into the details of Data Visualization in Python to build up the required knowledge and develop awesome visualizations like the ones we saw before.

Matplotlib is a library used for Data Visualization.

We use the **sublibrary** (module) `pyplot` from the `matplotlib` library to access the functions.

`import matplotlib.pyplot as plt`

Let's make a bar plot:

`plt.bar(x=['Real Madrid', 'Barcelona', 'Bayern Munich'], height=[14, 5, 6]);`

We could have also done a point plot:

`plt.scatter(x=['Real Madrid', 'Barcelona', 'Bayern Munich'], y=[14, 5, 6]);`

But it doesn't make sense with the data we have represented.

Let's create a DataFrame:

```python
teams = ['Real Madrid', 'Barcelona', 'Bayern Munich']
uefa_champions = [14, 5, 6]

import pandas as pd

df_champions = pd.DataFrame(data={'Team': teams, 'UEFA Champions': uefa_champions})
df_champions
```

And visualize it using:

`plt.bar(x=df_champions['Team'], height=df_champions['UEFA Champions']);`

`df_champions.plot.bar(x='Team', y='UEFA Champions');`

Let's read another dataset: the Football Premier League classification for 2021/2022.

```python
df_premier = pd.read_excel(io='../data/premier_league.xlsx')
df_premier
```

We will visualize a point plot, from now on **scatter plot**, to check if there is a relationship between the number of goals scored `F` and the points `Pts`.

```python
import seaborn as sns

sns.scatterplot(x='F', y='Pts', data=df_premier);
```

Can we do the same plot with the matplotlib `plt` library?

`plt.scatter(x='F', y='Pts', data=df_premier);`

Which are the differences between them?

The points:

- `matplotlib` points are bigger than `seaborn` ones

The axis labels:

- `matplotlib` axis labels are non-existent, whereas `seaborn` places the names of the columns
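The missing labels in matplotlib are easy to add by hand; a minimal sketch (the stand-in data points are made up, only the label calls matter):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# matplotlib doesn't infer axis labels from column names, so we set them ourselves
fig, ax = plt.subplots()
ax.scatter([50, 60, 70], [40, 55, 70])  # stand-in for the F vs Pts data
ax.set_xlabel('F')
ax.set_ylabel('Pts')
```

This is exactly what seaborn does for you automatically when you pass column names.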

From which library do the previous functions return the objects?

`seaborn_plot = sns.scatterplot(x='F', y='Pts', data=df_premier);`

`matplotlib_plot = plt.scatter(x='F', y='Pts', data=df_premier);`

`type(seaborn_plot)`

matplotlib.axes._subplots.AxesSubplot

`type(matplotlib_plot)`

matplotlib.collections.PathCollection

Why does `seaborn` return a `matplotlib` object?

Quoted from the seaborn official website:

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level* interface for drawing attractive and informative statistical graphics.

*High-level means the communication between humans and the computer is easier to understand than low-level communication, which goes through 0s and 1s.

Could you place the names of the teams in the points?

```python
plt.scatter(x='F', y='Pts', data=df_premier)

for idx, data in df_premier.iterrows():
    plt.text(x=data['F'], y=data['Pts'], s=data['Team'])
```

It isn't straightforward.

Is there an easier way?

Yes, you may use an interactive plot with the `plotly` library and display the name of the Team as you hover the mouse on a point.

We use the `express` module within the `plotly` library to access the functions of the plots:

```python
import plotly.express as px

px.scatter(data_frame=df_premier, x='F', y='Pts', hover_name='Team')
```

Learn how to become an independent Data Analyst programmer who knows how to extract meaningful insights from Data Visualizations.

Let's read another dataset: the sociological data of clients in a restaurant.

```python
df_tips = sns.load_dataset(name='tips')
df_tips
```

`df_tips.sex`

```
0      Female
1        Male
        ...
242      Male
243    Female
Name: sex, Length: 244, dtype: category
Categories (2, object): ['Male', 'Female']
```

We need to summarise the data first; we count how many `Female`

and `Male`

people are in the dataset.

`df_tips.sex.value_counts()`

```
Male      157
Female     87
Name: sex, dtype: int64
```

`sr_sex = df_tips.sex.value_counts()`

Let's place bars equal to the number of people from each gender:

`px.bar(x=sr_sex.index, y=sr_sex.values)`

We can also colour the bars based on the category:

`px.bar(x=sr_sex.index, y=sr_sex.values, color=sr_sex.index)`

Let's put the same data into a pie plot:

`px.pie(names=sr_sex.index, values=sr_sex.values, color=sr_sex.index)`

`df_tips.total_bill`

```
0      16.99
1      10.34
       ...
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64
```

Instead of observing the numbers, we can visualize the distribution of the bills in a **histogram**.

We can observe that most people pay between 10 and 20 dollars. Whereas a few are between 40 and 50.

`px.histogram(x=df_tips.total_bill)`

We can also create a **boxplot** where the limits of the boxes indicate the 1st and 3rd quartiles.

The 1st quartile is 13.325, and the 3rd quartile is 24.175. Therefore, 50% of people were billed an amount between these limits.

`px.box(x=df_tips.total_bill)`
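The box limits can be double-checked numerically with pandas' `quantile`; a sketch on a small made-up series standing in for `df_tips.total_bill`:

```python
import pandas as pd

# Made-up bill amounts, for illustration only
bills = pd.Series([10.0, 12.0, 14.0, 20.0, 26.0, 30.0, 44.0, 50.0])

q1 = bills.quantile(0.25)  # lower edge of the box (1st quartile)
q3 = bills.quantile(0.75)  # upper edge of the box (3rd quartile)
print(q1, q3)
```

Half of the values fall between `q1` and `q3`, which is what the box in the plot shows.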

`df_tips[['total_bill', 'tip']]`

We use a scatter plot to see if a relationship exists between two numerical variables.

Do the points go up as you move the eyes from left to right?

As you may observe in the following plot: the higher the amount of the bill, the higher the tip the clients leave for the staff.

`px.scatter(x='total_bill', y='tip', data_frame=df_tips)`

Another type of visualization for 2 continuous variables:

`px.density_contour(x='total_bill', y='tip', data_frame=df_tips)`

`df_tips[['day', 'total_bill']]`

We can summarise the data around how much revenue was generated in each day of the week.

`df_tips.groupby('day').total_bill.sum()`

```
day
Thur    1096.33
Fri      325.88
Sat     1778.40
Sun     1627.16
Name: total_bill, dtype: float64
```

`sr_days = df_tips.groupby('day').total_bill.sum()`

We can observe that Saturday is the most profitable day as people have spent more money.

`px.bar(x=sr_days.index, y=sr_days.values)`

`px.bar(x=sr_days.index, y=sr_days.values, color=sr_days.index)`

`df_tips[['day', 'size']]`

Which combination of day-size is the most frequent table you can observe in the restaurant?

The following plot shows that Saturdays with 2 people at the table is the most common phenomenon at the restaurant.

They could create an advertisement that targets couples to have dinner on Saturdays and make more money.

`px.density_heatmap(x='day', y='size', data_frame=df_tips)`
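The heatmap's conclusion can also be verified numerically with a groupby count; a sketch on a tiny made-up frame standing in for `df_tips`:

```python
import pandas as pd

# Tiny stand-in for df_tips, keeping only the two columns of interest
df_toy = pd.DataFrame({
    'day':  ['Sat', 'Sat', 'Sat', 'Sun', 'Thur'],
    'size': [2, 2, 3, 2, 4],
})

# Count tables per (day, size) cell and take the most frequent combination
counts = df_toy.groupby(['day', 'size']).size()
print(counts.idxmax())
```

On the real dataset, the same two lines would return the (day, size) pair highlighted by the heatmap.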

The following examples are taken directly from plotly.

```python
df_gapminder = px.data.gapminder()

px.scatter_geo(df_gapminder, locations="iso_alpha", color="continent", #!
               hover_name="country", size="pop", animation_frame="year",
               projection="natural earth")
```

```python
import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth_mapbox(df, geojson=geojson, color="Bergeron",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
```

```python
import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth_mapbox(df, geojson=geojson, color="winner",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
```

Machine Learning is all about calculating the best numbers of a mathematical equation.

The form of a Linear Regression mathematical equation is as follows:

$$y = (a) + (b) \cdot x$$

As we see in the following plot, **not any mathematical equation is valid**; the red line doesn't fit the real data (blue points), whereas the green one is the best.

How do we understand the development of Machine Learning models in Python **to predict what may happen in the future**?

This tutorial covers the topics described below using **USA Car Crashes data** to predict the accidents based on alcohol.

- Step-by-step procedure to compute a Linear Regression:
  - `.fit()` the numbers of the mathematical equation
  - `.predict()` the future with the mathematical equation
  - `.score()` how good the mathematical equation is

- How to **visualise** the Linear Regression model?
- How to **evaluate** Regression models step by step?
  - Residuals Sum of Squares
  - Total Sum of Squares
  - R Squared Ratio \(R^2\)
- How to **interpret** the coefficients of the Linear Regression?
- Compare the Linear Regression to other Machine Learning models such as:
  - Random Forest
  - Support Vector Machines
- Why **we don't need to know the maths** behind every model to apply Machine Learning in Python

- This dataset contains **statistics about Car Accidents** (columns)
- In each one of the **USA States** (rows)

Visit this website if you want to know the measures of the columns.

```python
import seaborn as sns #!

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'total']]
df_crashes.rename({'total': 'accidents'}, axis=1, inplace=True)
df_crashes
```

- As always, we need to use a function

Where is the function?

- It should be in a library

Which is the Python library for Machine Learning?

- Sci-Kit Learn, see website

How can we access the function to compute a Linear Regression model?

- We need to import the `LinearRegression` class within the `linear_model` module:

`from sklearn.linear_model import LinearRegression`

- Now, we create an instance `model_lr` of the class `LinearRegression`:

`model_lr = LinearRegression()`

Which function applies the Linear Regression **algorithm** in which the **Residual Sum of Squares is minimised**?

`model_lr.fit()`

TypeError Traceback (most recent call last)

Input In [186], in ()----> 1 model_lr.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

Why is it asking for two parameters: `y` and `X`?

The algorithm must distinguish between the variable we want to predict (`y`) and the variables used to explain (`X`) the prediction.

- `y`: target ~ dependent ~ label ~ class variable
- `X`: features ~ independent ~ explanatory variables

```python
target = df_crashes['accidents']
features = df_crashes[['alcohol']]
```

`model_lr.fit(X=features, y=target)`

LinearRegression()

Take the historical data:

`features`

To calculate predictions through the Model's Mathematical Equation:

`model_lr.predict(X=features)`

array([17.32111171, 15.05486718, 16.44306899, 17.69509287, 12.68699734, 13.59756016, 13.76016066, 15.73575679, 9.0955587 , 16.40851638, 13.78455074, 20.44100889, 14.87600663, 14.70324359, 14.40446516, 13.8353634 , 14.54064309, 15.86177218, 19.6076813 , 15.06502971, 13.98780137, 11.69106925, 13.88211104, 11.5162737 , 16.94713055, 16.98371566, 24.99585551, 16.45729653, 15.41868581, 12.93089809, 12.23171592, 15.95526747, 13.10772614, 16.44306899, 26.26007443, 15.60161138, 17.58737003, 12.62195713, 17.32517672, 14.43088774, 25.77430543, 18.86988151, 17.3515993 , 20.84141263, 9.53254755, 14.15040187, 12.82724027, 12.96748321, 19.40239816, 15.11380986, 17.17477126])

Can you see the difference between reality and prediction?

- Model predictions aren't perfect; they don't predict the real data exactly. Nevertheless, they make a fair approximation allowing decision-makers to understand the future better.

```python
df_crashes['pred_lr'] = model_lr.predict(X=features)
df_crashes
```

The orange dots reference the predictions lined up in a line because the Linear Regression model calculates the best coefficients (numbers) for a line's mathematical equation based on historical data.

`import matplotlib.pyplot as plt`

```python
sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes);
```

We have orange dots for the alcohol values represented in our `DataFrame`. Were we to make estimations for all possible alcohol values, we'd get a **sequence of consecutive points**, which represents a line. Let's draw it with the `.lineplot()` function:

```python
sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes)
sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange');
```

To measure the quality of the model, we use the `.score()` function, which calculates the difference between the model's predictions and reality.

`model_lr.score(X=features, y=target)`

0.7269492966665405

The step-by-step procedure of the previous calculation starts with the difference between reality and predictions:

`df_crashes['accidents'] - df_crashes['pred_lr']`

```
abbrev
AL    1.478888
AK    3.045133
        ...
WI   -1.313810
WY    0.225229
Length: 51, dtype: float64
```

This difference is usually called **residuals**:

```python
df_crashes['residuals'] = df_crashes['accidents'] - df_crashes['pred_lr']
df_crashes
```

We cannot use all the residuals to tell how good our model is. Therefore, we need to add them up:

`df_crashes.residuals.sum()`

1.4033219031261979e-13

Let's round to two decimal points to suppress the scientific notation:

`df_crashes.residuals.sum().round(2)`

0.0

But we get ZERO. Why?

The residuals contain positive and negative numbers; some points are above the line, and others are below the line.
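A tiny numeric illustration of why the raw sum hides the errors (the residual values are made up):

```python
# Two pairs of equal errors on opposite sides of the line cancel out...
residuals = [2.0, -2.0, 1.0, -1.0]
print(sum(residuals))   # 0.0, although the model clearly errs

# ...whereas squaring makes every error count
squared = [r**2 for r in residuals]
print(sum(squared))     # 10.0
```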

To turn negative values into positive values, we square the residuals:

```python
df_crashes['residuals^2'] = df_crashes.residuals**2
df_crashes
```

And finally, add the residuals up to calculate the **Residual Sum of Squares (RSS)**:

`df_crashes['residuals^2'].sum()`

231.96888653310063

`RSS = df_crashes['residuals^2'].sum()`

$$RSS = \sum(y_i - \hat{y})^2$$

where

- $y_i$ is the real number of accidents
- $\hat y$ is the predicted number of accidents
- RSS: Residual Sum of Squares

The model was made to predict the number of accidents.

We should ask: how does the variation of the model's predictions compare to the variation of the real data (the real number of accidents)?

We have already calculated the variation of the model's prediction. Now we calculate the variation of the real data by comparing each accident value to the average:

`df_crashes.accidents`

```
abbrev
AL    18.8
AK    18.1
      ...
WI    13.8
WY    17.4
Name: accidents, Length: 51, dtype: float64
```

`df_crashes.accidents.mean()`

15.79019607843137

$$y_i - \bar y$$

where $y_i$ is the number of accidents

`df_crashes.accidents - df_crashes.accidents.mean()`

```
abbrev
AL    3.009804
AK    2.309804
        ...
WI   -1.990196
WY    1.609804
Name: accidents, Length: 51, dtype: float64
```

```python
df_crashes['real_residuals'] = df_crashes.accidents - df_crashes.accidents.mean()
df_crashes
```

We square the residuals for the same reason as before (to convert negative values into positive ones):

`df_crashes['real_residuals^2'] = df_crashes.real_residuals**2`

$$TSS = \sum(y_i - \bar y)^2$$

where

- $y_i$ is the number of accidents
- $\bar y$ is the average number of accidents
- TSS: Total Sum of Squares

And we add up the values to get the **Total Sum of Squares (TSS)**:

`df_crashes['real_residuals^2'].sum()`

849.5450980392156

`TSS = df_crashes['real_residuals^2'].sum()`

The ratio between RSS and TSS represents how much our model fails concerning the variation of the real data.

`RSS/TSS`

0.2730507033334595

0.27 is the badness of the model as **RSS** represents the **residuals** (errors) of the model.

To calculate the **goodness** of the model, we subtract the ratio RSS/TSS from 1:

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(y_i - \hat{y})^2}{\sum(y_i - \bar y)^2}$$

`1 - RSS/TSS`

0.7269492966665405

The model can explain 72.69% of the variability in the number of accidents.

The following image describes how we calculate the goodness of the model.

How do we get the numbers of the mathematical equation of the Linear Regression?

- We need to look inside the object `model_lr` and show the attributes with `.__dict__` (the numbers were computed by the `.fit()` function):

`model_lr.__dict__`

{'fit_intercept': True, 'normalize': 'deprecated', 'copy_X': True, 'n_jobs': None, 'positive': False, 'feature_names_in_': array(['alcohol'], dtype=object), 'n_features_in_': 1, 'coef_': array([2.0325063]), '_residues': 231.9688865331006, 'rank_': 1, 'singular_': array([12.22681605]), 'intercept_': 5.857776154826299}

- `intercept_` is the (a) number of the mathematical equation
- `coef_` is the (b) number of the mathematical equation

$$accidents = (a) + (b) \cdot alcohol$$

$$accidents = (intercept\_) + (coef\_) \cdot alcohol$$

$$accidents = (5.857) + (2.032) \cdot alcohol$$

For every unit of alcohol increased, the number of accidents will increase by 2.032 units.

`import pandas as pd`

`df_to_pred = pd.DataFrame({'alcohol': [1, 2, 3, 4, 5]})`

`df_to_pred['pred_lr'] = 5.857 + 2.032 * df_to_pred.alcohol`

`df_to_pred['diff'] = df_to_pred.pred_lr.diff()`

`df_to_pred`

Could we make a better model that improves the current Linear Regression Score?

`model_lr.score(X=features, y=target)`

0.7269492966665405

- Let's try a Random Forest and a Support Vector Machines.

Do we need to know the maths behind these models to implement them in Python?

No. As we explain in this tutorial, all you need to do is:

- `.fit()`
- `.predict()`
- `.score()`
- Repeat

`RandomForestRegressor()` in Python:

`from sklearn.ensemble import RandomForestRegressor`

`model_rf = RandomForestRegressor()`

`model_rf.fit(X=features, y=target)`

RandomForestRegressor()

`model_rf.predict(X=features)`

array([18.644 , 16.831 , 17.54634286, 21.512 , 12.182 , 13.15 , 12.391 , 17.439 , 7.775 , 17.74664286, 14.407 , 18.365 , 15.101 , 14.132 , 13.553 , 15.097 , 15.949 , 19.857 , 21.114 , 15.53 , 13.241 , 8.98 , 14.363 , 9.54 , 17.208 , 16.593 , 22.087 , 16.24144286, 14.478 , 11.51 , 11.59 , 18.537 , 11.77 , 17.54634286, 23.487 , 14.907 , 20.462 , 12.59 , 18.38 , 12.449 , 23.487 , 20.311 , 19.004 , 19.22 , 9.719 , 13.476 , 12.333 , 11.08 , 22.368 , 14.67 , 17.966 ])

`df_crashes['pred_rf'] = model_rf.predict(X=features)`

`model_rf.score(X=features, y=target)`

0.9549469198566546

Let's create a dictionary that stores the Score of each model:

`dic_scores = {}`

`dic_scores['lr'] = model_lr.score(X=features, y=target)`

`dic_scores['rf'] = model_rf.score(X=features, y=target)`

`SVR()` in Python:

`from sklearn.svm import SVR`

`model_sv = SVR()`

`model_sv.fit(X=features, y=target)`

SVR()

`model_sv.predict(X=features)`

array([18.29570777, 15.18462721, 17.2224187 , 18.6633175 , 12.12434781, 13.10691581, 13.31612684, 16.21131216, 12.66062465, 17.17537208, 13.34820949, 19.38920329, 14.91415215, 14.65467023, 14.2131504 , 13.41560202, 14.41299448, 16.39752499, 19.4896662 , 15.20002787, 13.62200798, 11.5390483 , 13.47824339, 11.49818909, 17.87053595, 17.9144274 , 19.60736085, 17.24170425, 15.73585463, 12.35136579, 11.784815 , 16.53431108, 12.53373232, 17.2224187 , 19.4773929 , 16.01115736, 18.56379706, 12.06891287, 18.30002795, 14.25171609, 19.59597679, 19.37950461, 18.32794218, 19.29994413, 12.26345665, 13.84847453, 12.25128025, 12.38791686, 19.48212198, 15.27397732, 18.1357253 ])

`df_crashes['pred_sv'] = model_sv.predict(X=features)`

`model_sv.score(X=features, y=target)`

0.7083438012012769

`dic_scores['sv'] = model_sv.score(X=features, y=target)`

The best model is the Random Forest with a Score of 0.95:

`pd.Series(dic_scores).sort_values(ascending=False)`

rf 0.954947 lr 0.726949 sv 0.708344 dtype: float64

Let's put the following data:

`df_crashes[['accidents', 'pred_lr', 'pred_rf', 'pred_sv']]`

Into a plot:

`sns.scatterplot(x='alcohol', y='accidents', data=df_crashes, label='Real Data')`

`sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes, label='Linear Regression')`

`sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange')`

`sns.scatterplot(x='alcohol', y='pred_rf', data=df_crashes, label='Random Forest')`

`sns.scatterplot(x='alcohol', y='pred_sv', data=df_crashes, label='Support Vector Machines');`

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Ask him any questions on **Twitter** or **LinkedIn**

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information from transactions in tables (rows) at a restaurant considering socio-demographic and economic variables (columns).

`import seaborn as sns`

`df_tips = sns.load_dataset('tips')`

`df_tips`

Grouping data to summarise the information helps you identify conclusions. For example, the summary below shows that **Dinners on Sundays** bring the best customers because they:

- Spend more on average ($21.41)
- Give more tips on average ($3.25)
- Come in larger groups per table on average (2.84 people)

`df_tips.groupby(by=['day', 'time'])\ .mean()\ .fillna(0)\ .style.format('{:.2f}').background_gradient(axis=0)`

`df_tips.groupby(by=['day', 'time'])\ .mean()\ .fillna(0)\ .style.format('{:.2f}').bar(axis=0, width=50, align='zero')`

Let's dig into the details of the `.groupby()`

function from the basics in the following sections.

We use the `.groupby()` function to generate an object that contains as many `DataFrames` as there are categories in the column.

`df_tips.groupby('sex')`

As we have two groups in sex (Female and Male), the length of the `DataFrameGroupBy`

object returned by the `groupby()`

function is 2:

`len(df_tips.groupby('sex'))`

How can we work with the object `DataFrameGroupBy`

?

We use the `.mean()`

function to get the average of the numerical columns for the two groups:

`df_tips.groupby('sex').mean()`

A pretty and simple syntax to summarise the information, right?

- But what's going on inside the `DataFrameGroupBy` object?

`df_tips.groupby('sex')`

`df_grouped = df_tips.groupby('sex')`

The `DataFrameGroupBy` object contains 2 `DataFrames`. To see one of these `DataFrames`, you need to use the function `.get_group()` and pass the group whose `DataFrame` you'd like to return:

`df_grouped.get_group('Male')`

`df_grouped.get_group('Female')`
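Besides `.get_group()`, the `DataFrameGroupBy` object is also iterable. A minimal sketch on a tiny stand-in `DataFrame` (the real `df_tips` comes from `sns.load_dataset('tips')`):

```python
import pandas as pd

# A tiny stand-in for df_tips, invented for illustration
df_mini = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female'],
    'total_bill': [16.99, 10.34, 21.01, 23.68],
})

# Each loop turn yields a (category, DataFrame) pair
for category, df_group in df_mini.groupby('sex'):
    print(category, len(df_group))
# → Female 2
# → Male 2
```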


As the `DataFrameGroupBy` object distinguishes the categories, the moment we apply an aggregation function (click here to see a list of them), we get the mathematical operations computed for each group separately:

`df_grouped.mean()`

We could apply the function to each `DataFrame` separately, although *that is not the point of the* `.groupby()` function.

`df_grouped.get_group('Male').mean(numeric_only=True)`

`df_grouped.get_group('Female').mean(numeric_only=True)`

To get the results for just 1 column of interest, we access the column:

`df_grouped.total_bill`

And use the aggregation function we wish, `.sum()`

in this case:

`df_grouped.total_bill.sum()`

We get the result for just 1 column (total_bill) because the `DataFrames`

generated at `.groupby()`

are accessed as if they were simple `DataFrames`

:

`df_grouped.get_group('Female')`

`df_grouped.get_group('Female').total_bill`

`df_grouped.get_group('Female').total_bill.sum()`

`df_grouped.get_group('Male').total_bill.sum()`

`df_grouped.total_bill.sum()`

So far, we have summarised the data based on the categories of just one column. But, what if we'd like to summarise the data **based on the combinations** of the categories within different categorical columns?

`df_tips.groupby(by=['day', 'smoker']).sum()`

We could have also used another function `.pivot_table()`

to get the same numbers:

`df_tips.pivot_table(index='day', columns='smoker', aggfunc='sum')`

Which one is best?

- I leave it up to you; I'd prefer `.pivot_table()` because its syntax is more accessible.
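To convince ourselves the two are interchangeable, here is a small sketch on a hypothetical mini-dataset (not the real tips data) checking that `.groupby()` plus `.unstack()` and `.pivot_table()` produce the same table:

```python
import pandas as pd

# Hypothetical mini-dataset with the same kind of columns as df_tips
df = pd.DataFrame({
    'day':        ['Sun', 'Sun', 'Sat', 'Sat'],
    'smoker':     ['No',  'Yes', 'No',  'Yes'],
    'total_bill': [20.0,  15.0,  30.0,  25.0],
})

# Two roads to the same summary table
via_groupby = df.groupby(by=['day', 'smoker']).total_bill.sum().unstack()
via_pivot = df.pivot_table(index='day', columns='smoker',
                           values='total_bill', aggfunc='sum')

print(via_groupby.equals(via_pivot))  # → True
```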

It doesn't stop here: we can even compute several aggregation functions at the same time:

`df_tips.groupby(by=['day', 'smoker'])\ .total_bill\ .agg(func=['sum', 'mean'])`

`df_tips.pivot_table(index='day', columns='smoker', values='total_bill', aggfunc=['sum', 'mean'])`

`dfres = df_tips.pivot_table(index='day', columns='smoker', values='total_bill', aggfunc=['sum', 'mean'])`

You could even style the output `DataFrame`

:

`dfres.style.background_gradient()`

For me, it's nicer than styling the `.groupby()`

returned DataFrame.

As we say in Spain:

¡Pa' gustos, los colores! (roughly: "to each their own")

`df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])`

`dfres = df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])`

`dfres.style.background_gradient()`

We can compute more than one mathematical operation:

`df_tips.pivot_table(index='sex', columns='time', aggfunc=['sum', 'mean'], values='total_bill')`

And use more than one column in each of the parameters:

`df_tips.pivot_table(index='sex', columns='time', aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])`

`df_tips.pivot_table(index=['day', 'smoker'], columns='time', aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])`

`df_tips.pivot_table(index=['day', 'smoker'], columns=['time', 'sex'], aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])`

The `.size()` Function in `.groupby()`

The `.size()` function counts the number of rows (observations) in each of the `DataFrames` generated by `.groupby()`.

`df_grouped.size()`

`df_tips.groupby(by=['sex', 'time']).size()`

`.pivot_table()`

We can use `.pivot_table()` to represent the data more clearly:

`df_tips.pivot_table(index='sex', columns='time', aggfunc='size')`

`df_tips.pivot_table(index='smoker', columns=['day', 'sex'],aggfunc='size')`

`dfres = df_tips.pivot_table(index='smoker', columns=['day', 'sex'], aggfunc='size')`

`dfres.style.background_gradient()`

`df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')`

`dfres = df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')`

`dfres.style.background_gradient()`

We can even choose the direction in which the cells are gradient-coloured:

- `axis=1`: the highest value among the columns of the same row
- `axis=0`: the highest value among the rows of the same column

`dfres.style.background_gradient(axis=1)`



The following image is pretty self-explanatory to understand how APIs work:

- The API is the waiter who:
- Takes the requests from the clients
- Takes them to the kitchen
- And later serves the "cooked" response back to the clients

The URL is an address we use to locate files on the Internet:

- Documents: pdf, ppt, docx,...
- Multimedia: mp4, mp3, mov, png, jpeg,...
- Data Files: csv, json, db,...

Check out the following gif where we inspect the resources we download when locating https://economist.com.

URL - Watch Video

An Application Program Interface (API) is a communication tool between the client and the server to exchange information through a URL.

The API defines the rules by which the URL will work. Like Python, the API contains:

- Functions
- Parameters
- Accepted Values

The only extra knowledge we need to consider is the use of **tokens**.

A token is a code you use in the request to validate your identity, as most platforms charge money to use their API.

`token = 'PASTE_YOUR_TOKEN_HERE'`

You can find your token in the website's documentation.

`'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'`

Every time you make a **call to an API** requesting some information, you later receive a **response**.

Check this JSON, a type of file that stores structured data returned by the API.

If you want to know more about the JSON file, see article.

- Base API: `https://www.alphavantage.co/query?`
- Parameters:
  - `function=TIME_SERIES_INTRADAY`
  - `symbol=IBM`
  - `interval=5min`
  - `apikey=demo`
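The same call can be expressed with `requests`' `params` argument, which URL-encodes a dictionary of parameters for us. A sketch using `.prepare()`, which assembles the URL without sending anything over the network:

```python
import requests

# Build the query string from a dictionary of parameters;
# .prepare() assembles the request without sending it
req = requests.Request(
    method='GET',
    url='https://www.alphavantage.co/query',
    params={
        'function': 'TIME_SERIES_INTRADAY',
        'symbol': 'IBM',
        'interval': '5min',
        'apikey': 'demo',
    },
).prepare()

print(req.url)
```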

`import requests`

`api_call = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'`

`requests.get(url=api_call)`

`>>> <Response [200]>`

`res = requests.get(url=api_call)`

The function returns an object containing all the information related to the **API request and response**.

`res.apparent_encoding`

`>>> 'ascii'`

`res.headers`

`>>> {'Date': 'Mon, 18 Jul 2022 18:01:19 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Cookie', 'X-Frame-Options': 'SAMEORIGIN', 'Allow': 'GET, HEAD, OPTIONS', 'Via': '1.1 vegur', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '72cd1f3959323851-MAD', 'Content-Encoding': 'gzip'}`

`res.history`

`>>> []`

To turn the response into a Python-interpretable object, we use the `.json()` function to get a dictionary with the data.

`res.json()`

`>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume', '2. Symbol': 'IBM', '3. Last Refreshed': '2022-06-29 19:25:00', '4. Interval': '5min', '5. Output Size': 'Compact', '6. Time Zone': 'US/Eastern'}, 'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100', '2. high': '140.7100', '3. low': '140.7100', '4. close': '140.7100', '5. volume': '531'}, ... '2022-06-28 17:25:00': {'1. open': '142.1500', '2. high': '142.1500', '3. low': '142.1500', '4. close': '142.1500', '5. volume': '100'}}}`

`data = res.json()`

The data in the dictionary represents the symbol **IBM** in intervals of **5min** for the **TIME_SERIES_INTRADAY**.

Check the dictionary above to confirm.

`res.request.path_url`

`>>> '/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'`

We need to change the value of the parameter `symbol`

within the URL we use to call the API:

`stock = 'AAPL'`

`api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey=demo'`

`res = requests.get(url=api_call)`

`res.json()`

`>>> {'Information': 'The **demo** API key is for demo purposes only. Please claim your free API key at (https://www.alphavantage.co/support/#api-key) to explore our full API offerings. It takes fewer than 20 seconds.'}`

The API returns a JSON which implicitly says the ***demo** API key* only works for retrieving data from the symbol IBM. In other words, we cannot use the same demo API key to retrieve the AAPL stock data.

We should include our token in the API call:

`token`

`>>> 'YOUR_PASTED_TOKEN_ABOVE'`

`api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'`

`res = requests.get(url=api_call)`

`data = res.json()`

`data`

`>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume', '2. Symbol': 'AAPL', '3. Last Refreshed': '2022-07-15 20:00:00', '4. Interval': '5min', '5. Output Size': 'Compact', '6. Time Zone': 'US/Eastern'}, 'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100', '2. high': '140.7100', '3. low': '140.7100', '4. close': '140.7100', '5. volume': '531'}, ... '2022-06-28 17:25:00': {'1. open': '142.1500', '2. high': '142.1500', '3. low': '142.1500', '4. close': '142.1500', '5. volume': '100'}}}`

Why can't we work directly with `data`? Because `data` contains a dictionary, which is a very simple Python object.

`data.sum()`

`>>> AttributeError Traceback (most recent call last) ... AttributeError: 'dict' object has no attribute 'sum'`

We need to create a `DataFrame`

out of this dictionary to have a powerful object we could use to apply many functions.


`import pandas as pd`

`pd.DataFrame(data=data)`

We'd like to have the open, high, close,... variables as the columns, not `Meta Data` and `Time Series (5min)`. Why is this happening?

- `Meta Data` and `Time Series (5min)` are the `keys` of the dictionary `data`.
- The value of the key `Time Series (5min)` is the information we want in the DataFrame.

`data['Time Series (5min)']`

`>>> {'2022-07-15 20:00:00': {'1. open': '150.0300', '2. high': '150.0700', '3. low': '150.0300', '4. close': '150.0300', '5. volume': '4752'}, ... '2022-06-28 17:25:00': {'1. open': '142.1500', '2. high': '142.1500', '3. low': '142.1500', '4. close': '142.1500', '5. volume': '100'}`

`pd.DataFrame(data['Time Series (5min)'])`

`df_apple = pd.DataFrame(data['Time Series (5min)'])`

The `DataFrame` is not represented as we'd like because the dates are in the columns and the variables are in the index. So which function can we use to transpose the `DataFrame`?

`df_apple.transpose()`

`df_apple = df_apple.transpose()`

Let's get the average value from the close price:

`df_apple['4. close']`

`>>> 2022-07-15 20:00:00 150.0300 2022-07-15 19:55:00 150.0700 ... 2022-07-15 11:45:00 149.1500 2022-07-15 11:40:00 149.1100 Name: 4. close, Length: 100, dtype: object`

`df_apple['4. close'].mean()`

`>>> ValueError Traceback (most recent call last) ... ValueError: could not convert string to float: '150.0300150.0700150.0400 [...] 149.1100' ... TypeError: Could not convert 150.0300150.0700150.0400 [...] 149.1100 to numeric`

Why are we getting this ugly error?

- The values of the `Series` aren't numerical objects.

`df_apple.dtypes`

`>>> 1. open object 2. high object 3. low object 4. close object 5. volume object dtype: object`

Can you change the type of the values into numerical objects?

`df_apple = df_apple.apply(pd.to_numeric)`

Now that we have the `Series`

values as numerical objects:

`df_apple.dtypes`

`>>> 1. open float64 2. high float64 3. low float64 4. close float64 5. volume int64 dtype: object`

We should be able to get the average close price:

`df_apple['4. close'].mean()`

`>>> 149.551566`

What else could we do?

`df_apple.hist();`

`df_apple.hist(layout=(2,3), figsize=(15,8));`

`token = 'PASTE_YOUR_TOKEN_HERE'`

`stock = 'AAPL'`

`api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'`

`res = requests.get(url=api_call)`

`data = res.json()`

`df_apple = pd.DataFrame(data=data['Time Series (5min)'])`

`df_apple = df_apple.transpose()`

`df_apple = df_apple.apply(pd.to_numeric)`

`df_apple.hist(layout=(2,3), figsize=(15,8));`

`size = 'full'`

`info_type = 'TIME_SERIES_DAILY'`

`api_call = f'https://www.alphavantage.co/query?function={info_type}&symbol={stock}&outputsize={size}&apikey={token}'`

`res = requests.get(url=api_call)`

`data = res.json()`

`df_apple_daily = pd.DataFrame(data['Time Series (Daily)'])`

`df_apple_daily = df_apple_daily.transpose()`

`df_apple_daily = df_apple_daily.apply(pd.to_numeric)`

`df_apple_daily.index = pd.to_datetime(df_apple_daily.index)`

`df_apple_daily.plot.line(layout=(2,3), figsize=(15,8), subplots=True);`


Programming is all about working with data.

We can work with many types of data structures. Nevertheless, the pandas DataFrame is the most useful because it contains functions that automate a lot of work with a single line of code.

This tutorial will teach you how to work with the `pandas.DataFrame`

object.

First, though, we will demonstrate why working with simple Arrays (what most people do) makes your life more difficult than it should be.

An array is any object that can store **more than one object**. For example, the `list`

:

`[100, 134, 87, 99]`

Let's say we are talking about the revenue our e-commerce has had over the last 4 months:

`list_revenue = [100, 134, 87, 99]`

We want to calculate the total revenue (i.e., we sum up the objects within the list):

`list_revenue.sum()`

`AttributeError Traceback (most recent call last) ... AttributeError: 'list' object has no attribute 'sum'`

The list is a *poor* object which doesn't contain powerful functions.
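To be precise, Python's built-in `sum()` can total a plain list; the point is that the list itself carries no analytical methods, while a `Series` bundles them with the data:

```python
import pandas as pd

list_revenue = [100, 134, 87, 99]

# The built-in sum() does work on a plain list...
print(sum(list_revenue))              # → 420

# ...but the list carries no analytical methods of its own
print(hasattr(list_revenue, 'mean'))  # → False

# A Series bundles those methods with the data
series_revenue = pd.Series(list_revenue)
print(series_revenue.mean())          # → 105.0
```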

What can we do then?

We convert the list to a powerful object such as the `Series`

, which comes from `pandas`

library.

`import pandas`

`pandas.Series(list_revenue)`

`>>> 0 100 1 134 2 87 3 99 dtype: int64`

`series_revenue = pandas.Series(list_revenue)`

Now we have a powerful object that can perform the `.sum()`

:

`series_revenue.sum()`

`>>> 420`

Within the Series, we can find more objects.

`series_revenue`

`>>> 0 100 1 134 2 87 3 99 dtype: int64`

`series_revenue.index`

`>>> RangeIndex(start=0, stop=4, step=1)`

Let's change the elements of the index:

`series_revenue.index = ['1st Month', '2nd Month', '3rd Month', '4th Month']`

`series_revenue`

`>>> 1st Month 100 2nd Month 134 3rd Month 87 4th Month 99 dtype: int64`

`series_revenue.values`

`>>> array([100, 134, 87, 99])`

`series_revenue.name`

The `Series`

doesn't contain a name. Let's define it:

`series_revenue.name = 'Revenue'`

`series_revenue`

`>>> 1st Month 100 2nd Month 134 3rd Month 87 4th Month 99 Name: Revenue, dtype: int64`

The values of the Series (right-hand side) are determined by their **data type** (alias `dtype`

):

`series_revenue.dtype`

`>>> dtype('int64')`

Let's change the values' dtype to be `float`

(decimal numbers)

`series_revenue.astype(float)`

`>>> 1st Month 100.0 2nd Month 134.0 3rd Month 87.0 4th Month 99.0 Name: Revenue, dtype: float64`

`series_revenue = series_revenue.astype(float)`

What else could we do with the Series object?

`series_revenue.describe()`

`>>> count 4.000000 mean 105.000000 std 20.215506 min 87.000000 25% 96.000000 50% 99.500000 75% 108.500000 max 134.000000 Name: Revenue, dtype: float64`

`series_revenue.plot.bar();`

`series_revenue.plot.barh();`

`series_revenue.plot.pie();`

The `DataFrame`

is a set of Series.

We will create another Series `series_expenses`

to later put them together into a DataFrame.

`pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses')`

`>>> 1st Month 20 2nd Month 23 3rd Month 21 4th Month 18 Name: Expenses, dtype: int64`

`series_expenses = pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses')`

`pandas.DataFrame(data=[series_revenue, series_expenses])`

`df_shop = pandas.DataFrame(data=[series_revenue, series_expenses])`

Let's transpose the DataFrame to have the variables in columns:

`df_shop.transpose()`

`df_shop = df_shop.transpose()`

`df_shop.index`

`>>> Index(['1st Month', '2nd Month', '3rd Month', '4th Month'], dtype='object')`

`df_shop.columns`

`>>> Index(['Revenue', 'Expenses'], dtype='object')`

`df_shop.values`

`>>>array([[100., 20.], [134., 23.], [ 87., 21.], [ 99., 18.]])`

`df_shop.shape`

`>>> (4, 2)`

What else could we do with the DataFrame object?

`df_shop.describe()`

`df_shop.plot.bar();`

`df_shop.plot.pie(subplots=True);`

`df_shop.plot.line();`

`df_shop.plot.area();`

We could also export the DataFrame to formatted data files:

`df_shop.to_excel('data.xlsx')`

`df_shop.to_csv('data.csv')`

`url = 'https://raw.githubusercontent.com/jsulopzs/data/main/football_players_stats.json'`

`pandas.read_json(url, orient='index')`

`df_football = pandas.read_json(url, orient='index')`

`df_football.Goals.plot.pie();`

`url = 'https://raw.githubusercontent.com/jsulopzs/data/main/best_tennis_players_stats.json'`

`pandas.read_json(path_or_buf=url, orient='index')`

`df_tennis = pandas.read_json(path_or_buf=url, orient='index')`

`df_tennis.style.background_gradient()`

`df_tennis.plot.pie(subplots=True, layout=(2,3), figsize=(10,6));`

`pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]`

`df_laliga = pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]`

`df_laliga.Pts.plot.barh();`

`df_laliga.Pts.sort_values().plot.barh();`

`url = 'https://raw.githubusercontent.com/jsulopzs/data/main/internet_usage_spain.csv'`

`pandas.read_csv(filepath_or_buffer=url)`

`df_internet = pandas.read_csv(filepath_or_buffer=url)`

`df_internet.hist();`

`df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')`

`dfres = df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')`

`dfres.style.background_gradient('Greens', axis=1)`


Don't miss out on his posts on **LinkedIn** to become a more efficient Python developer.

Machine Learning is a field that focuses on **getting a mathematical equation** to make predictions, although not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

- **Classifiers** to predict **Categorical Variables**
- **Regressors** to predict **Numerical Variables**

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

The Machine wants to get the best numbers of a mathematical equation such that **the difference between reality and predictions is minimum**:

- A **Classifier** evaluates the model based on the **prediction success rate** (does $y = \hat y$?)
- A **Regressor** evaluates the model based on the **distance between real data and predictions** (the residuals $y - \hat y$)

There are many Machine Learning Models of each type.

You don't need to know the process behind each model because they all work the same way (see article). In the end, you will choose the one that makes better predictions.
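A sketch of that shared interface: the model classes below are real scikit-learn estimators, while the tiny dataset is invented for illustration. Note that only the constructor line changes between models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic dataset: one feature, binary target
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

# Swapping models only changes the object we construct;
# .fit() / .predict() / .score() stay identical
for model in (LogisticRegression(), DecisionTreeClassifier()):
    model.fit(X=X, y=y)
    print(type(model).__name__, model.score(X=X, y=y))
```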

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

**Table of Important Content**

- How to preprocess/clean the data to fit a Machine Learning model?
  - Dummy Variables
  - Missing Data
- How to **visualize** a Decision Tree model in Python step by step?
- How to **interpret** the nodes and leaf values of a Decision Tree plot?
- How to **evaluate** Classification models?
  - Sensitivity
  - Specificity
  - ROC Curve
- How to compare Classification models to choose the best one?

This dataset represents:

- **people** (rows) aboard the Titanic
- and their **sociological characteristics** (columns)

`import seaborn as sns`

`import pandas as pd`

`df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]`

`df_titanic`

We should know from the previous chapter that we need a function accessible from a class in the library `sklearn`.

`from sklearn.tree import DecisionTreeClassifier`

We instantiate the class to create a copy of the original code's blueprint, so that we don't "modify" the source code:

`model_dt = DecisionTreeClassifier()`

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

`model_dt.fit()`

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in
----> 1 model_dt.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'
```

Why is it asking for two parameters, `y` and `X`?

- `y`: target ~ dependent ~ label ~ class variable
- `X`: explanatory ~ independent ~ feature variables

```python
target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')
```

`model_dt.fit(X=explanatory, y=target)`

```
---------------------------------------------------------------------------
ValueError: could not convert string to float: 'male'
```

Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

The error says:

`ValueError: could not convert string to float: 'male'`

From which we can interpret that the function `.fit()` does **not accept values of** `string` type like the ones in the `sex` column:

`df_titanic`

Therefore, we need to convert the categorical columns to **dummies** (0s & 1s):

`pd.get_dummies(df_titanic, drop_first=True)`

`df_titanic = pd.get_dummies(df_titanic, drop_first=True)`
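To see what `pd.get_dummies()` does before applying it to the whole dataset, here is a tiny sketch on an invented column:

```python
import pandas as pd

# toy column with a category, invented for illustration
toy = pd.DataFrame({'sex': ['male', 'female', 'male']})

# each category becomes a 0/1 column; drop_first=True removes the
# redundant one (knowing sex_male already tells us who is female)
dummies = pd.get_dummies(toy, drop_first=True)
```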

We separate the variables again to take into account the latest modification:

```python
explanatory = df_titanic.drop(columns='survived')
target = df_titanic[['survived']]
```

Now we should be able to fit the model:

`model_dt.fit(X=explanatory, y=target)`

---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The data passed to the function contains **missing data** (`NaN`). Precisely, there are 177 people whose age we don't have:

`df_titanic.isna()`

`df_titanic.isna().sum()`

```
survived          0
age             177
sex_male          0
embarked_Q        0
embarked_S        0
class_Second      0
class_Third       0
dtype: int64
```

Who are the people who lack the information?

`mask_na = df_titanic.isna().sum(axis=1) > 0`

`df_titanic[mask_na]`

What could we do with them?

1. Drop the people (rows) who miss the age from the dataset.
2. Fill in the age with the average age of similar combinations (like males who survived).
3. Apply an algorithm to fill them in.

We'll choose **option 1 to simplify the tutorial**.
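Had we gone with option 2 instead, a hedged sketch of group-based imputation could look like this (the toy data and the grouping columns are illustrative choices, not the chapter's method):

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic data after creating dummies
toy = pd.DataFrame({
    'survived': [1, 1, 0, 0],
    'sex_male': [1, 1, 0, 0],
    'age':      [20.0, np.nan, 30.0, np.nan],
})

# fill each missing age with the mean age of its (survived, sex_male) group
toy['age'] = toy.groupby(['survived', 'sex_male'])['age'] \
                .transform(lambda s: s.fillna(s.mean()))
```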

Therefore, we go from 891 people:

`df_titanic`

To 714 people:

`df_titanic.dropna()`

`df_titanic = df_titanic.dropna()`

We separate the variables again to take into account the latest modification:

```python
explanatory = df_titanic.drop(columns='survived')
target = df_titanic['survived']
```

Now we shouldn't have any more trouble with the data to fit the model.

We don't get any errors because we correctly preprocessed the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

```python
model_dt.fit(X=explanatory, y=target)
model_dt.__dict__
```

```
{'criterion': 'gini',
 'splitter': 'best',
 'max_depth': None,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_weight_fraction_leaf': 0.0,
 'max_features': None,
 'max_leaf_nodes': None,
 'random_state': None,
 'min_impurity_decrease': 0.0,
 'class_weight': None,
 'ccp_alpha': 0.0,
 'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S',
        'class_Second', 'class_Third'], dtype=object),
 'n_features_in_': 6,
 'n_outputs_': 1,
 'classes_': array([0, 1]),
 'n_classes_': 2,
 'max_features_': 6,
 'tree_': <sklearn.tree._tree.Tree at 0x16612cce0>}
```


We have a fitted `DecisionTreeClassifier`. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

`model_dt.predict_proba(X=explanatory)[:5]`

```
array([[0.82051282, 0.17948718],
       [0.05660377, 0.94339623],
       [0.53921569, 0.46078431],
       [0.05660377, 0.94339623],
       [0.82051282, 0.17948718]])
```

Let's create a new `DataFrame` to keep the target and the predictions together, to understand the topic better:

`df_pred = df_titanic[['survived']].copy()`

And add the predictions:

```python
df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]
df_pred
```

How have we calculated those predictions?

The **Decision Tree** model doesn't have a single mathematical equation as such. Instead, it has a set of conditions represented in a tree:

```python
from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt);
```

There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

```python
model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)
plot_tree(decision_tree=model_dt);
```

Let's make the image bigger:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plot_tree(decision_tree=model_dt);
```

The conditions look like `X[2] <= 0.5`, where `X[2]` means the 3rd variable (Python starts counting at 0) of the explanatory ones. If we'd like to see the names of the columns instead, we need to add the `feature_names` parameter:

`explanatory.columns`

Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype='object')

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);
```

Let's add some colours to see how the predictions will go based on the fulfilled conditions:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
```

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who travelled on the Titanic in 3rd class and didn't survive:

`df_titanic[:1]`

To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

`explanatory[:1]`

To the function `.predict_proba()`, which returns a probability of 17.94%:

`model_dt.predict_proba(X=explanatory[:1])`

array([[0.82051282, 0.17948718]])

But wait, how did we get to the probability of survival of 17.94%?

Let's explain it step-by-step with the Decision Tree visualization:

```python
plt.figure(figsize=(10, 6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);
```

Based on the tree, the conditions are:

- sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

- age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

The ultimate node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years old.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

`77/429`

0.1794871794871795

We get the same probability; John had a 17.94% chance of surviving the Titanic accident.

As always, we should have a function to calculate the goodness of the model:

`model_dt.score(X=explanatory, y=target)`

0.8025210084033614

The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

As we saw earlier, the classification model calculates the probability of an event occurring. The function `.predict_proba()` gives us two columns of probabilities: people who didn't survive (0) and people who survived (1).

`model_dt.predict_proba(X=explanatory)[:5]`

```
array([[0.82051282, 0.17948718],
       [0.05660377, 0.94339623],
       [0.53921569, 0.46078431],
       [0.05660377, 0.94339623],
       [0.82051282, 0.17948718]])
```

We take the positive probabilities in the second column:

`df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]`

To compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and into 0 otherwise.

```python
import numpy as np

df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred
```

The simple idea behind accuracy is to get the success rate of the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

```python
comp = df_pred.survived == df_pred.pred_dt
comp
```

```
0       True
1       True
       ...
889    False
890     True
Length: 714, dtype: bool
```

If we sum the boolean Series, Python will treat True as 1 and False as 0 to compute the number of correct classifications:

`comp.sum()`

573

We get the score by dividing the successes by all possibilities (the total number of people):

`comp.sum()/len(comp)`

0.8025210084033614

It is also correct to take the mean of the comparisons, because that is the sum divided by the total. Observe how we get the same number:

`comp.mean()`

0.8025210084033614

But it's more efficient to calculate this metric with the function `.score()`:

`model_dt.score(X=explanatory, y=target)`

0.8025210084033614

Can we think that our model is 80.25% good and be happy with it?

- We should not because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_true=df_pred.survived, y_pred=df_pred.pred_dt)
CM = ConfusionMatrixDisplay(cm)
CM.plot();
```

Looking at the first number of the confusion matrix, 407 people didn't survive the Titanic, both in reality and in the predictions.

That's not the case with the number 17: our model classified 17 people as survivors when they actually didn't survive.

- The success rate of the negative class, people who didn't survive, is called the **specificity**: $407/(407+17)$.
- Whereas the success rate of the positive class, people who did survive, is called the **sensitivity**: $166/(166+124)$.

`cm[0,0]`

407

`cm[0,:]`

array([407, 17])

`cm[0,0]/cm[0,:].sum()`

0.9599056603773585

`specificity = cm[0,0]/cm[0,:].sum()`

`cm[1,1]`

166

`cm[1,:]`

array([124, 166])

`cm[1,1]/cm[1,:].sum()`

0.5724137931034483

`sensitivity = cm[1,1]/cm[1,:].sum()`
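If you prefer not to index the confusion matrix by hand, the same two rates can be computed with `sklearn.metrics.recall_score`: recall on the positive class is the sensitivity, and recall with `pos_label=0` is the specificity. A sketch on toy labels (invented for illustration):

```python
from sklearn.metrics import recall_score

# toy reality and predictions
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

sensitivity = recall_score(y_true, y_pred)               # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0
```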

We could have gotten the same metrics using the function `classification_report()`. Look at the recall column of rows 0 and 1: the specificity and sensitivity, respectively:

```python
from sklearn.metrics import classification_report

report = classification_report(y_true=df_pred.survived, y_pred=df_pred.pred_dt)
print(report)
```

```
              precision    recall  f1-score   support

           0       0.77      0.96      0.85       424
           1       0.91      0.57      0.70       290

    accuracy                           0.80       714
   macro avg       0.84      0.77      0.78       714
weighted avg       0.82      0.80      0.79       714
```

We can also create a nice `DataFrame` to later use the data for simulations:

```python
report = classification_report(y_true=df_pred.survived, y_pred=df_pred.pred_dt,
                               output_dict=True)
pd.DataFrame(report)
```

Our model is not as good as we thought: when predicting the people who survived, we only get 57.24% of the survivors right.

How can we then assess a reasonable rate for our model?

Watch the following video to understand how the Area Under the Curve (AUC) is a good metric because it sort of combines accuracy, specificity and sensitivity:

We compute this metric in Python as follows:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)

display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()
plt.show()
```

`roc_auc`

0.8205066688353937

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

`RandomForestClassifier()` in Python

```python
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)
```

RandomForestClassifier()

```python
df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred
```

`model_rf.score(X=explanatory, y=target)`

0.9117647058823529

`SVC()` in Python

```python
from sklearn.svm import SVC

model_sv = SVC()
model_sv.fit(X=explanatory, y=target)
```

SVC()

```python
df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred
```

`model_sv.score(X=explanatory, y=target)`

0.6190476190476191

To simplify the explanation, we use accuracy as the metric to compare the models. We have the Random Forest as the best model with an accuracy of 91.17%.

`model_dt.score(X=explanatory, y=target)`

0.8025210084033614

`model_rf.score(X=explanatory, y=target)`

0.9117647058823529

`model_sv.score(X=explanatory, y=target)`

0.6190476190476191

`df_pred.head(10)`
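Because every scikit-learn classifier shares the `.fit()`/`.score()` interface, the comparison can also be written as a loop. A sketch on a synthetic toy dataset (the data is invented, so the scores below are not the Titanic ones):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# toy, perfectly separable data: the label equals the first feature
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 0, 1, 1] * 5

scores = {}
for model in [DecisionTreeClassifier(), RandomForestClassifier(), SVC()]:
    model.fit(X=X, y=y)
    scores[type(model).__name__] = model.score(X=X, y=y)

best_model = max(scores, key=scores.get)  # name of the highest-scoring model
```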

What do I need to start a `Django` project?

- It is recommended that you create a new environment.
- For that, **you should have** `Anaconda` installed. If not, click here to download & install it.
- Create the environment and install the library in your `terminal` **(use Anaconda Prompt for Windows users)**:

```shell
conda create -n django_env django
conda activate django_env
```

Ok, you got it. What's next?

- **Open a Code Editor application** to work more comfortably with the project.
- I use Visual Studio Code (aka VSCode); you may download & install it here.

What should I do within VSCode?

- You will use the `Django CLI`, already installed with the `Django` package,
- to **create the standard folders and files** you need for the application.

Type the following line within the `terminal`:

```shell
django-admin startproject shop
```

What should I see on my computer after this?

- If you open your `user folder`, you will see that
- a folder `shop` has been created.
- `drag & drop` it to VSCode.
- Now check the folder structure and familiarize yourself with the files & folders.
- The folder structure should look like this:

```
- shop/
  - manage.py
  - shop/
    - __init__.py
    - settings.py
    - urls.py
    - asgi.py
    - wsgi.py
```

Do I need to study all of them?

- No, just go with the flow, and you'll get to understand everything at the end

Ok, what's the next step?

- You'll probably want to see your Django App up and running, right?
- Then, go over to the `terminal` and write the following:

```shell
cd shop
python manage.py runserver
```

- A local server has opened at http://127.0.0.1:8000/; open it in a `web browser`.
- It references the `localhost`, and you should see something like this.

What if I try another `URL` like http://127.0.0.1:8000/products?

- You will receive an **error** because
- you didn't tell Django what to do when you go to http://127.0.0.1:8000/products.

How can I tell that to Django?

- Create a new App `products` within the `shop` Django Project with the following line of code:

```shell
python manage.py startapp products
```

- Create a URL within the file `shop > urls.py`:

```python
from django.contrib import admin
from django.urls import path, include  # modified

urlpatterns = [
    path('products/', include('products.urls')),  # added
    path('admin/', admin.site.urls),
]
```

- Create a `View` (HTML code) to be recognised when you go to the `URL` http://127.0.0.1:8000/products.
- Within the file `shop > products > views.py`:

```python
from django.http import HttpResponse


def view_for_products(request):
    return HttpResponse("This function will render `HTML` code that makes you see this <p style='color: red'>text in red</p>.")
```

See this tutorial if you want to know a bit more about `HTML`.

- Call the function `view_for_products` when you click on http://127.0.0.1:8000/products.
- You need to create the file `urls.py` within products: `shop > products > urls.py`

```python
from django.urls import path
from . import views

urlpatterns = [
    path('', views.view_for_products, name='index'),
]
```

Why do we reference the `URLs` in two files, one in `shop/urls.py` and the other in `products/urls.py`?

- It is a best practice to have a `Django` project separated into different `Apps`.
- In this case, we created the `products` App.
- In the file `shop/urls.py`, you reference the `products` URLs here:

```python
urlpatterns = [
    path('products/', include('products.urls')),  # here
    path('admin/', admin.site.urls),
]
```

- So that when you navigate to `https://127.0.0.1:8000/products`,
- you will have access to the URLs defined in `shop/products/urls.py`.
- For example, let's create another View in `shop/products/views.py`:

```python
def new_view(request):
    return HttpResponse('This is the <strong>new view</strong>')
```

- And reference it in the file `shop/products/urls.py`:

```python
from django.urls import path
from . import views

urlpatterns = [
    path('', views.view_for_products, name='index'),
    path('pepa', views.new_view, name='pepa'),  # new url
]
```

- We don't need to reference the View in `shop/urls.py`, since
- we gained access to all URLs in `shop/products/urls.py` when we wrote `include('products.urls')` in the file `shop/urls.py`.
- Try to go to https://127.0.0.1:8000/products/pepa

So, each time I want to create a different `HTML`, do I need to create a View?

- Yes, that's how the Model View Template (MVT) pattern works:
- You introduce a `URL`.
- The `URL` activates a `View`.
- And `HTML` code gets rendered on the website.

Why don't you mention anything about the `model`?

- Well, that's something to cover in the following article. COMING SOON!

Any doubts?

Let me know in the comments; I'd be happy to help!


We used just two variables out of the seven we had in the whole DataFrame.

We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been **harder to plot seven variables with seven axes in a graph**.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

- Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for **dimensionality reduction** techniques. Specifically, the Principal Component Analysis (PCA).

Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce **car accidents**.

You won't create 51 TV campaigns, one for each of the **States of the USA** (rows). Instead, you will see which States behave similarly to cluster them into 3 groups based on the variation across their features (columns).

```python
import seaborn as sns

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes
```

Check this website to understand the measures of the following data.

From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.

For example, it is not the same to increase 1kg of weight than 1m of height.

We will use the `StandardScaler()` algorithm:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]
```

```
array([[ 0.73744574,  1.1681476 ,  0.43993758,  1.00230055,  0.27769155,
        -0.58008306,  0.4305138 ],
       [ 0.56593556,  1.2126951 , -0.21131068,  0.60853209,  0.80725756,
         0.94325764, -0.02289992],
       [ 0.68844283,  0.75670887,  0.18761539,  0.45935701,  1.03314134,
         0.0708756 , -0.98177845],
       [ 1.61949811, -0.48361373,  0.54740815,  1.67605228,  1.95169961,
        -0.33770122,  0.32112519],
       [-0.92865317, -0.39952407, -0.8917629 , -0.594276  , -0.89196792,
        -0.04841772,  1.26617765]])
```

Let's turn the array into a DataFrame for better understanding:

```python
import pandas as pd

df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled
```

Now we see all the variables having the same scale (i.e., around the same limits):

`df_scaled.agg(['min', 'max'])`
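As a sanity check, every column of a standard-scaled dataset should have mean ≈ 0 and standard deviation ≈ 1; that is what `StandardScaler` computes. A self-contained sketch on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaled = StandardScaler().fit_transform(toy)

col_means = scaled.mean(axis=0)  # ~0 for every column
col_stds = scaled.std(axis=0)    # ~1 for every column
```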

We follow the usual Scikit-Learn procedure to develop Machine Learning models.

`from sklearn.cluster import KMeans`

`model_km = KMeans(n_clusters=3)`

`model_km.fit(X=df_scaled)`

`KMeans(n_clusters=3)`

`model_km.predict(X=df_scaled)`

```
array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
       2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1], dtype=int32)
```

`df_pred = df_scaled.copy()`

```python
df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred
```

Now let's visualize the clusters with a 2-axis plot:

`sns.scatterplot(x='total', y='speeding', hue='pred', data=df_pred, palette='Set1');`

Does the visualization make sense?

- No, because the clusters should separate their points from others. Nevertheless, we see some green points in the middle of the blue cluster.

Why is this happening?

- We are **just representing 2 variables**, whereas the model was **fitted with 7 variables**. We can't see the points separated because we're missing 5 variables in the plot.

Why don't we add 5 variables to the plot then?

- We could, but it'd be way too hard to interpret.

Then, what could we do?

- We can apply PCA, a dimensionality reduction technique. Take a look at the following video to understand this concept:

`PCA()` is another technique used to transform data.

How has the data been manipulated so far?

- Original data: `df_crashes`
- Normalized data: `df_scaled`
- Principal Components data: `df_pca` (now)

```python
from sklearn.decomposition import PCA

pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]
```

```
array([[ 1.60367129,  0.13344927,  0.31788093, -0.79529296, -0.57971878,
         0.04622256,  0.21018495],
       [ 1.14421188,  0.85823399,  0.73662642,  0.31898763, -0.22870123,
        -1.00262531,  0.00896585],
       [ 1.43217197, -0.42050562,  0.3381364 ,  0.55251314,  0.16871805,
        -0.80452278, -0.07610742],
       [ 2.49158352,  0.34896812, -1.78874742,  0.26406388, -0.37238226,
        -0.48184939, -0.14763646],
       [-1.75063825,  0.63362517, -0.1361758 , -0.97491605, -0.31581147,
         0.17850962, -0.06895829]])
```

```python
df_pca = pd.DataFrame(data_pca)
df_pca
```

```python
cols_pca = [f'PC{i}' for i in range(1, pca.n_components_ + 1)]
cols_pca
```

`['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']`

```python
df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca
```

Let's visualize a **scatterplot** with `PC1` & `PC2` and colour the points by cluster:

```python
import plotly.express as px

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)
```

Are they **mixed** now?

- No, they aren't.

That's because both PC1 and PC2 represent almost 80% of the variability of the original seven variables.

You can see the following array, where every element represents the amount of variability explained by every component:

`pca.explained_variance_ratio_`

`array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011 , 0.02837999, 0.00392491])`

And the accumulated variability (79.88% until PC2):

`pca.explained_variance_ratio_.cumsum()`

`array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 , 0.99607509, 1. ])`
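A common use of this cumulative array is to pick the smallest number of components that reaches a target share of the variability, say 80% (the threshold here is an illustrative choice):

```python
import numpy as np

# cumulative explained-variance ratios, rounded from the output above
cumulative = np.array([0.5734, 0.7989, 0.8775, 0.9276, 0.9677, 0.9961, 1.0])

# first index where the threshold is reached, +1 to turn it into a count
n_components = int(np.argmax(cumulative >= 0.80)) + 1
```

With these numbers, three components are needed to pass 80% of the variability.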

Which variables represent these two components?

The Principal Components are produced by a **mathematical equation** (once again), which is composed of the following weights:

```python
df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights
```

We can observe that:

- Socio-demographical features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (higher influence) in PC1.
- Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2.

Principal Component Analysis is a technique that gathers the maximum variability of a set of features (variables) into Components.

Therefore, the first two Principal Components accumulate a good amount of common information, because we see two sets of variables that are correlated with each other:

```python
df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);
```

I hope that everything is making sense so far.

To round off the explanation, you can see below how the `df_pca` values are computed.

For example, we can multiply the weights of PC1 with the original variables for **AL**abama:

```python
(df_weights['PC1'] * df_scaled.loc['AL']).sum()
```

`1.6036712920638672`

To get the transformed value of the Principal Component 1 for **AL**abama State:

`df_pca.head()`

The same operation applies to any value of `df_pca`.
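Spelled out for all rows at once, the transformation is just a matrix product of the (centered) scaled data with the transposed component weights. A self-contained sketch on invented data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # toy stand-in for the scaled data
X = X - X.mean(axis=0)        # center it, as PCA does internally

pca = PCA()
transformed = pca.fit_transform(X)

# multiplying the centered data by the weights reproduces the PCA scores
manual = X @ pca.components_.T
```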

Now, let's go back to the PCA plot:

`px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))`

How can we interpret the clusters with the components?

Let's add information to the points thanks to animated plots from the `plotly` library:

```python
hover = '''<b>%{customdata[0]}</b><br><br>PC1: %{x}<br>Total: %{customdata[1]}<br>Alcohol: %{customdata[2]}<br><br>PC2: %{y}<br>Ins Losses: %{customdata[3]}<br>Ins Premium: %{customdata[4]}'''

fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
                 color=df_pred.pred.astype(str),
                 hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,
                             df_crashes.ins_losses, df_crashes.ins_premium])
fig.update_traces(hovertemplate=hover)
```

If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the `min` and `max` values across the socio-demographical features:

`df_crashes.agg(['min', 'max'])`

`df_crashes.loc[['DC', 'SC'],:]`

Apply the same reasoning over the two most extreme points along the y-axis. You will see the same for the *insurance* variables because they determine the positioning of the PC2 (y-axis).

`df_crashes.agg(['min', 'max'])`

`df_crashes.loc[['ID', 'LA'],:]`

Is there a way to represent the weights of the original data for the Principal Components and the points?

That's called a Biplot, which we will see later.

We can observe how we position the points along the loadings vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.

```python
import numpy as np

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
evr = pca.explained_variance_ratio_.round(2)

fig = px.scatter(df_pca, x='PC1', y='PC2',
                 color=model_km.labels_.astype(str),
                 hover_name=df_pca.index,
                 labels={'PC1': f'PC1 ~ {evr[0]}%',
                         'PC2': f'PC2 ~ {evr[1]}%'})

for i, feature in enumerate(df_scaled.columns):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0], y1=loadings[i, 1],
        line=dict(color="red", width=3)
    )
    fig.add_annotation(
        x=loadings[i, 0], y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center", yanchor="bottom",
        text=feature,
    )

fig.show()
```

Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great for grouping variables that behave similarly and later visualising many variables in just one component.

In short, you are simplifying the information in the data. In this example, we go from plotting seven dimensions down to only two. We don't get this for free, though, because we only retain around 80% of the data's original variability.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Machine Learning Models are deployed to, for example:

- Predict objects within an image (**Tesla**) so that the car can take actions.
- Recommend songs to a user (**Spotify**) so that you'd fall in love with the service.
- Rank the posts you are most likely to interact with (**Facebook or Twitter**) so that you spend more time on the app.

If you just care about getting the code to make this happen, you can forget the storytelling and get right into those lines in GitHub

If you want to follow the tutorial and understand the topic in depth, let's get started

Let's say that **we are a car sales company** and we want to make things easier for clients when they decide which car to buy.

They usually don't want a car that **consumes lots of fuel** (`mpg`).

Nevertheless, *they won't know this until they use the car*.

Is there a way to **predict the consumption** based on other characteristics of the car?

- Yes, with a mathematical formula, for example:

`consumption = 2 + 3 * acceleration + 2.1 * horsepower`

We have **historical data** from all car models we have sold over the past few years.

We could use this **data to calculate the BEST mathematical formula**.

And `deploy it to a website` with a form so that clients can answer the consumption question by themselves.

To make this happen, we will follow the structure:

- Create ML Model Object in Python
- Create an HTML Form
- Create Flask App
- Deploy to Heroku
- Visit Website and Make a Prediction

- This dataset contains information about
**car models**(rows) - For which we have some
**characteristics**(columns)

```python
import seaborn as sns

df = sns.load_dataset(name='mpg', index_col='name')[['acceleration', 'weight', 'mpg']]
df.sample(5)
```

| name | acceleration | weight | mpg |
|---|---|---|---|
| subaru | 17.8 | 2065 | 32.3 |
| bmw 2002 | 12.5 | 2234 | 26.0 |
| audi 5000 | 15.9 | 2830 | 20.3 |
| toyota corolla 1200 | 21.0 | 1836 | 32.0 |
| ford gran torino (sw) | 16.0 | 4638 | 14.0 |

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X=df[['acceleration', 'weight']], y=df['mpg'])
model.__dict__
```

```
{'fit_intercept': True,
 'normalize': False,
 'copy_X': True,
 'n_jobs': None,
 'positive': False,
 'n_features_in_': 2,
 'coef_': array([ 0.25081589, -0.00733564]),
 '_residues': 7317.984100916719,
 'rank_': 2,
 'singular_': array([16873.21840634,    49.92970477]),
 'intercept_': 41.39982830200016}
```

And the BEST mathematical formula is:

`consumption = 41.39 + 0.25 * acceleration - 0.0073 * weight`

Save the `LinearRegression()` object into a file:

- The object `LinearRegression()` contains the mathematical formula
- that we will use on the website to make the `prediction`.

```python
import pickle

with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

Now a file called `linear_regression_model.pkl` should appear in the **same folder as your script**.
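To check that the file is usable, we can load the model back and make a test prediction. A sketch that round-trips a small model through pickle (the toy numbers are invented for illustration):

```python
import pickle
from sklearn.linear_model import LinearRegression

# fit a small model on toy (acceleration, weight) -> mpg data
model = LinearRegression()
model.fit(X=[[10, 2000], [15, 2500], [20, 3000]], y=[30.0, 25.0, 20.0])

with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# load it back: the fitted formula travels inside the object
with open('linear_regression_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

prediction = loaded.predict([[15, 2500]])[0]
```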

All websites that you see online are displayed through an HTML file.

Therefore, we need to create an HTML file that contains a `form` for the user to **input the data** and **calculate the prediction for the fuel consumption**.

Website example here

- Let's head over to a Code Editor (VSCode in my case) and create a new file called `index.html`.
You may download Visual Studio Code (VSCode) here

That should contain the following lines:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form>
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input type="number" id="acceleration" name="acceleration" value="34" /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" value="12" /><br /><br />
      <input type="submit" value="Submit" />
    </form>
  </body>
</html>
```

If you open the file `index.html` in a browser, you will see the form and the `submit` button that is supposed to calculate the prediction. Nevertheless, if you click it, nothing will happen, because we need to develop the `Flask` application **to send the user input to the mathematical formula, calculate the prediction** and return it to the website.

As we are going to develop a whole application to a web server (Heroku), we need to create a **dedicated environment** with just the necessary packages.

- Let's head over to the terminal and type the following commands:

```shell
python -m venv car_consumption_prediction
source car_consumption_prediction/bin/activate
```

- Now let's install the required packages:

```shell
pip install flask
pip install scikit-learn
```

Now you should open the folder

`car_consumption_prediction`

in a Code EditorAnd create a new folder

`app`

with two other folders inside:

```
- app
  - model
  - templates
```

- Then move the files we created before to their corresponding folders:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
```

Now that we have the project structure, let's continue with the core functionality.

We will build a **Python script that handles the user input** and makes the prediction for fuel consumption.

- So, create a new file within the `app` folder called `app.py`.

PS: This is the most important file in a `Flask` app because it manages everything.

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
```

- And add the following lines of code:

```python
import flask
import pickle

# Load the trained model
with open('model/linear_regression_model.pkl', 'rb') as f:
    model = pickle.load(f)

app = flask.Flask(__name__, template_folder='templates')

@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        return flask.render_template('index.html')
    elif flask.request.method == 'POST':
        # Form values arrive as strings; cast them to numbers before predicting
        acceleration = float(flask.request.form['acceleration'])
        weight = float(flask.request.form['weight'])
        input_variables = [[acceleration, weight]]
        prediction = model.predict(input_variables)[0]
        return flask.render_template('index.html',
                                     original_input={'Acceleration': acceleration,
                                                     'Weight': weight},
                                     result=prediction)

if __name__ == '__main__':
    app.run()
```

We need to pay attention to what's going on in the last `return ...`: the function `render_template()` is passing the objects from the parameters `original_input` and `result` to `index.html`.

Then, how can we use these variables in the file `index.html`? Copy-paste the following lines of code into `index.html`:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form action="{{ url_for('main') }}" method="POST">
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input type="number" id="acceleration" name="acceleration" required /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" required /><br /><br />
      <input type="submit" value="Submit" />
    </form>
    <br />
    {% if result %}
    <p>
      The calculated fuel consumption is
      <span style="color: orange">{{result}}</span>
    </p>
    {% endif %}
  </body>
</html>
```

We made two changes to the file:

- Specify the action to take when the `form` is submitted: `<form action="{{ url_for('main') }}" method="POST">`
- Show the prediction below the form:

```html
{% if result %}
<p>
  The calculated fuel consumption is
  <span style="color: orange">{{result}}</span>
</p>
{% endif %}
```

In this case, we had to use the conditional `if` to display `result` only if it exists, as `result` won't exist until the form is submitted and the server computes the prediction in `app.py`.

I did some research about an error in which Heroku wasn't working the way I expected, and found that I needed to add a `Procfile`.

Create a file in the `app` folder called `Procfile` (Heroku expects this exact name, with a capital P).

Write the following line and save the file:

`web: gunicorn app:app`

The folder structure will now be:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
  - Procfile
```

Install the `gunicorn` package in the virtual environment. In the terminal:

```
pip install gunicorn
```

Now it's time to upload the application to Heroku so that anyone can get a prediction of fuel consumption given a car's `acceleration` and `weight`.

- Create an account on Heroku.
- Download the Heroku CLI.
- Create the Heroku app from the terminal:

`heroku create ml-model-deployment-car-mpg`

This will be translated into a website at https://ml-model-deployment-car-mpg.herokuapp.com/

PS: You should use a different name instead of `ml-model-deployment-car-mpg`, since Heroku app names must be unique; Heroku will turn your repository into a `url`.

**Commit the app files to your Heroku remote.**

- Run `git init` within the `car_consumption_prediction` folder.
- Create a `requirements.txt` file listing the required packages. You can generate it automatically with: `pip freeze > requirements.txt`

The folder structure will now be:

```
- app
  - model
    - linear_regression_model.pkl
  - templates
    - index.html
  - app.py
  - Procfile
  - requirements.txt
```

Add the files for commit:

`git add .`

Commit the files and push to the Heroku remote:

```
git commit -m 'some random message'
git push heroku master
```

That's all for the technical side.

Now if some user would like to use the app...

- Visit https://ml-model-deployment-car-mpg.herokuapp.com/
- Introduce some numbers in the form
- Submit and watch the prediction

- https://blog.cambridgespark.com/deploying-a-machine-learning-model-to-the-web-725688b851c7

It's when you start searching for the most prestigious master's degrees related to today's most in-demand jobs, and you find some related to Artificial Intelligence and Python, and others about Big Data or Machine Learning. With so much on offer, your curiosity is piqued and you weigh up the different opportunities. **You discover an interest in programming** that had been hidden until now, since computing was never your strong suit, and you choose a master's suited to your **new interests**. Of course, this master's will make you stand out in the job market, smoothing your path toward your desired position.

Once you have chosen the course, you apply for enrolment and are accepted, which fills you with joy and gives you extra motivation to start your studies. At first you keep up with everything, but at some point things get **harder than you expected** and you find yourself truly lost. At this point you look for **help on Google** and in the bibliography your teachers keep recommending, and even so you cannot overcome the problems.

The master's is worth it and you have to finish it, so you decide to look for a private tutor to help you with the assignments and get through. At first you feel able to complete the assignments while also learning the solutions; however, as the days go by, you realise that you **don't have enough time** to do both: the work piles up. Now you get nervous and start to lose sight of learning, the very reason you started this course. **The degree (credentialism) takes over your mind**, and all you want at this stage is access to the job prospects it offers, so, on top of the tuition fees, you spend a large amount of money on the tutor who helps you do the assignments. Meanwhile, you convince yourself that in the future you will study the content you are now leaving in other hands for lack of time, but your mantra becomes a **vicious circle you cannot escape**, since every job interview demands preliminary exercises that you find impossible to complete.

Your hopes and dreams fade away little by little.

After some serious reflection, Pepa cannot understand how she got into this situation; how she could lose her confidence in so little time. What she doesn't know is that her case is very common, and that she is not the only person who has ended up like this. We do know it, though, because it is the story of many of our clients; it is the problem of several people who have contacted Sotstica. That is why we count on Professor Jesús López, one of the best-rated private tutors in Spain. Jesús can help you better understand the Python programming language with a direct, dynamic application to Data Science. After teaching more than 300 people and over 3,000 hours of lessons in the last two years, he has developed a methodology that connects all the topics of Data Science and makes sure you understand them so you can write code by yourself. The programme lasts around 25 hours, combining explanatory sessions, tasks for the student, and correction sessions.

Click here to see his students' reviews.

On our web page you can see the content of the programme to become a creative Data Scientist.

- A visual representation of the data

Which data? How is it usually structured?

- In a table. For example:

```python
import seaborn as sns

df = sns.load_dataset('mpg', index_col='name')
df.head()
```

How can you visualize this `DataFrame`?

- We could make a point for every car based on:
  - weight
  - mpg

```python
sns.scatterplot(x='weight', y='mpg', data=df);
```

Which conclusions can you draw from this plot?

Well, you may observe that the points descend as we move to the right.

This means that a higher `weight` may lead the car to travel fewer miles per gallon (`mpg`).

How can you measure this relationship?

- Linear Regression

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X=df[['weight']], y=df.mpg)
model.__dict__
```

- Resulting in

```python
{'fit_intercept': True,
 'normalize': False,
 'copy_X': True,
 'n_jobs': None,
 'n_features_in_': 1,
 'coef_': array([-0.00767661]),
 '_residues': 7474.8140143821,
 'rank_': 1,
 'singular_': array([16873.20281508]),
 'intercept_': 46.31736442026565}
```

Which is the mathematical formula for this relationship?

$$mpg = 46.31 - 0.00767 \cdot weight$$

- This equation means that `mpg` gets 0.00767 units lower for **every unit** that `weight` **increases**.

Could you visualise this equation in a plot?

- Absolutely, we could make the predictions from the original data and plot them.

```python
y_pred = model.predict(X=df[['weight']])
dfsel = df[['weight', 'mpg']].copy()
dfsel['prediction'] = y_pred
dfsel.head()
```

| name | weight | mpg | prediction |
|---|---|---|---|
| chevrolet chevelle malibu | 3504 | 18.0 | 19.418523 |
| buick skylark 320 | 3693 | 15.0 | 17.967643 |
| plymouth satellite | 3436 | 18.0 | 19.940532 |
| amc rebel sst | 3433 | 16.0 | 19.963562 |
| ford torino | 3449 | 17.0 | 19.840736 |

Out of this table, you can observe that the predictions don't exactly match reality, but they come close.

For example, Ford Torino's `mpg` is 17.0, but our model predicts 19.84.
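We can check this prediction by hand, plugging Ford Torino's weight into the fitted equation with the coefficients shown earlier:

```python
# mpg = intercept + coef * weight, using the fitted values from model.__dict__
intercept = 46.31736442026565
coef = -0.00767661
weight = 3449  # ford torino

mpg_pred = intercept + coef * weight
print(round(mpg_pred, 2))  # -> 19.84, matching the prediction column
```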

```python
sns.scatterplot(x='weight', y='mpg', data=dfsel)
sns.scatterplot(x='weight', y='prediction', data=dfsel);
```

- The blue points represent the actual data.
- The orange points represent the predictions of the model.

I teach Python, R, Statistics & Data Science. I like to produce content that helps people to understand these topics better.

Feel free and welcome to give me feedback, as I would like to make my tutorials clearer and generate content that interests you.

You can see my Tutor Profile here if you need Private Tutoring lessons.

It's tough to find things that always work the same way in programming.

The steps of a Machine Learning (ML) model can be an exception.

Each time we want to compute a model *(a mathematical equation)* and make predictions with it, we always follow the same steps:

- `model.fit()` to **compute the numbers** of the mathematical equation.
- `model.predict()` to **calculate predictions** through the mathematical equation.
- `model.score()` to measure **how good the model's predictions are**.
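The three steps can be sketched with a toy dataset (the data below is made up purely for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: one feature, binary target
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier()
model.fit(X, y)           # 1. compute the numbers of the mathematical equation
preds = model.predict(X)  # 2. calculate predictions
acc = model.score(X, y)   # 3. measure how good the predictions are
print(acc)
```

On this trivial dataset the tree separates the classes perfectly, so the score is 1.0; real datasets, like the one below, won't be that clean.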

And I am going to show you this with 3 different ML models:

- `DecisionTreeClassifier()`
- `SVC()`
- `KNeighborsClassifier()`

But first, let's load a dataset from CIS by executing the lines of code below:

- The goal of this dataset is to predict the `internet_usage` of people (rows) based on their socio-demographic characteristics (columns).

```python
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jsulopz/data/main/internet_usage_spain.csv')
df.head()
```

| | internet_usage | sex | age | education |
|---|---|---|---|---|
| 0 | 0 | Female | 66 | Elementary |
| 1 | 1 | Male | 72 | Elementary |
| 2 | 1 | Male | 48 | University |
| 3 | 0 | Male | 59 | PhD |
| 4 | 1 | Female | 44 | PhD |

We need to transform the categorical variables to **dummy variables** before computing the models:

```python
df = pd.get_dummies(df, drop_first=True)
df.head()
```
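To see what `get_dummies` does, here is a toy example (this small frame is made up for illustration; it is not the CIS dataset):

```python
import pandas as pd

# A single categorical column with two levels
toy = pd.DataFrame({'sex': ['Female', 'Male', 'Male']})

# 'sex' becomes a single 0/1 column 'sex_Male';
# drop_first=True removes the redundant 'sex_Female' column
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)
```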

Now we separate the variables on their respective role within the model:

```python
target = df.internet_usage
explanatory = df.drop(columns='internet_usage')
```

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X=explanatory, y=target)
pred_dt = model.predict(X=explanatory)
accuracy_dt = model.score(X=explanatory, y=target)
```

```python
from sklearn.svm import SVC

model = SVC()
model.fit(X=explanatory, y=target)
pred_sv = model.predict(X=explanatory)
accuracy_sv = model.score(X=explanatory, y=target)
```

```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X=explanatory, y=target)
pred_kn = model.predict(X=explanatory)
accuracy_kn = model.score(X=explanatory, y=target)
```

The only things that change are the resulting predictions. The models are different, but they all follow the **same steps** that we described at the beginning:

- `model.fit()` to compute the mathematical formula of the model
- `model.predict()` to calculate predictions through the mathematical formula
- `model.score()` to get the success ratio of the model

You may observe in the following table how the *different models make different predictions*, which don't always coincide with reality (misclassification).

For example, the `SVC()` model doesn't correctly predict row 214: it predicts that this person *used the internet* (`pred_svm=1`), but in reality `internet_usage` for row 214 is 0.

```python
df_pred = pd.DataFrame({'internet_usage': df.internet_usage,
                        'pred_dt': pred_dt,
                        'pred_svm': pred_sv,
                        'pred_kn': pred_kn})
df_pred.sample(10, random_state=7)
```

| | internet_usage | pred_dt | pred_svm | pred_kn |
|---|---|---|---|---|
| 214 | 0 | 0 | 1 | 0 |
| 2142 | 1 | 1 | 1 | 1 |
| 1680 | 1 | 0 | 0 | 0 |
| 1522 | 1 | 1 | 1 | 1 |
| 325 | 1 | 1 | 1 | 1 |
| 2283 | 1 | 1 | 1 | 1 |
| 1263 | 0 | 0 | 0 | 0 |
| 993 | 0 | 0 | 0 | 0 |
| 26 | 1 | 1 | 1 | 1 |
| 2190 | 0 | 0 | 0 | 0 |

Then, we could choose the model with the **highest number of successes** in predicting reality.

```python
df_accuracy = pd.DataFrame({'accuracy': [accuracy_dt, accuracy_sv, accuracy_kn]},
                           index=['DecisionTreeClassifier()', 'SVC()', 'KNeighborsClassifier()'])
df_accuracy
```

| | accuracy |
|---|---|
| DecisionTreeClassifier() | 0.859878 |
| SVC() | 0.783707 |
| KNeighborsClassifier() | 0.827291 |
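The `accuracy` that `model.score()` returns is simply the fraction of predictions that match reality, which we can verify with a made-up example:

```python
import numpy as np

# Made-up target and predictions, just to show what accuracy measures
target = np.array([0, 1, 1, 0, 1])
pred = np.array([0, 1, 0, 0, 1])

accuracy = (pred == target).mean()  # fraction of correct predictions
print(accuracy)  # 4 of 5 match -> 0.8
```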

Which is the best model here?

- Let me know in the comments below