Resolving Python

#01 Challenge | Delhi's Air Quality Data

Jesús López — Fri, 04 Nov 2022 11:32:24 GMT

We have started a biweekly series of challenges in this Study Circle. After considering the topics you have suggested in the comments, we are kicking off with Time Series.

Why this Data topic?

This morning, I read the Economist Espresso on India's pollution season, and I thought it was a good idea to start the series of challenges with this topic.

Getting the Data

After navigating many websites, such as India's Central Pollution Control Board and WHO, I found this website about Air Quality Data where we can download the data from many places worldwide.

I chose Delhi to be the city we will analyze in this challenge.

Executing the following lines of code will produce the DataFrame we'll work with:

import pandas as pd

df = pd.read_csv('anand-vihar, delhi-air-quality.csv', parse_dates=['date'], index_col=0)
df

I needed to process the data to deliver a workable dataset in the following way:

# remove whitespace in the column names
df.columns = df.columns.str.strip()

# get the rows with numbers (some of them were whitespace) and convert the text to numeric values
series = pd.to_numeric(df['pm25'].str.extract(r'(\w+)')[0], errors='coerce')

# rolling average to harmonize the data monthly
series_monthly = series.rolling(30).mean()

# remove missing values
series_monthly = series_monthly.dropna()

# fill any remaining gaps by linear interpolation
series_monthly = series_monthly.interpolate(method='linear')

# sort the index to later make a reasonable plot
series_monthly = series_monthly.sort_index()

# aggregate the information by month
series_monthly = series_monthly.to_period('M').groupby(level='date').mean()

# convert back to timestamps to avoid errors with statsmodels' functions
series_monthly = series_monthly.to_timestamp()

# set the frequency to avoid errors with statsmodels' functions
series_monthly = series_monthly.asfreq("MS").interpolate()

# change the name of the pandas.Series
series_monthly.name = 'air pollution pm25'
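To see what the cleaning steps do, here is a minimal sketch on a tiny made-up table. The values, the dates, and the names raw, values, and monthly are assumptions for illustration, not the real Delhi file. It strips the column names, turns the text readings into numbers, and averages them by month:

```python
import pandas as pd

# Made-up daily readings; note the padded column name and the blank value
raw = pd.DataFrame(
    {' pm25': [' 100', ' 110', ' ', ' 200', ' 220']},
    index=pd.to_datetime(
        ['2022-01-01', '2022-01-02', '2022-01-03', '2022-02-01', '2022-02-02']
    ),
)
raw.index.name = 'date'

# Remove the whitespace around the column names
raw.columns = raw.columns.str.strip()

# Keep only the word characters and convert them to numbers (blanks become NaN)
values = pd.to_numeric(raw['pm25'].str.extract(r'(\w+)')[0], errors='coerce')

# Average by calendar month; NaN values are skipped by mean()
monthly = values.groupby(values.index.to_period('M')).mean()
print(monthly)
```

January averages the two valid readings (100 and 110), and the blank row is simply ignored.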

As we don't know the coding skills of this Study Circle's members, we'll start with simple ARIMA models. From this point, we will iterate the procedure and improve the dynamic.

To take on the challenge and maybe receive some feedback, you should fork this repository to your GitHub account. Otherwise, you can download this script.

The end goal is to develop an ARIMA model and plot the predictions against the actual data, resulting in a plot like this one.

Nevertheless, you can develop this challenge in any way you find attractive. The essential point of this Study Circle is the interactivity between the members to generate value and knowledge.

From your feedback, we could later work on different use cases. For example, we could later create a geospatial map in Python with the predictions.

So, let's get on and good luck!

You start with the following object:

Learning Materials

Check out the following materials to learn how you could develop the challenge:

Video Tutorial: How to develop ARIMA models to predict Stock Price

Start the challenge

series_monthly

date
2014-01-01    286.023457
2014-02-01    281.428205
                 ...
2022-08-01    115.487097
2022-09-01    143.713333
Freq: MS, Name: air pollution pm25, Length: 105, dtype: float64

Observing the data as numbers is not the same as seeing it in a chart:

series_monthly.plot();

We aim to compute a mathematical equation that we will later use to calculate predictions, as we can see in the following chart:

There are many types of mathematical equations; the one we will use is ARIMA. Don't worry about the maths: we just need a Python class to do it all for us.

from statsmodels.tsa.arima.model import ARIMA

The parameters of this class ask for two objects:

endog: the data
order: (p, d, q)
1. p: the first significant lag in the Partial Autocorrelation Plot
2. d: the number of differences needed to make our data stationary
3. q: the first significant lag in the Autocorrelation Plot

`d` | Diff to get data stationarity

The first thing we need to check about our data is stationarity. We use the Augmented Dickey-Fuller test, whose null hypothesis states that the data is non-stationary, and we aim to reject it. If we can't reject it, we need to difference the time series and set d=1 in the parameter order=(p, 1, q).

from statsmodels.tsa.stattools import adfuller

result = adfuller(series_monthly)

The p-value is given by the second element the function adfuller returns:

result[1]

-> 0.4244071993737921

The p-value is greater than 0.05. Therefore, we can't reject the null hypothesis.

Are we done here?

No, we can difference the time series once and test again:

series_monthly_diff_1 = series_monthly.diff().dropna()
result = adfuller(series_monthly_diff_1)
result[1]

-> 2.4066471086483724e-24

We can reject the null hypothesis and say that our data is stationary after differencing once. Therefore, we need to set d=1 in the order parameter of the ARIMA() class.
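The effect of differencing is easier to see on a made-up series with a clear trend (the numbers and the names trend and trend_diff_1 are assumptions for illustration). Each value of the differenced series is the change from the previous observation, so a steady upward trend turns into a constant:

```python
import pandas as pd

# A made-up series that grows by 5 units every month (its mean is not constant)
trend = pd.Series(
    [100, 105, 110, 115, 120],
    index=pd.date_range('2022-01-01', periods=5, freq='MS'),
)

# diff() subtracts the previous value; the first row has no predecessor (NaN)
trend_diff_1 = trend.diff().dropna()
print(trend_diff_1.tolist())  # [5.0, 5.0, 5.0, 5.0]
```

Real data rarely becomes this flat, which is why we test the differenced series again instead of assuming it is stationary.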

`q` | Autocorrelation Plot

Now we need to determine q based on the first significant lag of the autocorrelation plot:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(series_monthly_diff_1, lags=50)
plt.xlabel('Lag');

The first significant lag is 2, which means that each value of our differenced monthly data is correlated with the value two months earlier. We set q=2.
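If you wonder what the plot computes, the following is a rough sketch of the sample autocorrelation at a given lag, written with NumPy on a made-up series. The helper autocorr and the band 1.96/sqrt(n) are assumptions for illustration: the band is a common rule of thumb, and the shaded area that statsmodels draws may differ slightly.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # Covariance between the series and its lagged copy, over the total variance
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

# Made-up series that alternates up and down:
# lag 1 is strongly negative and lag 2 strongly positive
x = np.array([1.0, -1.0] * 10)

# Rule-of-thumb 95% band used to judge whether a lag is significant
band = 1.96 / np.sqrt(len(x))

for lag in (1, 2):
    r = autocorr(x, lag)
    print(lag, round(r, 2), abs(r) > band)
```

A lag counts as significant when its bar leaves the band, which is what we look for in the chart.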

`p` | Partial Autocorrelation Plot

We follow the same procedure to choose a number for p. But this time, we use another type of plot: Partial Autocorrelation.

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(series_monthly_diff_1, lags=50, method='ywm')
plt.xlabel('Lag');

We see the first significant lag at 2. Therefore, we set p=2.

We already know which numbers to set in the order parameter: order=(2, 1, 2), that is, p=2, d=1, and q=2. So, let's fit the mathematical equation of the model.

model = ARIMA(series_monthly, order=(2,1,2))
result = model.fit()
result.summary()

And calculate the predictions:

import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
series_monthly.plot(label='Actual Data')
result.predict().plot(label='Predicted Data')
plt.legend()
plt.xticks(rotation=45);

Summarise Time Series data with the DataFrame.resample function

Jesús López — Thu, 03 Nov 2022 09:31:38 GMT

Don't think of a for loop if you want to summarise your daily Time Series by years.

Instead, use the function resample() from pandas.

Let me explain it with an example.

We start by loading a DataFrame from a CSV file that contains information on the TSLA stock from 2017-2022.

import pandas as pd

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tsla = pd.read_csv(filepath_or_buffer=url)
df_tsla

cc: @elonmusk
You're welcome for the promotion 😉

You must ensure that the dtype of the Date column is datetime.

It must not be an object as in the picture (often interpreted as a string).

df_tsla.dtypes.to_frame(name='dtype')

We need to convert the Date column into a datetime dtype. To do so, we can use the function pd.to_datetime():

df_tsla.Date = pd.to_datetime(df_tsla.Date)
df_tsla.dtypes.to_frame(name='dtype')

Before getting into the resample() function, we need to set the column Date as the index of the DataFrame:

df_tsla.set_index('Date', inplace=True)
df_tsla
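The two preparation steps can be checked on a tiny made-up table (the dates, prices, and the name df_demo are assumptions for illustration):

```python
import pandas as pd

# Made-up table where the dates arrive as plain text
df_demo = pd.DataFrame(
    {'Date': ['2022-01-03', '2022-01-04'], 'Close': [399.9, 383.2]}
)

# Convert the text into real timestamps
df_demo['Date'] = pd.to_datetime(df_demo['Date'])

# Move the dates into the index so that time-based methods can use them
df_demo = df_demo.set_index('Date')

is_datetime_index = isinstance(df_demo.index, pd.DatetimeIndex)
print(is_datetime_index)
```

When reading a CSV, the parse_dates and index_col parameters of pd.read_csv can do both steps at load time.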

Now let the magic happen; we'll get the maximum value of each column for each year with this simple line of code:

df_tsla.resample(rule='Y').max()

We can do many other things:

Summarise by Quarter.
Calculate the average and the standard deviation (volatility).

df_tsla.resample(rule='Q').agg(['mean', 'std'])

To finish it, I always like to add a background_gradient() to the DataFrame:

df_tsla.resample(rule='Y').max().style.background_gradient('Greens')
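To convince yourself that resample() really replaces the loop, here is a small sketch on made-up daily data (the values and the names df_prices, yearly_max, and yearly_max_groupby are assumptions for illustration). It uses the rule 'YS', which labels each group with the first day of the year, because the spelling of the end-of-year alias has changed between pandas releases; with the version used in this article, 'Y' works as shown above.

```python
import numpy as np
import pandas as pd

# Made-up daily prices covering two full years
dates = pd.date_range('2020-01-01', '2021-12-31', freq='D')
df_prices = pd.DataFrame({'Close': np.arange(len(dates), dtype=float)}, index=dates)

# One row per year; 'YS' labels each group with the first day of the year
yearly_max = df_prices.resample('YS').max()

# The same summary with a plain groupby on the year, for comparison
yearly_max_groupby = df_prices.groupby(df_prices.index.year).max()

print(yearly_max)
```

Both approaches return the same numbers; resample() keeps a date-based index, which is handy for plotting.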

If you enjoyed this, I'd appreciate it if you could support my work by spreading the word 😊

#06 | Locating & Filtering the pandas.DataFrame

Jesús López — Fri, 14 Oct 2022 23:29:48 GMT

Possibilities

Sometimes, we want to select specific parts of the DataFrame to highlight some data points.

In this case, we refer to the topic as locating & filtering.

For example, let's load the dataset of cars:

import seaborn as sns

df_mpg = sns.load_dataset('mpg', index_col='name').drop(columns=['cylinders', 'model_year', 'origin'])
df_mpg

Let's filter the best cars in each statistic/column.

First, we calculate the maximum values in each column:

df_mpg.max()

mpg               46.6
displacement     455.0
horsepower       230.0
weight          5140.0
acceleration      24.8
dtype: float64

Then, we create a mask (array with True/False) to capture the rows where we have the cars with maximum values:

mask_max = (df_mpg == df_mpg.max()).sum(axis=1) > 0
mask_max

name
chevrolet chevelle malibu    False
buick skylark 320            False
                             ...
ford ranger                  False
chevy s-10                   False
Length: 398, dtype: bool

Select the rows where the mask is True:

df_mpg_max = df_mpg[mask_max].copy()
df_mpg_max

And add some styling:

df_mpg_max.style.format('{:.0f}').background_gradient()
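The mask logic is easier to follow on a table small enough to read at a glance. This is a sketch with made-up cars and values (df_small, is_max, and mask are names chosen for illustration); it uses any(axis=1), which gives the same result as summing the True values and checking that the sum is above zero:

```python
import pandas as pd

# Made-up table: three cars and two statistics
df_small = pd.DataFrame(
    {'mpg': [18.0, 46.6, 30.0], 'horsepower': [230.0, 65.0, 90.0]},
    index=['car a', 'car b', 'car c'],
)

# True in each cell that holds the maximum of its column
is_max = df_small == df_small.max()

# A row is kept if at least one of its cells is a column maximum
mask = is_max.any(axis=1)
print(df_small[mask])
```

Here 'car a' is kept for its horsepower and 'car b' for its mpg, while 'car c' is not the best in any column.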

To understand the reasoning behind the previous example, read the rest of the article, where we explain the logic from the most basic example to locating data based on the index.

Any Object

By now, we should know the difference between the brackets [] and the parentheses ().

We use brackets to select parts of an object. For example, let's create a list of days:

list_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

And select the second element:

list_days[1]

'Tuesday'

Or the last element:

list_days[-1]

'Sunday'

Up to the third element (included):

list_days[:3]

['Monday', 'Tuesday', 'Wednesday']

Nevertheless, the list is a simple element of Python. To get more functionality, we use the Series object from the pandas library.

Series

Let's create a Series to store the Apple Stock Return on Investment (ROI) by quarters:

import pandas as pd

sr_apple = pd.Series(
    data=[59.02, 63.57, 66.93, 69.05],
    index=['1Q', '2Q', '3Q', '4Q']
)
sr_apple

1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

`iloc` (integer-location) property

We use .iloc[] to select parts of the object based on the integer position of the element.

For example, let's select the first quarter ROI:

sr_apple.iloc[0]

59.02

Now, let's select the first, third, and fourth quarters:

To select more than one element, we need to use double brackets [[]]:

sr_apple.iloc[[0,2,3]]

1Q    59.02
3Q    66.93
4Q    69.05
dtype: float64

Could we have accessed it with the name 1Q?

sr_apple.iloc['1Q']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [99], in ()
----> 1 sr_apple.iloc['1Q']

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.__getitem__(self, key)
    964 axis = self.axis or 0
    966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1517, in _iLocIndexer._getitem_axis(self, key, axis)
   1515 key = item_from_zerodim(key)
   1516 if not is_integer(key):
-> 1517     raise TypeError("Cannot index by location index with a non-integer key")
   1519 # validate the location
   1520 self._validate_integer(key, axis)

TypeError: Cannot index by location index with a non-integer key

The iloc property only works with integers (the position of the elements we want).

To select the elements by their label/name, we need to use the loc property:

`loc` (location) property

We select parts of an object with the .loc[] property based on the label/name of the index:

sr_apple.loc['1Q']

59.02

sr_apple.loc[['1Q', '3Q', '4Q']]

1Q    59.02
3Q    66.93
4Q    69.05
dtype: float64

If we try to access by position, we get an error:

sr_apple.loc[0]

---------------------------------------------------------------------------

KeyError Traceback (most recent call last)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)

Input In [102], in ()
----> 1 sr_apple.loc[0]

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1202, in _LocIndexer._getitem_axis(self, key, axis)
   1200 # fall thru to straight lookup
   1201 self._validate_key(key, axis)
-> 1202 return self._get_label(key, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1153, in _LocIndexer._get_label(self, label, axis)
   1151 def _get_label(self, label, axis: int):
   1152     # GH#5667 this will fail if the label is not present in the axis.
-> 1153     return self.obj.xs(label, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:3864, in NDFrame.xs(self, key, axis, level, drop_level)
   3862     new_index = index[loc]
   3863 else:
-> 3864     loc = index.get_loc(key)
   3866 if isinstance(loc, np.ndarray):
   3867     if loc.dtype == np.bool_:

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     # InvalidIndexError. Otherwise we fall through and re-raise
   3627     # the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 0

It results in a KeyError because the index doesn't contain any key equal to 0:

sr_apple

1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

We have:

sr_apple.keys()

Index(['1Q', '2Q', '3Q', '4Q'], dtype='object')

The loc property only works with the labels, not the position.
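The difference between both properties becomes clearer when the labels are integers that don't match the positions. A minimal sketch with a made-up Series (sr_ids and its labels 10, 20, 30 are assumptions for illustration):

```python
import pandas as pd

# Made-up Series whose labels are integers that do not match the positions
sr_ids = pd.Series(data=[59.02, 63.57, 66.93], index=[10, 20, 30])

by_label = sr_ids.loc[10]      # label 10 is the first element
by_position = sr_ids.iloc[0]   # position 0 is the first element as well

# Label 0 does not exist, so .loc raises a KeyError
try:
    sr_ids.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```

Both lines return the same value, but for different reasons: loc looks for the label 10, while iloc counts from the start.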

Masking with boolean objects

Now we'd like to select parts based on a condition. For example, let's show the quarters in which we had a Return on Investment (ROI) above 60.

First, we create a boolean object based on the stated condition:

sr_apple

1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

sr_apple > 60

1Q    False
2Q     True
3Q     True
4Q     True
dtype: bool

mask_60 = sr_apple > 60

Now we pass the previous object to the .loc property:

sr_apple.loc[mask_60]

2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

And here, we have the data for which the ROI is higher than 60.

Just the brackets `[]`

sr_apple

1Q    59.02
2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

We could also access the data by only using the brackets, without the .loc or .iloc properties:

sr_apple['1Q']

59.02

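And also by position (recent pandas versions warn that this is deprecated for a Series with non-integer labels, so prefer .iloc there):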

sr_apple[0]

59.02

And the mask:

sr_apple[mask_60]

2Q    63.57
3Q    66.93
4Q    69.05
dtype: float64

So far, we have played with 1-Dimensional objects. Now it's time to level up and play with 2-Dimensional objects, like the DataFrame.

DataFrame

Let's play with a dataset of cars:

import seaborn as sns

df_mpg = sns.load_dataset(name='mpg', index_col='name')
df_mpg

`iloc` (integer-location) property

We can select the third row (integer position 2):

df_mpg.iloc[2]

mpg                  18.0
cylinders               8
displacement        318.0
horsepower          150.0
weight               3436
acceleration         11.0
model_year             70
origin                usa
Name: plymouth satellite, dtype: object

And keep the DataFrame style if we use double brackets [[]]:

df_mpg.iloc[[2]]

We can also slice (a term used for filtering as well) consecutive elements of the DataFrame with the colon :.

For example, let's select the first 4 rows:

df_mpg.iloc[:4]

Instead of:

df_mpg.iloc[[0,1,2,3]]

We can also select the columns we want.

For example, let's select the first 4 rows and the first 3 columns:

df_mpg.iloc[:4, :3]

Or the rest of the columns, starting at integer position 3 (the fourth column):

df_mpg.iloc[:4, 3:]

Or the last 3 columns by using a negative index:

df_mpg.iloc[:4, -3:]

`loc` (location) property

We can also select parts of the DataFrame based on the index and column labels (2-Dimensions):

df_mpg.loc[['ford torino', 'fiat 124 sport coupe'], ['origin', 'model_year', 'cylinders']]

df_mpg.loc[:'fiat 124 sport coupe', :'cylinders']

Masking with boolean objects

Single Condition

Out of all the cars:

df_mpg.index

Index(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala', 'plymouth fury iii', 'pontiac catalina', 'amc ambassador dpl', ... 'chrysler lebaron medallion', 'ford granada l', 'toyota celica gt', 'dodge charger 2.2', 'chevrolet camaro', 'ford mustang gl', 'vw pickup', 'dodge rampage', 'ford ranger', 'chevy s-10'], dtype='object', name='name', length=398)

We could select all the fiat cars if we had a boolean array based on this condition:

mask_fiat = df_mpg.index.str.contains('fiat')
mask_fiat

array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, 
False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False])

We can observe a few Trues where we find some Fiats.

Let's filter them and show all the columns with the colon (:):

df_mpg.loc[mask_fiat, :]

Although we could have omitted the colon:

df_mpg.loc[mask_fiat]

Multiple Conditions

Both Conditions `&`

Just the fiats whose horsepower is above 80:

mask_hp = df_mpg.horsepower > 80
mask_hp

name
chevrolet chevelle malibu     True
buick skylark 320             True
                             ...
ford ranger                  False
chevy s-10                    True
Name: horsepower, Length: 398, dtype: bool

df_mpg.loc[mask_hp & mask_fiat, :]

Any Condition `|`

We could also select all fiats OR cars whose horsepower is above 80:

df_mpg.loc[mask_hp | mask_fiat, :]
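The same operators can be tried on a made-up table that is small enough to verify by eye. The car names, the horsepower values, and the name df_cars are assumptions for illustration. The sketch also shows the negation operator ~ and the parentheses that each condition needs when written inline:

```python
import pandas as pd

# Made-up table with four cars
df_cars = pd.DataFrame(
    {'horsepower': [90.0, 75.0, 150.0, 86.0]},
    index=['fiat 124b', 'fiat 128', 'ford torino', 'fiat 131'],
)

mask_fiat = df_cars.index.str.contains('fiat')
mask_hp = df_cars['horsepower'] > 80

both = df_cars.loc[mask_hp & mask_fiat]     # fiats above 80 hp
either = df_cars.loc[mask_hp | mask_fiat]   # fiats, or any car above 80 hp
not_fiat = df_cars.loc[~mask_fiat]          # every car that is not a fiat

# When the conditions are written inline, each one needs its own parentheses
inline = df_cars.loc[(df_cars['horsepower'] > 80) & (df_cars['horsepower'] < 100)]

print(both.index.tolist())
```

Without the parentheses, Python would evaluate & before the comparisons and raise an error.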

Just the brackets `[]`

We can select the columns by their labels:

df_mpg['acceleration']

name
chevrolet chevelle malibu    12.0
buick skylark 320            11.5
                             ...
ford ranger                  18.6
chevy s-10                   19.4
Name: acceleration, Length: 398, dtype: float64

df_mpg[['acceleration', 'origin', 'model_year']]

But we can't select the rows by the index labels:

df_mpg['amc rebel sst']

---------------------------------------------------------------------------

KeyError Traceback (most recent call last)

...

KeyError: 'amc rebel sst'

Unless we use the colon (:) to slice:

df_mpg[:'amc rebel sst']

df_mpg['buick skylark 320':'amc rebel sst']

We can also select the rows by position:

df_mpg[:4]

But we can't select both rows and columns (2-Dimensions):

df_mpg[:4,:3]

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

...

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:5637, in Index._check_indexing_error(self, key)
   5633 def _check_indexing_error(self, key):
   5634     if not is_scalar(key):
   5635         # if key is not a scalar, directly raise an error (the code below
   5636         # would convert to numpy arrays and raise later any way) - GH29926
-> 5637         raise InvalidIndexError(key)

InvalidIndexError: (slice(None, 4, None), slice(None, 3, None))

Unless we specify the columns we want in extra brackets:

df_mpg[:4]['acceleration']

name
chevrolet chevelle malibu    12.0
buick skylark 320            11.5
plymouth satellite           11.0
amc rebel sst                12.0
Name: acceleration, dtype: float64

df_mpg[:4][['acceleration']]

df_mpg[:4][['acceleration', 'origin']]

We can also select the rows given boolean-arrays (a.k.a. masks):

df_mpg[mask_fiat]

df_mpg[mask_fiat | mask_hp]

df_mpg[mask_fiat & mask_hp]

This doesn't mean that we cannot later select the columns we want (in programming, almost everything is possible; we just need to find a way):

df_mpg[mask_fiat & mask_hp]['mpg']

name
fiat 124 sport coupe    26.0
fiat 131                28.0
Name: mpg, dtype: float64

df_mpg[mask_fiat & mask_hp][['mpg', 'origin', 'model_year']]

Everything may be a bit confusing, but we hope you get the main idea behind locating and masking:

Select the parts of an object with brackets []
We can access it through
1. The label/name: loc
2. The integer position: iloc
3. Masks: boolean arrays based on conditions
4. Just the brackets []*
If the object has:
1. 1 dimension: object[:]
2. 2 dimensions: object[:, :]

*Use them carefully because they behave differently depending on the use case, as we observed above

DataFrame MultiIndex

Let's load a dataset with various categorical columns since we summarise data based on categories, not numbers.

df_tips = sns.load_dataset(name='tips')
df_tips

Let's make a pivot table to summarise the information to obtain a Hierarchical* DataFrame.

dfres = df_tips.pivot_table(index=['smoker', 'time'], columns='sex', aggfunc='size')
dfres

*A Hierarchical DataFrame (MultiIndex) uses two "columns" as its index, as we may observe below:

dfres.index

MultiIndex([('Yes',  'Lunch'),
            ('Yes', 'Dinner'),
            ( 'No',  'Lunch'),
            ( 'No', 'Dinner')],
           names=['smoker', 'time'])

First Index

Let's locate some parts of the Hierarchical DataFrame:

dfres

By using the .loc property:

dfres.loc['Yes', :]

dfres.loc['No', :]

Second Index

As we have multiple indexes [index1, index2, columns], we can select a part of the second index:

dfres.loc[:, 'Lunch', :]

dfres.loc[:, 'Dinner', :]

DataFrame MultiIndex & MultiColumns

Let's now play with a DataFrame that is both MultiIndex and MultiColumns:

dfres = df_tips.pivot_table(index=['smoker', 'time'], columns=['sex', 'day'], aggfunc='size')
dfres

We may observe two levels in the columns above.

`loc` (location) property

First Index

We apply the same reasoning we used in the previous sections, [index1, index2, column1, column2].

dfres.loc['No', :, :, :]

Although we can make it shorter:

dfres.loc['No', :]

Second Index

The same applies to the second index:

dfres.loc[:,'Dinner', :, :]

dfres.loc[:,'Dinner', :]

Second Index & Second Column

Let's try to get Dinners on Sundays:

dfres.loc[:, 'Dinner', :, 'Sun']

---------------------------------------------------------------------------

IndexError Traceback (most recent call last)

Input In [158], in ()
----> 1 dfres.loc[:, 'Dinner', :, 'Sun']

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:961, in _LocationIndexer.__getitem__(self, key)
    959     if self._is_scalar_access(key):
    960         return self.obj._get_value(*key, takeable=self._takeable)
--> 961     return self._getitem_tuple(key)
    962 else:
    963     # we by definition only have the 0th axis
    964     axis = self.axis or 0

...

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/frozen.py:70, in FrozenList.__getitem__(self, n)
     68 if isinstance(n, slice):
     69     return type(self)(super().__getitem__(n))
---> 70 return super().__getitem__(n)

IndexError: list index out of range

To make it work, this time we need an intermediate object, pd.IndexSlice, to separate the row selection from the column selection:

idx = pd.IndexSlice
dfres.loc[idx[:, 'Dinner'], idx[:, 'Sun']]

Second Index & First Column

dfres.loc[idx[:, 'Dinner'], idx['Male', :]]

Using the Slice

We can also use Python's built-in slice() function, where slice(None) plays the role of the colon:

dfres.loc[('Yes', slice(None)), (slice(None), 'Sun')]

dfres.loc['Yes', ('Female', slice(None))]

dfres.loc[(slice(None), 'Lunch'), 'Female']

dfres.loc[(slice(None), 'Lunch'), ('Female', slice(None))]

dfres.loc[idx[:, 'Dinner'], idx['Female', :]]
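To check that both notations select the same cells, here is a sketch on a made-up table with the same shape as the pivot table. The counts and the name df_counts are assumptions for illustration, and the levels are built in sorted order because, as far as I know, this kind of slicing expects a sorted MultiIndex:

```python
import pandas as pd

# Made-up counts with two row levels and two column levels
index = pd.MultiIndex.from_product(
    [['No', 'Yes'], ['Dinner', 'Lunch']], names=['smoker', 'time']
)
columns = pd.MultiIndex.from_product(
    [['Female', 'Male'], ['Sat', 'Sun']], names=['sex', 'day']
)
df_counts = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=index, columns=columns,
)

idx = pd.IndexSlice

# Rows where time is 'Dinner', columns where day is 'Sun'
with_idx = df_counts.loc[idx[:, 'Dinner'], idx[:, 'Sun']]

# The same selection written with slice(None) instead of the colon
with_slice = df_counts.loc[(slice(None), 'Dinner'), (slice(None), 'Sun')]

print(with_idx)
```

pd.IndexSlice only exists to let us write the colon inside the brackets; the selection itself is identical.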

`iloc` (integer-location) property

dfres

As always, we can select by the position of the values with the iloc property:

dfres.iloc[:2, :2]

dfres.iloc[:2, 2:]

DataFrame with DateTimeIndex

Now, we will use a DataFrame that has a DateTimeIndex:

df_tsla = pd.read_excel('tsla_stock.xlsx', index_col=0)
df_tsla

`loc` (location) property

We can select parts of the DataFrame based on just one part of the DateTimeIndex. For example, we can select everything from the year 2020 and move forward:

df_tsla.loc['2020':]

Until the last day of 2020:

df_tsla.loc[:'2020']

Between two years:

df_tsla.loc['2019':'2020']

One complete year:

df_tsla.loc['2019']

We can even select a specific year-month:

df_tsla.loc['2019-06']
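You can verify how much each selection returns with made-up daily data, where the number of rows is easy to predict (the values and the name df_daily are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up daily values from 2018 to 2020
dates = pd.date_range('2018-01-01', '2020-12-31', freq='D')
df_daily = pd.DataFrame({'Close': np.arange(len(dates))}, index=dates)

june_2019 = df_daily.loc['2019-06']       # every day of June 2019
from_2020 = df_daily.loc['2020':]         # from 2020-01-01 onwards
two_years = df_daily.loc['2019':'2020']   # both years are included in full

print(len(june_2019), len(from_2020), len(two_years))  # 30 366 731
```

Note that, unlike integer slices, both ends of a label slice are included.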

`iloc` (integer-location) property

Of course, we can also select parts of the DataFrame based on the position of the values with iloc:

df_tsla.iloc[:4, :3]

df_tsla.iloc[-4:, :3]

How Edo Guida got a Python job in three months with no programming background

Jesús López — Sat, 10 Sep 2022 14:11:18 GMT

Skilled Python Professionals' Demand

Take a look at this article to understand why Python is the programming language of the present and the future.

Okay, you've already got a reason to learn Python: you will have more job opportunities because Python-related job offers will keep growing over the coming years.

As the previous article states, you may end up in a job earning 65k a year; with that salary, you may not need to worry so much about money when making decisions as you move forward in life!

IT Jobs Watch, a website that specialises in collating salary data across the IT industry, states that the median annual salary in the UK for a role requiring Python skills is £65,000.

Before getting there, keep your feet on the ground because you need to master Python.

Get a Job to Learn Python

Don't think you need a Computer Science degree or a Data Science master's to master Python before getting the job.

The best motivation to learn anything is being paid for it. Therefore, prioritise getting a job where you can grow your Python skills.

Companies care about getting shit done. Therefore, you must show them what you know about programming and how you program.

Complete online courses to show what you know with certificates
1. Solve their assignments
2. Get your data and experiment with the learnt concepts
Showcase your knowledge on GitHub to show how you program (you may look at Edo's profile to see his portfolio)

Edo followed our advice and got a job in two months. He applied to around a hundred job offers on LinkedIn where recruiters could see his certifications.

Where to start

Make it easy at the beginning with easy-to-understand Python code.

Some people use scripts to code. It'd help if you turned to the notebook format because you can see the output of every line right away. Follow this tutorial to install Jupyter Lab, the best program to work with notebooks, and write your first lines of Python code.

I have found Data Visualization to be the best starting topic because you immediately see how the output changes as you change the code. It gives you a massive dose of energy.

Follow this Data Visualization tutorial to get a complete overview of Data Visualization development in Python. Then, play around with the lines of code: add more data points to the plots or change the colour of the figures.

Once you are motivated and comfortable using Python, it is time to follow a proper learning path.

Roadmap

You can follow any roadmap, but please make sure you don't overestimate your skills and start developing Neural Networks if you don't know how to create a simple Linear Regression.

You can follow Edo's roadmap by looking at his certifications:

You can also read the following thread, where I placed links to practical exercises you can use in your portfolio.

https://twitter.com/jsulopzs/status/1521115468429348864

#05 | The k-Means & Unsupervised Clustering Models

Jesús López — Tue, 06 Sep 2022 08:40:01 GMT

Challenge Importance

The time has come to add another layer to the hierarchy of Machine Learning models.

Do we have the variable we want to predict in the dataset?

YES: Supervised Learning

Predicting a Numerical Variable: Regression
Predicting a Categorical Variable: Classification

NO: Unsupervised Learning

Group Data Points based on Explanatory Variables: Cluster Analysis

We may have, for example, all football players, and we want to group them based on their performance. But we don't know the groups beforehand. So what do we do then?

We apply Unsupervised Machine Learning models to group the players based on their position in the space (determined by the explanatory variables): the closer the players are in the space, the more likely they'll be placed in the same group.

Another typical example comes from e-commerce sites that don't know whether their customers like clothing or tech but do know how they interact with the website. Therefore, they group the customers to send promotional emails that align with their interests.

In short, we close the circle with the different types of Machine Learning models by adding this new type.

Let's now develop the Python code.

Load the Data

Imagine for a second you are the President of the United States of America, and you are considering creating campaigns to reduce car accidents caused by alcohol consumption while also considering insurance companies' losses (the two columns).

You won't create 51 TV campaigns, one for each of the States of the USA (rows). Instead, you will see which States behave similarly and cluster them into three groups.

import seaborn as sns
import pandas as pd

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'ins_losses']]
df_crashes

Data Preprocessing

Missing Data

We don't have any missing data in any of the columns:

df_crashes.isna().sum()

alcohol       0
ins_losses    0
dtype: int64

Dummy Variables

Nor do we need to convert categorical columns to dummy variables because the two columns we are considering are numerical.

df_crashes

How do we compute a k-Means Model in Python?

We should know from previous chapters that we need a function accessible from a Class in the library sklearn.

Import the Class

from sklearn.cluster import KMeans

Instantiate the Class

Create an instance (a copy of the original code blueprint) so that we don't "modify" the source code.

model_km = KMeans()

Access the Function

The theoretical action we'd like to perform is the one we executed in previous chapters. Therefore, the function to compute the Machine Learning model should be called the same way:

model_km.fit()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [6], in ()
----> 1 model_km.fit()

TypeError: fit() missing 1 required positional argument: 'X'

The previous types of models asked for two parameters:

y: target ~ dependent ~ label ~ class variable
X: explanatory ~ independent ~ feature variables

Why is it asking for just one parameter now, X?

As we said before, this type of model (unsupervised learning) doesn't know the groups beforehand; they are only known after we compute the Machine Learning model. Therefore, it doesn't need a target variable y.

Separate the Variables

We don't need to separate the variables because we only have explanatory ones.

Fit the Model

model_km.fit(X=df_crashes)

KMeans()

Predictions

Calculate Predictions

We have a fitted KMeans. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_km.predict(X=df_crashes)

array([7, 3, 0, 7, 6, 7, 6, 1, 3, 7, 7, 5, 4, 7, 0, 5, 3, 3, 2, 4, 2, 3, 1, 3, 1, 7, 4, 5, 7, 5, 1, 5, 1, 3, 0, 3, 6, 0, 1, 1, 5, 4, 1, 1, 0, 0, 1, 0, 1, 0, 5], dtype=int32)

We wanted to calculate three groups, but the model is calculating eight, which is the default value of n_clusters. Let's modify this hyperparameter of the KMeans model:

model_km = KMeans(n_clusters=3)
model_km.fit(X=df_crashes)
model_km.predict(X=df_crashes)

array([0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 2, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1], dtype=int32)
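As a quick check of where the eight groups come from, the following sketch fits KMeans on synthetic points (the random data, n_init and random_state values are assumptions chosen for reproducibility, not part of the original tutorial) and compares the default with n_clusters=3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data: 60 points with 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))

# Default hyperparameters vs. an explicit number of clusters
model_default = KMeans(n_init=10, random_state=0).fit(X_demo)
model_three = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

print(model_default.n_clusters)       # the default number of groups
print(len(set(model_three.labels_)))  # groups found when we ask for three
```

Note that the group numbers themselves (0, 1, 2) are arbitrary labels; only which points share a label matters.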

Add a New Column with the Predictions

Let's create a new DataFrame to keep the original dataset untouched:

df_pred = df_crashes.copy()

And add the predictions:

df_pred['pred_km'] = model_km.predict(X=df_crashes)
df_pred

How can we see the groups in the plot?

Model Visualization & Interpretation

Can you observe that the k-Means only considers the variable ins_losses to determine the group the point belongs to? Why?

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                palette='Set1', data=df_pred);

The model measures the distance between the points. They seem to be evenly spread in both directions, but they aren't: the plot stretches each axis to its own range, so it doesn't place the points in perspective (it's lying to us).

KMeans Algorithm

Take a look at the following video to understand how the KMeans algorithm computes the Mathematical Equation by calculating distances:

https://www.youtube.com/watch?v=4b5d3muPQmA

The model understands the data as follows:

import matplotlib.pyplot as plt

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                palette='Set1', data=df_pred)
plt.xlim(0, 200)
plt.ylim(0, 200);

Now it's evident why the model only took ins_losses into account: the distances within alcohol are tiny compared to the distances within ins_losses.

In other words, with a metaphor: gaining one kilogram of weight is not the same as gaining one meter of height.

Then, how can we create a KMeans model that compares the two variables equally?

We need to scale the data (i.e., transforming the values into the same range: from 0 to 1) with the MinMaxScaler.
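Before using the class, it may help to see the rescaling written by hand. This is a minimal sketch with invented numbers (the values and the helper name min_max are assumptions for illustration); it applies the usual min-max formula, (x - min) / (max - min), to each variable separately:

```python
import numpy as np

# Invented values on very different numerical ranges
alcohol = np.array([4.0, 5.0, 6.0, 10.0])
ins_losses = np.array([80.0, 120.0, 160.0, 200.0])

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())

alcohol_scaled = min_max(alcohol)
ins_losses_scaled = min_max(ins_losses)

print(alcohol_scaled)
print(ins_losses_scaled)
```

After the transformation, both variables live in the range from 0 to 1, so a difference in one weighs as much as a difference in the other.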

`MinMaxScaler()` the data

As with any other algorithm within the sklearn library, we need to:

Import the Class
Create the instance
fit() the numbers of the mathematical equation
predict/transform the data with the mathematical equation

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_crashes)
data = scaler.transform(df_crashes)
data[:5]

array([[0.47921847, 0.55636883],
       [0.34718769, 0.45684192],
       [0.42806394, 0.24636258],
       [0.50100651, 0.5323574 ],
       [0.20923623, 0.73980184]])

To better understand the information, let's convert the array into a DataFrame:

df_scaled = pd.DataFrame(data, columns=df_crashes.columns, index=df_crashes.index)
df_scaled

k-Means Model with Scaled Data

Fit the Model

model_km.fit(X=df_scaled)

KMeans(n_clusters=3)

Predictions

Calculate Predictions

We have a fitted KMeans. Therefore, we should be able to apply the mathematical equation to the scaled data to get the predictions:

model_km.predict(X=df_scaled)

array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 0, 1, 0, 0], dtype=int32)

Add a New Column with the Predictions

df_pred['pred_km_scaled'] = model_km.predict(X=df_scaled)
df_pred

Model Visualization & Interpretation

We can observe now that both alcohol and ins_losses are taken into account by the model to calculate the cluster a point belongs to.

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                palette='Set1', data=df_pred);

Takeaway

From now on, we should understand that every time a model calculates distances between points whose variables have different numerical ranges, we need to scale the data to compare them properly.
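One way to avoid forgetting the scaling step is to chain it with the model. The sketch below is an alternative to the step-by-step approach used in this tutorial, not a replacement for it; it runs on synthetic data, and the names X_demo and pipe_km as well as the n_init and random_state values are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: two features on very different ranges
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(1, 10, size=51),
    rng.uniform(80, 200, size=51),
])

# The pipeline scales the data first and then clusters the scaled values
pipe_km = make_pipeline(
    MinMaxScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe_km.fit_predict(X_demo)
print(labels[:10])
```

With the pipeline, calling fit_predict() on the raw data is enough: the scaler is applied automatically before the distances are calculated.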

The following figure gives an overview of everything that has happened so far:

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 7))

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                data=df_pred, palette='Set1', ax=ax1)
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
                data=df_scaled, palette='Set1', ax=ax2)
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km',
                data=df_pred, palette='Set1', ax=ax3)
sns.scatterplot(x='alcohol', y='ins_losses', hue=df_pred.pred_km_scaled,
                data=df_scaled, palette='Set1', ax=ax4)

ax3.set_xlim(0, 200)
ax3.set_ylim(0, 200)
ax4.set_xlim(0, 200)
ax4.set_ylim(0, 200)

ax1.set_title('KMeans w/ Original Data & Liar Plot')
ax2.set_title('KMeans w/ Scaled Data & Perspective Plot')
ax3.set_title('KMeans w/ Original Data & Perspective Plot')
ax4.set_title('KMeans w/ Scaled Data & Perspective Plot')

plt.tight_layout()

Other `Clustering` Models in Python

Visit the sklearn website to see how many different clustering methods there are and how they differ from each other.

Let's pick two new models and compute them:

Agglomerative Clustering

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

Fit the Model

from sklearn.cluster import AgglomerativeClustering

model_ac = AgglomerativeClustering(n_clusters=3)
model_ac.fit(df_scaled)

AgglomerativeClustering(n_clusters=3)

Calculate Predictions

model_ac.fit_predict(X=df_scaled)

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1])

Create a New Column for the Predictions

df_pred['pred_ac'] = model_ac.fit_predict(X=df_scaled)
df_pred

Visualize the Model

We can observe how the second group takes three points with the Agglomerative Clustering, while the KMeans gathers five points in the second group.

As they are different algorithms, they are expected to produce different results. If you'd like to understand which model you should use, you need to know how each algorithm works. We don't explain it in this series because we want to keep it simple.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                data=df_pred, palette='Set1', ax=ax1)
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
                data=df_pred, palette='Set1', ax=ax2)

ax1.set_title('KMeans')
ax2.set_title('Agglomerative Clustering');
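Besides looking at the plots, we could quantify how much two clusterings agree. The sketch below uses adjusted_rand_score from sklearn on invented label lists (the labels are assumptions for illustration); the score ignores how the groups are numbered and only looks at which points end up together:

```python
from sklearn.metrics import adjusted_rand_score

# Invented cluster labels for six points
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same grouping, different numbering
labels_c = [0, 0, 1, 2, 2, 2]  # one point moved to another group

same_partition = adjusted_rand_score(labels_a, labels_b)
moved_point = adjusted_rand_score(labels_a, labels_c)

print(same_partition)
print(moved_point)
```

With the objects of this tutorial, the equivalent call would be adjusted_rand_score(df_pred['pred_km_scaled'], df_pred['pred_ac']).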

Spectral Clustering

We follow the same procedure as for any Machine Learning model from the Scikit-Learn library:

Fit the Model

from sklearn.cluster import SpectralClustering

model_sc = SpectralClustering(n_clusters=3)
model_sc.fit(df_scaled)

SpectralClustering(n_clusters=3)

Calculate Predictions

model_sc.fit_predict(X=df_scaled)

array([0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 1, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 0, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 2], dtype=int32)

Create a New Column for the Predictions

df_pred['pred_sc'] = model_sc.fit_predict(X=df_scaled)
df_pred

Visualize the Model

Let's visualize all the models together and appreciate the minor differences in how they cluster the points.

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

ax1.set_title('KMeans')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_km_scaled',
                data=df_pred, palette='Set1', ax=ax1)

ax2.set_title('Agglomerative Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac',
                data=df_pred, palette='Set1', ax=ax2)

ax3.set_title('Spectral Clustering')
sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_sc',
                data=df_pred, palette='Set1', ax=ax3);

Takeaway

Once again, you don't need to know the maths behind every Machine Learning model to build them. However, I hope you are getting a sense of the patterns behind the Scikit-Learn library with this series of tutorials.

Use Case Conclusion

Let's arbitrarily choose the Agglomerative Clustering as our model and get back to you being the President of the USA. How would you describe the groups?

Higher ins_losses and lower alcohol
Lower ins_losses and lower alcohol
Lower ins_losses and higher alcohol

sns.scatterplot(x='alcohol', y='ins_losses', hue='pred_ac', data=df_pred, palette='Set1');
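A numerical complement to the plot is the average of each variable per group. The following sketch uses a tiny invented DataFrame (df_demo and its values are assumptions, not the real States) with the same column names as df_pred:

```python
import pandas as pd

# Invented rows that mimic the structure of df_pred
df_demo = pd.DataFrame({
    'alcohol':    [4.0, 4.5, 5.0, 9.0, 9.5, 4.2],
    'ins_losses': [150.0, 160.0, 90.0, 100.0, 95.0, 85.0],
    'pred_ac':    [0, 0, 1, 2, 2, 1],
})

# Average alcohol and ins_losses within each predicted group
profile = df_demo.groupby('pred_ac')[['alcohol', 'ins_losses']].mean()
print(profile)
```

Reading the table row by row gives a description of each group, such as "higher ins_losses and lower alcohol".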

You would create a different TV campaign message for each of the three groups and avoid deploying many more resources to develop fifty-one different TV campaigns (one for each State), which doesn't make sense because many of the States are similar.

#05 | DateTime Object's Potential within Pandas, a Python Library

Jesús López — Sat, 27 Aug 2022 21:46:02 GMT

Possibilities

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset containing information on the Tesla Stock daily (rows) transactions (columns) in the Stock Market.

import pandas as pd

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/tsla_stock.csv'
df_tesla = pd.read_csv(url, index_col=0, parse_dates=['Date'])
df_tesla

You may calculate the .mean() of each column by the last Business day of each Month (BM):

df_tesla.resample('BM').mean()

Or the Weekly Average:

df_tesla.resample('W-FRI').mean()

And many more; see the full list of offset aliases in the pandas documentation.

Pretty straightforward compared to other libraries and programming languages.

It's no coincidence they say Python is the language of the future: its libraries simplify many operations for which most people believe they would have needed a for loop.
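If you'd like to try resampling without downloading anything, here is a minimal sketch on a synthetic Series (the dates and prices are invented, and the aliases 'W-FRI' and 'MS' were chosen only because, to my knowledge, they behave the same across pandas versions):

```python
import pandas as pd

# Ten invented business-day prices, starting on Monday 2022-01-03
idx = pd.date_range('2022-01-03', periods=10, freq='B')
prices = pd.Series(range(10), index=idx, name='Close', dtype='float64')

# Weekly average, with each week ending on Friday
weekly = prices.resample('W-FRI').mean()

# Monthly average, labelled with the first day of the month
monthly = prices.resample('MS').mean()

print(weekly)
print(monthly)
```

As far as I know, recent pandas releases rename some end-of-period aliases (for example, 'M' becomes 'ME' and 'BM' becomes 'BME'), so you may see a deprecation warning when running the original lines on a newer version.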

Let's apply other pandas techniques to the DateTime object:

df_tesla['year'] = df_tesla.index.year
df_tesla['month'] = df_tesla.index.month

The following values represent the average Close price by each month-year combination:

df_tesla.pivot_table(index='year', columns='month', values='Close', aggfunc='mean').round(2)

We could even style it to get a better insight by colouring the cells:

df_stl = df_tesla.pivot_table(
    index='year',
    columns='month',
    values='Close',
    aggfunc='mean',
    fill_value=0
).style.format('{:.2f}').background_gradient(axis=1)
df_stl

And we can represent the volatility with the standard deviation:

df_stl = df_tesla.pivot_table(
    index='year',
    columns='month',
    values='Close',
    aggfunc='std',
    fill_value=0
).style.format('{:.2f}').background_gradient(axis=1)
df_stl

In this article, we'll dig into the details of the pandas DateTime-related objects in Python to build the knowledge required to come up with awesome calculations like the ones we saw above.

First, let's reload the dataset to start from the basics.

df_tesla = pd.read_csv(url, parse_dates=['Date'])
df_tesla

Series DateTime

An essential part of learning something is practising it and studying counterexamples, where we understand the errors.

Let's start from the basics to understand the importance of the DateTime object and how to work with it. So, out of all the columns in the DataFrame, we'll now focus on Date:

df_tesla.Date

0      2017-01-03
1      2017-01-04
          ...
1378   2022-06-24
1379   2022-06-27
Name: Date, Length: 1380, dtype: datetime64[ns]

What information could we get from a DateTime object?

We may think we can get the month, but it turns out we can't in the following manner:

df_tesla.Date.month

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [53], in <cell line: 1>()
----> 1 df_tesla.Date.month

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'month'

Programming exists to simplify our lives, not make them harder.

If you think there must be a simpler way to perform a certain operation, someone has probably already developed it. Therefore, don't assume every problem needs a complex solution and rush towards a for loop, for example; proceed through trial and error without losing hope.

In short, we need to go through the dt accessor to reach the DateTime functions of a Series:

df_tesla.Date.dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x16230a2e0>

Process the Month

df_tesla.Date.dt.month

0       1
1       1
       ..
1378    6
1379    6
Name: Date, Length: 1380, dtype: int64

We can use more elements than just .month:

Process the Month Name

df_tesla.Date.dt.month_name()

0       January
1       January
         ...
1378       June
1379       June
Name: Date, Length: 1380, dtype: object

Process the Year, Week & Day

df_tesla.Date.dt.isocalendar()

Process the Quarter

df_tesla.Date.dt.quarter

0       1
1       1
       ..
1378    2
1379    2
Name: Date, Length: 1380, dtype: int64

Process the Year-Month for each Date

df_tesla.Date.dt.to_period('M')

0       2017-01
1       2017-01
         ...
1378    2022-06
1379    2022-06
Name: Date, Length: 1380, dtype: period[M]

Process the Weekly Period for each Date

df_tesla.Date.dt.to_period('W-FRI')

0       2016-12-31/2017-01-06
1       2016-12-31/2017-01-06
                ...
1378    2022-06-18/2022-06-24
1379    2022-06-25/2022-07-01
Name: Date, Length: 1380, dtype: period[W-FRI]
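The whole idea of this section fits in a few lines. The sketch below builds a two-row Series by hand (the variable name dates is an assumption; the two dates are the first and last ones of the dataset) and shows both the error and the working route through dt:

```python
import pandas as pd

# A small Series of datetimes, built by hand
dates = pd.Series(pd.to_datetime(['2017-01-03', '2022-06-27']))

# Asking the Series directly fails, as we saw above
try:
    dates.month
except AttributeError as error:
    print(error)

# Going through the dt accessor works
months = dates.dt.month.tolist()
month_names = dates.dt.month_name().tolist()
year_months = dates.dt.to_period('M').astype(str).tolist()

print(months, month_names, year_months)
```

The dt accessor is only needed for a Series; a DatetimeIndex exposes .month, .year and the rest directly, which is why df_tesla.index.year worked earlier.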

Time Zones

Pandas contains functionality that allows us to attach Time Zones to the objects, which eases working with data from different countries and regions.

Before getting deeper into Time Zones, we need to set the Date as the index (rows) of the DataFrame:

df_tesla.set_index('Date', inplace=True)
df_tesla

We can tell Python the DateTimeIndex of the DataFrame comes from Madrid:

df_tesla.index = df_tesla.index.tz_localize('Europe/Madrid')
df_tesla

And change it to another Time Zone, like Moscow:

df_tesla.index.tz_convert('Europe/Moscow')

DatetimeIndex(['2017-01-03 02:00:00+03:00', '2017-01-04 02:00:00+03:00',
               '2017-01-05 02:00:00+03:00', '2017-01-06 02:00:00+03:00',
               ...
               '2022-06-22 01:00:00+03:00', '2022-06-23 01:00:00+03:00',
               '2022-06-24 01:00:00+03:00', '2022-06-27 01:00:00+03:00'],
              dtype='datetime64[ns, Europe/Moscow]', name='Date', length=1380, freq=None)

We could have applied the transformation in the DataFrame object itself:

df_tesla.tz_convert('Europe/Moscow')

We can observe the hour has changed accordingly.

The Pandas Time Zone functionality is useful for combining timed data from different regions around the globe.
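To see the difference between the two functions in isolation, here is a minimal sketch with two hand-picked dates (the variable names are assumptions): tz_localize() attaches a Time Zone to naive timestamps, while tz_convert() expresses the same instants in another Time Zone:

```python
import pandas as pd

# Two naive timestamps: one in winter, one in summer
idx = pd.DatetimeIndex(['2022-01-03', '2022-06-27'])

# Declare that these midnights were measured in Madrid
madrid = idx.tz_localize('Europe/Madrid')

# Express the same instants on Moscow clocks
moscow = madrid.tz_convert('Europe/Moscow')

print(madrid)
print(moscow)
```

The winter date shifts by two hours and the summer date by one, because Madrid observes daylight saving time and Moscow doesn't; the instants themselves are unchanged.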

Summarising the Dates

To summarise, for example, the information of daily operations into months, we can apply different functions with each one having its unique ability (it's up to you to select the one that suits your needs):

.groupby()
.resample()
.pivot_table()

Let's show some examples:

Groupby

df_tesla.groupby(by=df_tesla.index.year).Volume.sum()

Date
2017     7950157000
2018    10808194000
2019    11540242000
2020    19052912400
2021     6902690500
2022     3407576732
Name: Volume, dtype: int64

The function .groupby() packs the rows of the same year:

df_tesla.groupby(by=df_tesla.index.year)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1622eecd0>

We then summarise the total volume in each pack, as we saw before.

An easier way?

Resample

df_tesla.Volume.resample('Y').sum()

Date
2017-12-31 00:00:00+01:00     7950157000
2018-12-31 00:00:00+01:00    10808194000
2019-12-31 00:00:00+01:00    11540242000
2020-12-31 00:00:00+01:00    19052912400
2021-12-31 00:00:00+01:00     6902690500
2022-12-31 00:00:00+01:00     3407576732
Freq: A-DEC, Name: Volume, dtype: int64

We first select the column to which we want to apply the operation:

df_tesla.Volume

Date
2017-01-03 00:00:00+01:00    29616500
2017-01-04 00:00:00+01:00    56067500
                               ...
2022-06-24 00:00:00+02:00    31866500
2022-06-27 00:00:00+02:00    21237332
Name: Volume, Length: 1380, dtype: int64

Then we apply the .resample() function, which takes a Date Offset that determines how to aggregate the DateTimeIndex. In this example, we aggregate by year 'Y':

df_tesla.Volume.resample('Y')

<pandas.core.resample.DatetimeIndexResampler object at 0x16230abe0>

And apply mathematical operations to the aggregated objects separately as we saw before:

df_tesla.Volume.resample('Y').sum()

Date
2017-12-31 00:00:00+01:00     7950157000
2018-12-31 00:00:00+01:00    10808194000
2019-12-31 00:00:00+01:00    11540242000
2020-12-31 00:00:00+01:00    19052912400
2021-12-31 00:00:00+01:00     6902690500
2022-12-31 00:00:00+01:00     3407576732
Freq: A-DEC, Name: Volume, dtype: int64

We could have also calculated the .sum() for all the columns if we didn't select just the Volume:

df_tesla.resample('Y').sum()

As always, we should strive to represent the information in the clearest manner for anyone to understand. Therefore, we could even visualize the aggregated volume by year with two more words:

df_tesla.Volume.resample('Y').sum().plot.bar();
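To check that .groupby() and .resample() really summarise the same information, we can compare them on a tiny synthetic Series (the dates and volumes below are invented, and the 'YS' alias, year start, is an assumption chosen because it should work across pandas versions):

```python
import pandas as pd

# Four invented daily volumes that span two calendar years
idx = pd.date_range('2020-12-30', periods=4, freq='D')
volume = pd.Series([10, 20, 30, 40], index=idx, name='Volume')

# Pack the rows by the year of the index and add them up
by_groupby = volume.groupby(volume.index.year).sum()

# Aggregate by yearly periods and add them up
by_resample = volume.resample('YS').sum()

print(by_groupby)
print(by_resample)
```

The totals are identical; the difference is in the index: .groupby() keeps plain integers (2020, 2021), whereas .resample() keeps timestamps, which is handy for plotting along a time axis.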

Let's now try different Date Offsets:

Monthly

df_tesla.Volume.resample('M').sum()

Date
2017-01-31 00:00:00+01:00    503398000
2017-02-28 00:00:00+01:00    597700000
                               ...
2022-05-31 00:00:00+02:00    649407200
2022-06-30 00:00:00+02:00    572380932
Freq: M, Name: Volume, Length: 66, dtype: int64

df_tesla.Volume.resample('M').sum().plot.line();

Weekly

df_tesla.Volume.resample('W').sum()

Date
2017-01-08 00:00:00+01:00    142882000
2017-01-15 00:00:00+01:00    105867500
                               ...
2022-06-26 00:00:00+02:00    141234200
2022-07-03 00:00:00+02:00     21237332
Freq: W-SUN, Name: Volume, Length: 287, dtype: int64

df_tesla.Volume.resample('W').sum().plot.area();

df_tesla.Volume.resample('W-FRI').sum()

Date
2017-01-06 00:00:00+01:00    142882000
2017-01-13 00:00:00+01:00    105867500
                               ...
2022-06-24 00:00:00+02:00    141234200
2022-07-01 00:00:00+02:00     21237332
Freq: W-FRI, Name: Volume, Length: 287, dtype: int64

df_tesla.Volume.resample('W-FRI').sum().plot.line();

Quarterly

df_tesla.Volume.resample('Q').sum()

Date
2017-03-31 00:00:00+02:00    1636274500
2017-06-30 00:00:00+02:00    2254740000
                                ...
2022-03-31 00:00:00+02:00    1678802000
2022-06-30 00:00:00+02:00    1728774732
Freq: Q-DEC, Name: Volume, Length: 22, dtype: int64

df_tesla.Volume.resample('Q').sum().plot.bar();

Pivot Table

We can also use Pivot Tables to summarise the information and represent it more nicely:

df_res = df_tesla.pivot_table(
    index=df_tesla.index.month,
    columns=df_tesla.index.year,
    values='Volume',
    aggfunc='sum'
)
df_res

And even apply some style to get more insight into the DataFrame:

df_tesla['Volume_M'] = df_tesla.Volume/1_000_000

dfres = df_tesla.pivot_table(
    index=df_tesla.index.month,
    columns=df_tesla.index.year,
    values='Volume_M',
    aggfunc='sum'
)

df_stl = dfres.style.format('{:.2f}').background_gradient('Reds', axis=1)
df_stl

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

#04 | Overfitting & Hyperparameter Tuning with Cross Validation

Jesús López — Fri, 26 Aug 2022 18:10:36 GMT

Chapter Importance

We have already covered:

Regression Models
Classification Models
Train Test Split for Model Selection

In short, we have computed all possible types of models to predict numerical and categorical variables with Regression and Classification models, respectively.

However, computing one model is not enough: we need to compare different models to choose the one whose predictions are the closest to reality.

Nevertheless, we cannot evaluate the model on the same data we used to .fit() (train) the mathematical equation (model). Therefore, we need to separate the data into train and test sets; the former to train the model, the latter to evaluate it.

We now add an extra layer of complexity because we can improve a model (an algorithm) by configuring its hyperparameters. This chapter is about computing different combinations of a single model's hyperparameters to get the best one.

Load the Data

The goal of this dataset is:
To predict whether a bank's customers (rows) will default next month
Based on their socio-demographic characteristics (columns)

import pandas as pd

pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_credit = pd.read_excel(io=url, header=1, index_col=0)
df_credit.sample(10)

Preprocess the Data

Missing Data

The function .fit() needs all the cells in the DataFrame to contain a value. Otherwise, it won't know how to process the row and compare it to others. A missing cell shows up as NaN, which means "Not a Number" (i.e., a cell for which we don't have any information).

df_credit.isna().sum()

LIMIT_BAL                     0
SEX                           0
                             ..
PAY_AMT6                      0
default payment next month    0
Length: 24, dtype: int64

df_credit.isna().sum().sum()

Dummy Variables

The function .fit() needs the values to be numeric. Otherwise, Python won't know the position along the axes at which to place the point.

Therefore, categories of the categorical columns will be transformed into new columns (one new column per category) and contain 1s and 0s depending on whether the person is or is not in the category.

Nevertheless, we don't need to create dummy variables because the data contains numerical variables only.
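Although we don't need it for this dataset, this is what the transformation would look like. The sketch below uses a small invented DataFrame (df_demo, its columns and its values are assumptions for illustration) and the pandas function get_dummies():

```python
import pandas as pd

# Invented customers with one numerical and one categorical column
df_demo = pd.DataFrame({
    'age': [25, 40, 31],
    'education': ['school', 'university', 'school'],
})

# One new column per category, filled with 1s and 0s
df_dummies = pd.get_dummies(df_demo, columns=['education'], dtype=int)
print(df_dummies)
```

Each row gets a 1 in the column of the category it belongs to and a 0 in the others, so every cell is now numeric.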

Feature Selection

So far, we have used the naming standard of target and features. Nevertheless, the most common standards on the Internet are X and y. Let's get used to it:

y = df_credit.iloc[:, -1]
X = df_credit.iloc[:, :-1]

Train Test Split

From the previous chapter, we should already know we need to separate the data into train and test if we want to evaluate the model's predictive capability for data we don't know yet.

In our case, we'd like to predict whether new credit card customers will default in the next month. As we don't have the data for the next month (it's the future), we need to apply the function train_test_split().

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

`DecisionTreeClassifier()` with Default Hyperparameters

To compute a Machine Learning model with the default hyperparameters, we apply the same procedure we have covered in previous chapters:

from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)

DecisionTreeClassifier()

Accuracy

As the scores below show, the model is almost perfect at predicting the training data (99.95% accuracy). Nevertheless, it is much worse at predicting the test data (about 73% accuracy). This gap tells us that the model is overfitting.

In `train` data

model_dt.score(X_train, y_train)

0.9995024875621891

In `test` data

model_dt.score(X_test, y_test)

0.7265656565656565

Model Visualization

I'll use the following visualization to explain the concept of overfitting.

from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

The tree is big because we have a lot of people in the training set (20,100), and we haven't set any limit on the model.

How many people do you think we have in the deepest leaf?

Very few, probably one.

Are these people characteristic of the overall data? Or are they infrequent?

Because they are infrequent and the model is very complex, the model overfits, and we get a vast difference between train and test accuracies.
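The same phenomenon can be reproduced on synthetic data. In this sketch, the dataset comes from make_classification with some label noise (the sizes, flip_y and random_state values are assumptions chosen for illustration), and we compare an unlimited tree with a shallow one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with 20% of the labels flipped at random
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# An unlimited tree memorises the training data; a shallow one cannot
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [('deep', deep), ('shallow', shallow)]:
    print(name, model.get_depth(),
          model.score(X_tr, y_tr), model.score(X_te, y_te))
```

The unlimited tree keeps splitting until every training row is classified correctly, noise included, which is exactly what it cannot repeat on rows it has never seen.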

`DecisionTreeClassifier()` with Custom Hyperparameters

Which hyperparameters can we configure for the Decision Tree algorithm?

In the output below, we can configure parameters such as max_depth, criterion and min_samples_leaf, among others.

model = DecisionTreeClassifier()
model.get_params()

{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}

Let's apply different random configurations to see how the model's accuracy changes in the train and test sets.

Please pay attention to how similar the two accuracies become when we reduce the model's complexity (we make the tree shorter and more general, so that each leaf captures more people).

And remember that we should pick a good configuration based on the test accuracy.

1st Configuration

model_dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=2, min_samples_leaf=150)

Accuracy

In `train` data

model_dt.score(X_train, y_train)

0.8186567164179105

In `test` data

model_dt.score(X_test, y_test)

0.8215151515151515

Model Visualization

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

2nd Configuration

model_dt = DecisionTreeClassifier(max_depth=3)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3)

Accuracy

In `train` data

model_dt.score(X_train, y_train)

0.8207960199004976

In `test` data

model_dt.score(X_test, y_test)

0.8222222222222222

Model Visualization

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

3rd Configuration

model_dt = DecisionTreeClassifier(max_depth=4)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=4)

Accuracy

In `train` data

model_dt.score(X_train, y_train)

0.8232338308457712

In `test` data

model_dt.score(X_test, y_test)

0.8205050505050505

Model Visualization

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

4th Configuration

model_dt = DecisionTreeClassifier(min_samples_leaf=100)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(min_samples_leaf=100)

Accuracy

In `train` data

model_dt.score(X_train, y_train)

0.8244278606965174

In `test` data

model_dt.score(X_test, y_test)

0.8161616161616162

Model Visualization

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

5th Configuration

model_dt = DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=7, min_samples_leaf=100)

Accuracy

In `train` data

model_dt.score(X_train, y_train)

0.8237313432835821

In `test` data

model_dt.score(X_test, y_test)

0.8177777777777778

Model Visualization

plot_tree(decision_tree=model_dt, feature_names=X_train.columns, filled=True);

We have similar results; the accuracy is around 82% on the test set when we configure a general model that doesn't have a considerable depth (unlike the first, unlimited tree).

But we should ask ourselves another question: can we automate this process of checking multiple combinations of hyperparameters?

Yes, and that's where Cross Validation comes in.

`GridSearchCV()` to find Best Hyperparameters

The Cross-Validation technique splits the training data into n folds (5 in the image below). Then, it computes each hyperparameter configuration n times, where each fold is taken as the validation set once while the remaining folds are used to train.

Consider that we .fit() a model as many times as the number of folds multiplied by the number of combinations we want to try.
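To see the folds in action, here is a minimal sketch with KFold on ten dummy rows (the data is invented, and KFold without shuffling is an assumption made so the splits are easy to read):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows standing in for the training data
X_demo = np.arange(10).reshape(-1, 1)
kfold = KFold(n_splits=5)

# Count how many times each row ends up in the validation fold
times_in_validation = np.zeros(len(X_demo), dtype=int)
for train_idx, valid_idx in kfold.split(X_demo):
    times_in_validation[valid_idx] += 1
    print('train:', train_idx, 'validation:', valid_idx)

print(times_in_validation)
```

Every row is used for validation exactly once and for training in the other four rounds, so one hyperparameter configuration costs five fits.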

Out of the Decision Tree hyperparameters:

model_dt = DecisionTreeClassifier()
model_dt.get_params()

We want to try the following combinations of max_depth (6), min_samples_leaf (7) and criterion (2):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy']
}

cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=1)

In total, we compute the function .fit() 420 times:

5*6*7*2

420

To compare 84 different combinations of the Decision Tree hyperparameters:

6*7*2
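Instead of multiplying by hand, we could let sklearn enumerate the combinations. This sketch uses ParameterGrid with the same grid as above (the variable names n_candidates and n_folds are assumptions):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy'],
}

# Every combination of one value per hyperparameter
grid = ParameterGrid(param_grid)
n_candidates = len(grid)
n_folds = 5

print(n_candidates)            # combinations to compare
print(n_candidates * n_folds)  # fits needed with 5 folds
print(list(grid)[0])           # the first combination, as a dictionary
```

Adding one more hyperparameter with, say, three values would triple both numbers, which is why grids grow expensive quickly.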

cv_dt.fit(X_train, y_train)

Fitting 5 folds for each of 84 candidates, totalling 420 fits

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 3, 4, 5, 10], 'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]}, verbose=1)

If we specify verbose=2, we will see how many fits we perform in the output:

cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid, cv=5, verbose=2)
cv_dt.fit(X_train, y_train)

Fitting 5 folds for each of 84 candidates, totalling 420 fits
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
[CV] END .criterion=gini, max_depth=None, min_samples_leaf=1; total time= 0.2s
...
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1600; total time= 0.1s

The best hyperparameter configuration is:

cv_dt.best_estimator_

DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=100)

To achieve accuracy on the test set of:

cv_dt.score(X_test, y_test)

0.8186868686868687

If we'd like to have the results of every configuration:

df_cv_dt = pd.DataFrame(cv_dt.cv_results_)
df_cv_dt
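The results table has many columns, so it may help to keep only a few and sort them by rank. The sketch below runs a small grid on synthetic data (the dataset, the grid and the names search and df_results are assumptions for illustration); the column names are the ones GridSearchCV stores in cv_results_:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small grid to keep the example fast
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, None], 'min_samples_leaf': [1, 20]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Keep the most informative columns and put the best configurations first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
df_results = pd.DataFrame(search.cv_results_)[columns]
df_results = df_results.sort_values('rank_test_score')
print(df_results)
```

Here mean_test_score is the average accuracy over the five validation folds, which is the number GridSearchCV uses to rank the configurations.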

Other Models

Now let's try to find the best hyperparameter configuration of other models, which don't have the same hyperparameters as the Decision Tree because their algorithm and mathematical equation are different.

Support Vector Machines `SVC()`

https://www.youtube.com/watch?v=efR1C6CvhmE

Before computing the Support Vector Machines model, we need to scale the data because this model compares the distance between the explanatory variables. Therefore, they all need to be on the same scale.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

We need to separate the data again to have the train and test sets with the scaled data:

X_norm_train, X_norm_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42)

The Support Vector Machines contain the following hyperparameters:

from sklearn.svm import SVC
sv = SVC()
sv.get_params()

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}

From which we want to try the following combinations:

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}
cv_sv = GridSearchCV(estimator=sv, param_grid=param_grid, verbose=2)
cv_sv.fit(X_norm_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
[CV] END ...............................C=0.1, kernel=linear; total time= 3.0s
...
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s
[CV] END ...................................C=10, kernel=rbf; total time= 5.3s

GridSearchCV(estimator=SVC(), param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, verbose=2)

Notice that some fits take more than 5 seconds, which becomes very expensive if we want to try thousands of combinations (as professionals do). Therefore, we should understand how the model's algorithm works inside to choose a hyperparameter grid that doesn't take too long to compute. Otherwise, we make the company spend a lot of money on computing power.

This tutorial dissects how the Support Vector Machines algorithm works inside.
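A back-of-the-envelope estimate makes the cost concrete. The helper below is hypothetical and assumes the fits run one after another at a constant 5 seconds each, ignoring the final refit on the whole training set:

```python
def estimated_hours(n_candidates, n_folds, seconds_per_fit):
    """Rough sequential cost of a grid search, ignoring the final refit."""
    return n_candidates * n_folds * seconds_per_fit / 3600

small = estimated_hours(6, 5, 5)      # the grid above: 30 fits
large = estimated_hours(1000, 5, 5)   # a hypothetical grid with 1,000 candidates
print(round(small, 2), round(large, 2))
```

Under these assumptions the small grid takes a couple of minutes, while the larger one would need several hours.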

The best hyperparameter configuration is:

cv_sv.best_estimator_

SVC(C=10)

It achieves the following accuracy on the test set:

cv_sv.score(X_norm_test, y_test)

0.8185858585858586

If we'd like to have the results of every configuration:

df_cv_sv = pd.DataFrame(cv_sv.cv_results_)
df_cv_sv

`KNeighborsClassifier()`

Now we'll compute another classification model: K Nearest Neighbours.

https://www.youtube.com/watch?v=HVXime0nQeI

We check for its hyperparameters:

from sklearn.neighbors import KNeighborsClassifier
model_kn = KNeighborsClassifier()
model_kn.get_params()

{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}

From which we choose the following combinations:

param_grid = {
    'leaf_size': [10, 20, 30, 50],
    'metric': ['minkowski', 'euclidean', 'manhattan'],
    'n_neighbors': [3, 5, 10, 20]
}
cv_kn = GridSearchCV(estimator=model_kn, param_grid=param_grid, verbose=2)
cv_kn.fit(X_norm_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.5s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time= 1.3s
...
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s
[CV] END .....leaf_size=50, metric=manhattan, n_neighbors=20; total time= 1.1s

GridSearchCV(estimator=KNeighborsClassifier(), param_grid={'leaf_size': [10, 20, 30, 50], 'metric': ['minkowski', 'euclidean', 'manhattan'], 'n_neighbors': [3, 5, 10, 20]}, verbose=2)

The best hyperparameter configuration is:

cv_kn.best_estimator_

KNeighborsClassifier(leaf_size=10, n_neighbors=20)

It achieves the following accuracy on the test set:

cv_kn.score(X_norm_test, y_test)

0.8185858585858586

If we'd like to have the results of every configuration:

df_cv_kn = pd.DataFrame(cv_kn.cv_results_)
df_cv_kn

Best Model with Best Hyperparameters

The best algorithm at its best is the Decision Tree Classifier:

dic_results = {
    'model': [
        cv_dt.best_estimator_,
        cv_sv.best_estimator_,
        cv_kn.best_estimator_
    ],
    'hyperparameters': [
        cv_dt.best_params_,
        cv_sv.best_params_,
        cv_kn.best_params_
    ],
    'score': [
        cv_dt.score(X_test, y_test),
        cv_sv.score(X_norm_test, y_test),
        cv_kn.score(X_norm_test, y_test)
    ]
}
df_cv_comp = pd.DataFrame(dic_results)
df_cv_comp.style.background_gradient()

#03 | Train Test Split for Model Selection

Jesús López — Wed, 24 Aug 2022 16:11:22 GMT

Ask him any doubt on Twitter or LinkedIn

Chapter Importance

Machine Learning models learn a mathematical equation from historical data.

Not all Machine Learning models predict the same way; some models are better than others.

We measure how good a model is by calculating its score (accuracy).

So far, we have calculated the model's score using the same data to fit (train) the mathematical equation. That's cheating. That's overfitting.

This tutorial compares 3 different models:

Decision Tree
Logistic Regression
Support Vector Machines

We validate the models in 2 different ways:

Using the same data during training
Using 30% of the data, which was not used during training

This demonstrates how the selection of the best model changes when we validate the models with data not used during training.

For example, the image below shows that the best model, when using the same data for validation, is the Decision Tree (accuracy of 0.86). Nevertheless, everything changes when the models are evaluated with data not used during training: the best model is then the Logistic Regression (accuracy of 0.85), whereas the Decision Tree only reaches 0.80.

If we were a bank that loses 1M USD for every 0.01 of accuracy lost, choosing the Decision Tree would have cost us 5M USD. This is something that happens in real life.

In short, banks are interested in models that predict well for new potential customers, not for historical customers who have already gotten a loan and whose repayment behaviour the bank already knows.

This tutorial shows you how to implement the train_test_split technique to reduce overfitting with a practical use case where we want to classify whether a person used the Internet or not.

Load the Data

Load the dataset from CIS, executing the following lines of code:

import pandas as pd
df_internet = pd.read_excel('https://github.com/jsulopzs/data/blob/main/internet_usage_spain.xlsx?raw=true', sheet_name=1, index_col=0)
df_internet

The goal of this dataset is
To predict internet_usage of people (rows)
Based on their socio-demographical characteristics (columns)

Preprocess the Data

We should already know from the previous chapter that the data may need to be preprocessed before passing it to the function that computes the mathematical equation.

Missing Data

The function .fit() needs all the cells in the DataFrame to contain a value. Otherwise, it won't know how to process the row and compare it to others. A missing value appears as NaN, which means "Not a Number" (i.e., a cell for which we don't have any information).

For example, if you miss John's age, you cannot place John in the space to compare with other people because the point might be anywhere.
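A tiny standard-library sketch of that idea: when one coordinate is missing, the distance between two people is itself undefined. The names and the second feature are hypothetical.

```python
import math

john = (float('nan'), 1)   # age unknown, plus one hypothetical second feature
mary = (30, 1)

# The distance inherits the missing value, so John cannot be compared to Mary
distance = math.dist(john, mary)
print(math.isnan(distance))
```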

df_internet.isna().sum()

internet_usage    0
sex               0
age               0
education         0
dtype: int64

Dummy Variables

The function .fit() needs the values to be numeric. Otherwise, Python won't know the position of the axes in which to allocate the point. For example, if you have Male and Female, at which distance do you separate them, and why? You cannot make an objective assessment unless you separate each category.

Therefore, categories of the categorical columns will be transformed into new columns (one new column per category) and contain 1s and 0s depending on whether the person is or is not in the category.

df_internet = pd.get_dummies(df_internet, drop_first=True)
df_internet
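The transformation can be sketched in plain Python. The helper dummies below is a hypothetical imitation of the idea, not how pandas implements it, and the education values are invented:

```python
# Hypothetical column with three categories
education = ['PhD', 'High School', 'PhD', 'University']

def dummies(values, drop_first=True):
    """One 0/1 column per category; optionally drop the first (alphabetical) one."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return {cat: [int(v == cat) for v in values] for cat in categories}

encoded = dummies(education)
print(encoded)
```

Dropping the first category loses no information: a row with 0 in every remaining column must belong to the dropped category ('High School' in this sketch).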

Feature Selection

Once we have preprocessed the data, we select the column we want to predict (target) and the columns we will use to explain the prediction (features/explanatory).

target = df_internet.internet_usage
features = df_internet.drop(columns='internet_usage')

Build & Compare Models' Scores

We should already know that the Machine Learning procedure is the same all the time:

Computing a mathematical equation: fit
To calculate predictions: predict
And compare them to reality: score

The only element that changes is the Class() that contains lines of code of a specific algorithm (DecisionTreeClassifier, SVC, LogisticRegression).
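To make the shared fit, predict, score procedure tangible, here is a toy class written from scratch. MajorityClassifier is hypothetical (it is not part of scikit-learn) and the data is invented; it only imitates the three method names that the real classes expose.

```python
from collections import Counter

class MajorityClassifier:
    """Toy model (not part of scikit-learn): always predicts the most common class."""

    def fit(self, X, y):
        # "Learning" here is just remembering the most frequent target value
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Accuracy: share of predictions that match reality
        predictions = self.predict(X)
        hits = sum(pred == real for pred, real in zip(predictions, y))
        return hits / len(y)

X = [[20], [35], [50], [65]]   # hypothetical ages
y = [1, 1, 1, 0]               # hypothetical internet usage

model = MajorityClassifier()
model.fit(X, y)
accuracy = model.score(X, y)
print(accuracy)
```

Swapping this class for DecisionTreeClassifier, SVC or LogisticRegression changes the algorithm inside fit, but not the three calls.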

`DecisionTreeClassifier()` Model in Python

from sklearn.tree import DecisionTreeClassifier
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)

0.859877800407332

`SVC()` Model in Python

from sklearn.svm import SVC
model_svc = SVC(probability=True)
model_svc.fit(X=features, y=target)
model_svc.score(X=features, y=target)

0.7837067209775967

`LogisticRegression()` Model in Python

from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X=features, y=target)
model_lr.score(X=features, y=target)

0.8334012219959267

Function to Automate Lines of Code

We kept repeating the same code:

model.fit()
model.score()

Why not turn the lines into a function() to automate the process?

calculate_accuracy(model_dt)
calculate_accuracy(model_svc)
calculate_accuracy(model_lr)

Each call would calculate the accuracy of one model.

Make a Procedure Sample for `DecisionTreeClassifier()`

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)

0.859877800407332

Automate the Procedure into a `function()`

Code Thinking

Think of the function's result
Store that object to a variable
return the result at the end
Indent the body of the function to the right
define the function():
Think of what's gonna change when you execute the function with different models
Locate the variable that you will change
Turn it into the parameter of the function()

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)

0.859877800407332

Distinguish the line that gives you the `result` you want and put it into a variable

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target) #new

Add a line with a `return` to tell the function the object you want in the end

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)
return result #new

Indent everything to the right

    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result

Define the function in the first line

def calculate_accuracy(): #new
    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result

What am I gonna change every time I run the function?

def calculate_accuracy(model_dt): #modified
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)
    return result

Generalize the name of the parameter

def calculate_accuracy(model): #modified
    model.fit(X=features, y=target) #modified
    result = model.score(X=features, y=target)
    return result

Add docstring

def calculate_accuracy(model):
    """
    This function calculates the accuracy for a given model as a parameter
    """
    model.fit(X=features, y=target)
    result = model.score(X=features, y=target)
    return result

calculate_accuracy(model_dt)

0.859877800407332

Calculate Models' Accuracies

`DecisionTreeClassifier()` Accuracy

calculate_accuracy(model_dt)

0.859877800407332

We shall create an empty dictionary that keeps track of every model's score to choose the best one later.

dic_accuracy = {}
dic_accuracy['Decision Tree'] = calculate_accuracy(model_dt)

`SVC()` Accuracy

dic_accuracy['Support Vector Machines'] = calculate_accuracy(model_svc)
dic_accuracy

{'Decision Tree': 0.859877800407332, 'Support Vector Machines': 0.7837067209775967}

`LogisticRegression()` Accuracy

dic_accuracy['Logistic Regression'] = calculate_accuracy(model_lr)
dic_accuracy

{'Decision Tree': 0.859877800407332, 'Support Vector Machines': 0.7837067209775967, 'Logistic Regression': 0.8334012219959267}

Which is the Best Model?

The Decision Tree is the best model with a score of 86%:

sr_accuracy = pd.Series(dic_accuracy).sort_values(ascending=False)
sr_accuracy

Decision Tree              0.859878
Logistic Regression        0.833401
Support Vector Machines    0.783707
dtype: float64

Let's suppose for a moment we are a bank to understand the importance of this chapter. A bank's business is, among other things, to give loans to people who can afford it.

However, the bank may make mistakes: giving loans to people who cannot afford them or not giving loans to people who can.

Let's imagine the bank loses 1M for each 1% of misclassification. As we chose the Decision Tree, the bank would lose about $14M, as the score of 86% suggests. Nevertheless, can we trust that score?

No, because we are cheating in the model's evaluation; we evaluated the models with the same data used for training. In other words, the bank is not interested in evaluating the model on the historical customers; they want to know how good the model is for new customers.

They cannot create new customers. What can they do then?

They separate the data into a train set (70% of customers) used to .fit() the mathematical equation and a test set (30% of customers) to evaluate the mathematical equation.

You can understand the problem better with the following analogy:

University Access Exams Analogy

Let's imagine:

You have a math exam on Saturday
Today is Monday
You want to calibrate your level in case you need to study more for the math exam
How do you calibrate your math level?
Well, you've got 100 questions X with 100 solutions y from past years' exams
You may study the 100 questions with 100 solutions fit(100questions, 100solutions)
Then, you may do a mock exam with the 100 questions predict(100questions)
And compare your_100solutions with the real_100solutions
You've got 90/100 correct answers accuracy in the mock exam
You think you are prepared for the maths exam
And when you do the real exam on Saturday, the mark is 40/100
Why? How could we have prevented this?
Solution: separate the 100 questions into 70 for train to study & 30 for test for the mock exam.
1. fit(70questions, 70answers)
2. your_30solutions = predict(30questions)
3. your_30solutions ?= 30solutions
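The analogy can be played out with a toy "student" that memorises answers instead of learning the rule. The Memorizer class and the exam data are hypothetical, written only to illustrate the point:

```python
class Memorizer:
    """Toy student who memorises answers instead of learning the rule."""

    def fit(self, questions, answers):
        self.memory_ = dict(zip(questions, answers))

    def predict(self, questions):
        # Questions never studied get a blank answer (None)
        return [self.memory_.get(q) for q in questions]

    def score(self, questions, answers):
        predicted = self.predict(questions)
        return sum(p == a for p, a in zip(predicted, answers)) / len(answers)

# Hypothetical exam: question i has answer i * 2
questions = list(range(100))
answers = [q * 2 for q in questions]

student = Memorizer()
student.fit(questions[:70], answers[:70])

mock_same = student.score(questions[:70], answers[:70])   # questions already studied
mock_new = student.score(questions[70:], answers[70:])    # questions never seen
print(mock_same, mock_new)
```

The mock exam on studied questions looks perfect, yet it says nothing about the real exam; only the 30 held-out questions reveal that.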

`train_test_split()` the Data

The documentation of the function contains a typical example.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=42)

What the heck is the function returning?

From all the data:

2455 rows
8 columns

df_internet

1718 rows (70% of all data) to fit the model
7 columns (X: features variables)

X_train

737 rows (30% of all data) to evaluate the model
7 columns (X: features variables)

X_test

1718 rows (70% of all data) to fit the model
1 columns (y: target variable)

y_train

name
Eileen     0
Lucinda    1
          ..
Corey      0
Robert     1
Name: internet_usage, Length: 1718, dtype: int64

737 rows (30% of all data) to evaluate the model
1 columns (y: target variable)

y_test

name
Thomas     0
Pedro      1
          ..
William    1
Charles    1
Name: internet_usage, Length: 737, dtype: int64
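The sizes reported in the outputs (Length: 1718 and Length: 737) can be reproduced with a small standard-library sketch. It assumes the test share is rounded up and the remainder goes to training, which matches the lengths shown; the shuffling below is only illustrative and will not select the same rows as scikit-learn.

```python
import math
import random

n_rows = 2455
test_size = 0.30

# Assumption: the test share is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Shuffle row positions reproducibly, then cut the list in two
rows = list(range(n_rows))
random.Random(42).shuffle(rows)
test_rows, train_rows = rows[:n_test], rows[n_test:]
print(n_train, n_test)
```

No row ends up in both sets, which is what guarantees that the test score is computed on data the model has never seen.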

`fit()` the model with Train Data

model_dt.fit(X_train, y_train)

DecisionTreeClassifier()

Compare the predictions with the real data

model_dt.score(X_test, y_test)

0.8046132971506106

Optimize All Models & Compare Again

Make a Procedure Sample for `DecisionTreeClassifier()`

model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
model_dt.score(X_test, y_test)

0.8032564450474898

Automate the Procedure into a `function()`

Code Thinking

Think of the function's result
Store that object to a variable
return the result at the end
Indent the body of the function to the right
define the function():
Think of what's gonna change when you execute the function with different models
Locate the variable that you will change
Turn it into the parameter of the function()

def calculate_accuracy_test(model):
    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)
    return result

Calculate Models' Accuracies

`DecisionTreeClassifier()` Accuracy

dic_accuracy_test = {}
dic_accuracy_test['Decision Tree'] = calculate_accuracy_test(model_dt)
dic_accuracy_test

{'Decision Tree': 0.8032564450474898}

`SVC()` Accuracy

dic_accuracy_test['Support Vector Machines'] = calculate_accuracy_test(model_svc)
dic_accuracy_test

{'Decision Tree': 0.8032564450474898, 'Support Vector Machines': 0.7788331071913162}

`LogisticRegression()` Accuracy

dic_accuracy_test['Logistic Regression'] = calculate_accuracy_test(model_lr)
dic_accuracy_test

{'Decision Tree': 0.8032564450474898, 'Support Vector Machines': 0.7788331071913162, 'Logistic Regression': 0.8548168249660787}

Which is the Best Model with `train_test_split()`?

The picture changes quite a lot: the bank is losing 20M due to the model we chose before, the Decision Tree, because its score on data that hasn't been seen during training (i.e., new customers) is 80%.

We should have chosen the Logistic Regression because it's the best model (85%) to predict new data and new customers.

In short, we lose 15M if we choose the Logistic Regression, which is better than the Decision Tree's loss of 20M. Those 5M can make a difference in my life 👀
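The arithmetic behind those figures can be written down as a short sketch. The helper loss_in_millions is hypothetical and encodes the example's assumption of 1M lost per 1% of misclassified customers, rounding to whole percentage points; the accuracies are the test scores reported above.

```python
def loss_in_millions(accuracy, cost_per_point=1):
    """Loss under the example's assumption: 1M per 1% of misclassified customers."""
    misclassified_pct = round((1 - accuracy) * 100)
    return misclassified_pct * cost_per_point

loss_dt = loss_in_millions(0.8032564450474898)   # Decision Tree on test data
loss_lr = loss_in_millions(0.8548168249660787)   # Logistic Regression on test data
saving = loss_dt - loss_lr
print(loss_dt, loss_lr, saving)
```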

sr_accuracy_test = pd.Series(dic_accuracy_test).sort_values(ascending=False)
sr_accuracy_test

Logistic Regression        0.854817
Decision Tree              0.803256
Support Vector Machines    0.778833
dtype: float64

df_accuracy = pd.DataFrame({
    'Same Data': sr_accuracy,
    'Test Data': sr_accuracy_test
})
df_accuracy.style.format('{:.2f}').background_gradient()

#04 | Data Visualization in Python

Jesús López — Tue, 02 Aug 2022 14:40:23 GMT

Possibilities

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information from countries (rows) considering socio-demographic and economic variables (columns).

import plotly.express as px
df_countries = px.data.gapminder()
df_countries

Python contains 3 main libraries for Data Visualization:

Matplotlib (Mathematical Plotting)
Seaborn (High-Level based on Matplotlib)
Plotly (Animated Plots)

I love plotly because the Visualizations are interactive; you may hover the mouse over the points to get information from them:

df_countries_2007 = df_countries.query('year == 2007')
px.scatter(data_frame=df_countries_2007, x='gdpPercap', y='lifeExp',
           color='continent', hover_name='country', size='pop')

You can even animate the plots with a simple parameter. Click on play

PS: The following example is taken from the official plotly library website:

px.scatter(df_countries, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])

In this article, we'll dig into the details of Data Visualization in Python to build up the required knowledge and develop awesome visualizations like the ones we saw before.

Matplotlib

Matplotlib is a library used for Data Visualization.

We use the sublibrary (module) pyplot from matplotlib library to access the functions.

import matplotlib.pyplot as plt

Let's make a bar plot:

plt.bar(x=['Real Madrid', 'Barcelona', 'Bayern Munich'],
        height=[14, 5, 6]);

We could have also done a point plot:

plt.scatter(x=['Real Madrid', 'Barcelona', 'Bayern Munich'],
            y=[14, 5, 6]);

But it doesn't make sense with the data we have represented.

Visualize DataFrame

Let's create a DataFrame:

teams = ['Real Madrid', 'Barcelona', 'Bayern Munich']
uefa_champions = [14, 5, 6]

import pandas as pd
df_champions = pd.DataFrame(data={'Team': teams,
                                  'UEFA Champions': uefa_champions})
df_champions

And visualize it using:

Matplotlib functions

plt.bar(x=df_champions['Team'],
        height=df_champions['UEFA Champions']);

DataFrame functions

df_champions.plot.bar(x='Team', y='UEFA Champions');

Seaborn

Let's read another dataset: the Football Premier League classification for 2021/2022.

df_premier = pd.read_excel(io='../data/premier_league.xlsx')
df_premier

We will visualize a point plot (from now on, scatter plot) to check if there is a relationship between the number of goals scored F and the points Pts.

import seaborn as sns
sns.scatterplot(x='F', y='Pts', data=df_premier);

Can we do the same plot with matplotlib plt library?

plt.scatter(x='F', y='Pts', data=df_premier);

What are the differences between them?

The points: matplotlib points are bigger than seaborn ones
The axis labels: matplotlib axis labels are non-existent, whereas seaborn places the names of the columns

From which library do the previous functions return the objects?

seaborn_plot = sns.scatterplot(x='F', y='Pts', data=df_premier);

matplotlib_plot = plt.scatter(x='F', y='Pts', data=df_premier);

type(seaborn_plot)

matplotlib.axes._subplots.AxesSubplot

type(matplotlib_plot)

matplotlib.collections.PathCollection

Why does seaborn return a matplotlib object?

Quoted from the seaborn official website:

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level* interface for drawing attractive and informative statistical graphics.

*High-level means the communication between humans and the computer is easier to understand than low-level communication, which goes through 0s and 1s.

Could you place the names of the teams in the points?

plt.scatter(x='F', y='Pts', data=df_premier)
for idx, data in df_premier.iterrows():
    plt.text(x=data['F'], y=data['Pts'], s=data['Team'])

It isn't straightforward.

Is there an easier way?

Yes, you may use an interactive plot with plotly library and display the name of the Team as you hover the mouse on a point.

Plotly

We use the express module within plotly library to access the functions of the plots:

import plotly.express as px
px.scatter(data_frame=df_premier, x='F', y='Pts', hover_name='Team')

Learn how to become an independent Data Analyst programmer who knows how to extract meaningful insights from Data Visualizations.

Types of Plots

Let's read another dataset: the sociological data of clients in a restaurant.

df_tips = sns.load_dataset(name='tips')
df_tips

One Column

Categorical Column

df_tips.sex

0      Female
1        Male
        ...
242      Male
243    Female
Name: sex, Length: 244, dtype: category
Categories (2, object): ['Male', 'Female']

We need to summarise the data first; we count how many Female and Male people are in the dataset.

df_tips.sex.value_counts()

Male      157
Female     87
Name: sex, dtype: int64

sr_sex = df_tips.sex.value_counts()

Barplot

Let's place bars equal to the number of people from each gender:

px.bar(x=sr_sex.index, y=sr_sex.values)

We can also colour the bars based on the category:

px.bar(x=sr_sex.index, y=sr_sex.values, color=sr_sex.index)

Pie plot

Let's put the same data into a pie plot:

px.pie(names=sr_sex.index, values=sr_sex.values, color=sr_sex.index)

Numerical Column

df_tips.total_bill

0      16.99
1      10.34
       ...
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

Histogram

Instead of observing the numbers, we can visualize the distribution of the bills in a histogram.

We can observe that most people pay between 10 and 20 dollars, whereas only a few pay between 40 and 50.

px.histogram(x=df_tips.total_bill)

We can also create a boxplot where the limits of the boxes indicate the 1st and 3rd quartiles.

The 1st quartile is 13.325, and the 3rd quartile is 24.175. Therefore, 50% of people were billed an amount between these limits.

Boxplot

px.box(x=df_tips.total_bill)
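The claim that half of the values lie between the 1st and 3rd quartiles can be checked with the standard library on a small hypothetical list of bills (not the tips dataset). Quartile conventions differ between tools, so the exact cut points here need not match the ones plotly draws.

```python
import statistics

# Hypothetical bills, not the tips dataset
bills = [10, 12, 14, 16, 18, 20, 22, 24]

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(bills, n=4, method='inclusive')

# Share of bills that fall inside the box, between q1 and q3
inside_box = [b for b in bills if q1 <= b <= q3]
share = len(inside_box) / len(bills)
print(q1, q3, share)
```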

Two Columns

df_tips[['total_bill', 'tip']]

Numerical & Numerical

We use a scatter plot to see if a relationship exists between two numerical variables.

Do the points go up as you move the eyes from left to right?

As you may observe in the following plot: the higher the amount of the bill, the higher the tip the clients leave for the staff.

px.scatter(x='total_bill', y='tip', data_frame=df_tips)

Another type of visualization for 2 continuous variables:

px.density_contour(x='total_bill', y='tip', data_frame=df_tips)

Numerical & Categorical

df_tips[['day', 'total_bill']]

We can summarise the data around how much revenue was generated in each day of the week.

df_tips.groupby('day').total_bill.sum()

day
Thur    1096.33
Fri      325.88
Sat     1778.40
Sun     1627.16
Name: total_bill, dtype: float64

sr_days = df_tips.groupby('day').total_bill.sum()

We can observe that Saturday is the most profitable day as people have spent more money.

px.bar(x=sr_days.index, y=sr_days.values)

px.bar(x=sr_days.index, y=sr_days.values, color=sr_days.index)

Categorical & Categorical

df_tips[['day', 'size']]

Which combination of day-size is the most frequent table you can observe in the restaurant?

The following plot shows that Saturdays with 2 people at the table is the most common phenomenon at the restaurant.

They could create an advertisement that targets couples to have dinner on Saturdays and make more money.

px.density_heatmap(x='day', y='size', data_frame=df_tips)

Awesome Plots

The following examples are taken directly from plotly.

df_gapminder = px.data.gapminder()
px.scatter_geo(df_gapminder, locations="iso_alpha", color="continent",
               hover_name="country", size="pop",
               animation_frame="year",
               projection="natural earth")

import plotly.express as px
df = px.data.election()
geojson = px.data.election_geojson()
fig = px.choropleth_mapbox(df, geojson=geojson, color="Bergeron",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

import plotly.express as px
df = px.data.election()
geojson = px.data.election_geojson()
fig = px.choropleth_mapbox(df, geojson=geojson, color="winner",
                           locations="district", featureidkey="properties.district",
                           center={"lat": 45.5517, "lon": -73.7073},
                           mapbox_style="carto-positron", zoom=9)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

#01 | The Linear Regression & Supervised Regression Models

Jesús López — Wed, 27 Jul 2022 13:51:00 GMT

🎯 Chapter Importance

Machine Learning is all about calculating the best numbers of a mathematical equation.

The form of a Linear Regression mathematical equation is as follows:

$$y = (a) + (b) \cdot x$$

As we see in the following plot, not every mathematical equation is valid; the red line doesn't fit the real data (blue points), whereas the green one is the best.

How do we understand the development of Machine Learning models in Python to predict what may happen in the future?

This tutorial covers the topics described below using USA Car Crashes data to predict the accidents based on alcohol.

Step-by-step procedure to compute a Linear Regression:
1. .fit() the numbers of the mathematical equation
2. .predict() the future with the mathematical equation
3. .score() how good is the mathematical equation
How to visualise the Linear Regression model?
How to evaluate Regression models step by step?
- Residuals Sum of Squares
- Total Sum of Squares
- R Squared Ratio $R^2$
How to interpret the coefficients of the Linear Regression?
Compare the Linear Regression to other Machine Learning models such as:
- Random Forest
- Support Vector Machines
Why don't we need to know the maths behind every model to apply Machine Learning in Python?

💽 Load the Data

This dataset contains statistics about Car Accidents (columns)
In each one of USA States (rows)

Visit this website if you want to know the measures of the columns.

import seaborn as sns
df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')[['alcohol', 'total']]
df_crashes.rename({'total': 'accidents'}, axis=1, inplace=True)
df_crashes

🤖 How do we compute a Linear Regression Model in Python?

As always, we need to use a function

Where is the function?

It should be in a library

Which is the Python library for Machine Learning?

Sci-Kit Learn, see website

Import the Class

How can we access the function to compute a Linear Regression model?

We need to import the LinearRegression class within linear_model module:

from sklearn.linear_model import LinearRegression

Instantiate the Class

Now, we create an instance model_lr of the class LinearRegression:

model_lr = LinearRegression()

Fit the Model

Which function applies the Linear Regression algorithm in which the Residual Sum of Squares is minimised?

model_lr.fit()

TypeError Traceback (most recent call last)
Input In [186], in ()----> 1 model_lr.fit()
TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

Why is it asking for two parameters: y and X?

The algorithm must distinguish between the variable we want to predict (y), and the variables used to explain (X) the prediction.

y: target ~ dependent ~ label ~ class variable
X: features ~ independent ~ explanatory variables

Separate the Variables

target = df_crashes['accidents']
features = df_crashes[['alcohol']]

Fit the Model Again

model_lr.fit(X=features, y=target)

LinearRegression()
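For a single explanatory variable, the numbers a and b that minimise the Residual Sum of Squares have a well-known closed form. The sketch below computes them in plain Python on hypothetical points; fit_line is an illustrative helper, not the code scikit-learn runs internally.

```python
def fit_line(x, y):
    """Least-squares estimates of a (intercept) and b (slope) in y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept: forces the line through the point of averages
    a = mean_y - b * mean_x
    return a, b

# Hypothetical points lying exactly on y = 1 + 2x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

a, b = fit_line(x, y)
prediction = a + b * 5   # predicting is just plugging x into the equation
print(a, b, prediction)
```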

Predictions

Calculate the Predictions

Take the historical data:

features

To calculate predictions through the Model's Mathematical Equation:

model_lr.predict(X=features)

array([17.32111171, 15.05486718, 16.44306899, 17.69509287, 12.68699734, 13.59756016, 13.76016066, 15.73575679, 9.0955587 , 16.40851638, 13.78455074, 20.44100889, 14.87600663, 14.70324359, 14.40446516, 13.8353634 , 14.54064309, 15.86177218, 19.6076813 , 15.06502971, 13.98780137, 11.69106925, 13.88211104, 11.5162737 , 16.94713055, 16.98371566, 24.99585551, 16.45729653, 15.41868581, 12.93089809, 12.23171592, 15.95526747, 13.10772614, 16.44306899, 26.26007443, 15.60161138, 17.58737003, 12.62195713, 17.32517672, 14.43088774, 25.77430543, 18.86988151, 17.3515993 , 20.84141263, 9.53254755, 14.15040187, 12.82724027, 12.96748321, 19.40239816, 15.11380986, 17.17477126])

Add a New Column with the Predictions

Can you see the difference between reality and prediction?

Model predictions aren't perfect; they don't predict the real data exactly. Nevertheless, they make a fair approximation allowing decision-makers to understand the future better.

df_crashes['pred_lr'] = model_lr.predict(X=features)
df_crashes

Model Visualization

The orange dots reference the predictions lined up in a line because the Linear Regression model calculates the best coefficients (numbers) for a line's mathematical equation based on historical data.

import matplotlib.pyplot as plt

sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes);

We have orange dots for the alcohol values represented in our DataFrame. Were we to make estimations about all possible alcohol numbers, we'd get a sequence of consecutive points, which would form a line. Let's draw it with the .lineplot() function:

sns.scatterplot(x='alcohol', y='accidents', data=df_crashes)
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes);
sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange');

Model's Score

Calculate the Score

To measure the quality of the model, we use the .score() function, which summarises the difference between the model's predictions and reality in a single number.

model_lr.score(X=features, y=target)

0.7269492966665405

Explain the Score

Residuals

The step-by-step procedure of the previous calculation starts with the difference between reality and predictions:

df_crashes['accidents'] - df_crashes['pred_lr']

abbrev
AL    1.478888
AK    3.045133
        ...
WI   -1.313810
WY    0.225229
Length: 51, dtype: float64

This difference is usually called residuals:

df_crashes['residuals'] = df_crashes['accidents'] - df_crashes['pred_lr']
df_crashes

We cannot look at every residual separately to tell how good our model is; we need a single number. Therefore, we add them up:

df_crashes.residuals.sum()

1.4033219031261979e-13

Let's round to two decimal places to suppress the scientific notation:

df_crashes.residuals.sum().round(2)

0.0

But we get ZERO. Why?

The residuals contain positive and negative numbers; some points are above the line, and others are below the line.
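The cancellation can be reproduced on a handful of numbers. The sketch below uses hypothetical toy data (not the crashes dataset) and fits the line with `numpy.polyfit` instead of the article's `model_lr`, which is an assumption made only to keep the example small:

```python
import numpy as np

# Hypothetical toy data, not the crashes dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line: y = intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)
predictions = intercept + slope * x

residuals = y - predictions

# Positive and negative residuals cancel out (up to floating-point noise)
residuals_sum = residuals.sum()

# Squaring removes the sign, so the total no longer cancels
squared_sum = (residuals ** 2).sum()

print(residuals)
print(residuals_sum, squared_sum)
```

For a least-squares line with an intercept, the residuals always add up to (almost) zero, which is why the plain sum says nothing about the quality of the fit.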

To turn negative values into positive values, we square the residuals:

df_crashes['residuals^2'] = df_crashes.residuals**2
df_crashes

And finally, add the squared residuals up to calculate the Residual Sum of Squares (RSS):

df_crashes['residuals^2'].sum()

231.96888653310063

RSS = df_crashes['residuals^2'].sum()

$$RSS = \sum(y_i - \hat{y}_i)^2$$

where

$y_i$ is the real number of accidents
$\hat{y}_i$ is the predicted number of accidents
RSS: Residual Sum of Squares

Target's Variation

The model was made to predict the number of accidents.

We should ask: how large are the model's errors compared to the variation of the real data (the real number of accidents)?

We have already calculated the variation of the model's errors (RSS). Now we calculate the variation of the real data by comparing each accident value to the average:

df_crashes.accidents

abbrev
AL    18.8
AK    18.1
      ...
WI    13.8
WY    17.4
Name: accidents, Length: 51, dtype: float64

df_crashes.accidents.mean()

15.79019607843137

$$y_i - \bar y$$

where $y_i$ is the number of accidents and $\bar y$ is the average number of accidents

df_crashes.accidents - df_crashes.accidents.mean()

abbrev
AL    3.009804
AK    2.309804
        ...
WI   -1.990196
WY    1.609804
Name: accidents, Length: 51, dtype: float64

df_crashes['real_residuals'] = df_crashes.accidents - df_crashes.accidents.mean()
df_crashes

We square these differences for the same reason as before (to convert negative values into positive ones):

df_crashes['real_residuals^2'] = df_crashes.real_residuals**2

$$TSS = \sum(y_i - \bar y)^2$$

where

$y_i$ is the number of accidents
$\bar y$ is the average number of accidents
TSS: Total Sum of Squares

And we add up the values to get the Total Sum of Squares (TSS):

df_crashes['real_residuals^2'].sum()

849.5450980392156

TSS = df_crashes['real_residuals^2'].sum()

The Ratio

The ratio between RSS and TSS represents the share of the real data's variation that our model fails to explain.

RSS/TSS

0.2730507033334595

0.27 is the badness of the model, as RSS represents the residuals (errors) of the model.

To calculate the goodness of the model, we need to subtract the ratio RSS/TSS from 1:

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar y)^2}$$

1 - RSS/TSS

0.7269492966665405

The model explains 72.69% of the variability in the number of accidents.
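The whole calculation fits in a small function. The following is a minimal sketch: `r_squared` is a hypothetical helper name, and `y_real` holds made-up values chosen so the two extreme cases are easy to check by hand.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - RSS/TSS for two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = ((y_true - y_pred) ** 2).sum()          # errors of the model
    tss = ((y_true - y_true.mean()) ** 2).sum()   # variation of the real data
    return 1 - rss / tss

y_real = [3.0, 5.0, 7.0, 9.0]  # hypothetical values, average 6.0

# Predictions identical to reality: nothing is left unexplained
perfect = r_squared(y_real, y_real)

# Always predicting the average: no better than knowing the mean
baseline = r_squared(y_real, [6.0, 6.0, 6.0, 6.0])

# The RSS and TSS figures reported above plug into the same formula
from_article = 1 - 231.96888653310063 / 849.5450980392156

print(perfect, baseline, from_article)
```

The two extremes give a feel for the scale: a model that reproduces the data scores 1, and a model that only knows the average scores 0.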

The following image describes how we calculate the goodness of the model.

Model Interpretation

How do we get the numbers of the mathematical equation of the Linear Regression?

We need to look inside the object model_lr and show the attributes with .__dict__ (the numbers were computed with the .fit() function):

model_lr.__dict__

{'fit_intercept': True, 'normalize': 'deprecated', 'copy_X': True, 'n_jobs': None, 'positive': False, 'feature_names_in_': array(['alcohol'], dtype=object), 'n_features_in_': 1, 'coef_': array([2.0325063]), '_residues': 231.9688865331006, 'rank_': 1, 'singular_': array([12.22681605]), 'intercept_': 5.857776154826299}

intercept_ is the (a) number of the mathematical equation
coef_ is the (b) number of the mathematical equation

$$\begin{aligned}
accidents &= (a) + (b) \cdot alcohol \\
accidents &= (intercept\_) + (coef\_) \cdot alcohol \\
accidents &= (5.857) + (2.032) \cdot alcohol
\end{aligned}$$

For every one-unit increase in alcohol, the model predicts that the number of accidents increases by 2.032 units.

import pandas as pd

df_to_pred = pd.DataFrame({'alcohol': [1,2,3,4,5]})
df_to_pred['pred_lr'] = 5.857 + 2.032 * df_to_pred.alcohol
df_to_pred['diff'] = df_to_pred.pred_lr.diff()
df_to_pred

🚀 Other Regression Models

Could we make a better model that improves the current Linear Regression Score?

model_lr.score(X=features, y=target)

0.7269492966665405

Let's try a Random Forest and a Support Vector Machine.

Do we need to know the maths behind these models to implement them in Python?

No. As we explain in this tutorial, all you need to do is:
1. .fit()
2. .predict()
3. .score()
4. Repeat

`RandomForestRegressor()` in Python

Fit the Model

from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor()
model_rf.fit(X=features, y=target)

RandomForestRegressor()

Calculate Predictions

model_rf.predict(X=features)

array([18.644 , 16.831 , 17.54634286, 21.512 , 12.182 , 13.15 , 12.391 , 17.439 , 7.775 , 17.74664286, 14.407 , 18.365 , 15.101 , 14.132 , 13.553 , 15.097 , 15.949 , 19.857 , 21.114 , 15.53 , 13.241 , 8.98 , 14.363 , 9.54 , 17.208 , 16.593 , 22.087 , 16.24144286, 14.478 , 11.51 , 11.59 , 18.537 , 11.77 , 17.54634286, 23.487 , 14.907 , 20.462 , 12.59 , 18.38 , 12.449 , 23.487 , 20.311 , 19.004 , 19.22 , 9.719 , 13.476 , 12.333 , 11.08 , 22.368 , 14.67 , 17.966 ])

df_crashes['pred_rf'] = model_rf.predict(X=features)

Model's Score

model_rf.score(X=features, y=target)

0.9549469198566546

Let's create a dictionary that stores the Score of each model:

dic_scores = {}
dic_scores['lr'] = model_lr.score(X=features, y=target)
dic_scores['rf'] = model_rf.score(X=features, y=target)

`SVR()` in Python

Fit the Model

from sklearn.svm import SVR

model_sv = SVR()
model_sv.fit(X=features, y=target)

SVR()

Calculate Predictions

model_sv.predict(X=features)

array([18.29570777, 15.18462721, 17.2224187 , 18.6633175 , 12.12434781, 13.10691581, 13.31612684, 16.21131216, 12.66062465, 17.17537208, 13.34820949, 19.38920329, 14.91415215, 14.65467023, 14.2131504 , 13.41560202, 14.41299448, 16.39752499, 19.4896662 , 15.20002787, 13.62200798, 11.5390483 , 13.47824339, 11.49818909, 17.87053595, 17.9144274 , 19.60736085, 17.24170425, 15.73585463, 12.35136579, 11.784815 , 16.53431108, 12.53373232, 17.2224187 , 19.4773929 , 16.01115736, 18.56379706, 12.06891287, 18.30002795, 14.25171609, 19.59597679, 19.37950461, 18.32794218, 19.29994413, 12.26345665, 13.84847453, 12.25128025, 12.38791686, 19.48212198, 15.27397732, 18.1357253 ])

df_crashes['pred_sv'] = model_sv.predict(X=features)

Model's Score

model_sv.score(X=features, y=target)

0.7083438012012769

dic_scores['sv'] = model_sv.score(X=features, y=target)

💪 Which One Is the Best? Why?

The Random Forest gets the highest Score, 0.95, on the same data the models were fitted with:

pd.Series(dic_scores).sort_values(ascending=False)

rf    0.954947
lr    0.726949
sv    0.708344
dtype: float64
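One caveat is worth sketching: all three Scores above are computed on the data used in .fit(), and flexible models tend to look better on data they have already seen. The example below is an illustration under stated assumptions, not the article's method: the data is synthetic, the split is a plain slice, and `memoriser_predict` is a hypothetical stand-in for a very flexible model (it returns the target of the closest training point) rather than a Random Forest.

```python
import numpy as np

def r_squared(y_true, y_pred):
    rss = ((y_true - y_pred) ** 2).sum()
    tss = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - rss / tss

rng = np.random.default_rng(0)

# Synthetic data (an assumption): a linear trend plus noise
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(0, 3, size=200)

# First 150 points are used to fit; the last 50 are held out
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def memoriser_predict(x_new):
    """Return the target of the closest training point for each new x."""
    nearest = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

# Straight line fitted on the training part only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

scores = {
    'memoriser_train': r_squared(y_train, memoriser_predict(x_train)),
    'memoriser_test': r_squared(y_test, memoriser_predict(x_test)),
    'line_train': r_squared(y_train, intercept + slope * x_train),
    'line_test': r_squared(y_test, intercept + slope * x_test),
}

for name, value in scores.items():
    print(name, round(value, 3))
```

The memoriser reproduces its training data exactly, so its training Score is perfect, while its Score on the held-out points is lower. Comparing models on held-out data is therefore a more cautious way to decide which one is best.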

📊 Visualise the 3 Models

Let's put the following data:

df_crashes[['accidents', 'pred_lr', 'pred_rf', 'pred_sv']]

Into a plot:

sns.scatterplot(x='alcohol', y='accidents', data=df_crashes, label='Real Data')
sns.scatterplot(x='alcohol', y='pred_lr', data=df_crashes, label='Linear Regression')
sns.lineplot(x='alcohol', y='pred_lr', data=df_crashes, color='orange')
sns.scatterplot(x='alcohol', y='pred_rf', data=df_crashes, label='Random Forest')
sns.scatterplot(x='alcohol', y='pred_sv', data=df_crashes, label='Support Vector Machines');

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

#03 | Grouping & Pivot Tables

Jesús López — Fri, 22 Jul 2022 14:41:28 GMT

Jesús López 2022

Ask him any questions on Twitter or LinkedIn

Possibilities

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information about transactions at a restaurant's tables (rows), including socio-demographic and economic variables (columns).

import seaborn as sns

df_tips = sns.load_dataset('tips')
df_tips

Grouping data to summarise the information helps you draw conclusions. For example, the summary below shows that Sunday dinners bring the best customers because they:

Spend more on average ($21.41)
Give more tips on average ($3.25)
Bring more people to the same table on average (2.84)

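# numeric_only=True skips the categorical columns (sex, smoker), which recent pandas versions refuse to average
df_tips.groupby(by=['day', 'time'])\
    .mean(numeric_only=True)\
    .fillna(0)\
    .style.format('{:.2f}').background_gradient(axis=0)

df_tips.groupby(by=['day', 'time'])\
    .mean(numeric_only=True)\
    .fillna(0)\
    .style.format('{:.2f}').bar(axis=0, width=50, align='zero')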

Let's dig into the details of the .groupby() function from the basics in the following sections.

Grouping by 1 Column

We use the .groupby() function to generate an object that contains as many DataFrames as there are categories in the column.

df_tips.groupby('sex')

As we have two groups in sex (Female and Male), the length of the DataFrameGroupBy object returned by the groupby() function is 2:

len(df_tips.groupby('sex'))

How can we work with the object DataFrameGroupBy?

Calculate the Average for All Columns

We use the .mean() function to get the average of the numerical columns for the two groups:

df_tips.groupby('sex').mean(numeric_only=True)

A pretty and simple syntax to summarise the information, right?

But what's going on inside the DataFrameGroupBy object?

df_tips.groupby('sex')

df_grouped = df_tips.groupby('sex')

The DataFrameGroupBy object contains 2 DataFrames. To see one of them, use the function .get_group() and pass the group whose DataFrame you'd like to return:

df_grouped.get_group('Male')

df_grouped.get_group('Female')

As the DataFrameGroupBy object distinguishes the categories, applying an aggregation function computes the mathematical operation for each group separately:

df_grouped.mean(numeric_only=True)

We could apply the function to each DataFrame separately, although that is not the point of the .groupby() function.
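To see that both routes give the same numbers, here is a small sketch on a hypothetical miniature table (`df_demo` is made up for the example; it is not the tips dataset). It loops over the groups by hand and compares the result with the one-line aggregation.

```python
import pandas as pd

# Hypothetical miniature version of the tips table
df_demo = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'total_bill': [20.0, 10.0, 30.0, 14.0, 25.0],
    'tip': [3.0, 1.5, 4.5, 2.5, 3.0],
})

grouped = df_demo.groupby('sex')

# Manual route: iterate over (group name, group DataFrame) pairs,
# average each group separately, and stack the results
manual = pd.DataFrame({
    name: group.mean(numeric_only=True) for name, group in grouped
}).transpose()

# One-line route: let the grouped object do the loop
automatic = grouped.mean(numeric_only=True)

print(manual)
print(automatic)
```

Iterating over the grouped object is handy for inspection, while the aggregation method is the shorter option for everyday summaries.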

df_grouped.get_group('Male').mean(numeric_only=True)

df_grouped.get_group('Female').mean(numeric_only=True)

Compute Functions to 1 Column

To get the results for just 1 column of interest, we access the column:

df_grouped.total_bill

And use the aggregation function we wish, .sum() in this case:

df_grouped.total_bill.sum()

We get the result for just 1 column (total_bill) because the DataFrames generated at .groupby() are accessed as if they were simple DataFrames:

df_grouped.get_group('Female')

df_grouped.get_group('Female').total_bill

df_grouped.get_group('Female').total_bill.sum()

df_grouped.get_group('Male').total_bill.sum()

df_grouped.total_bill.sum()

Grouping by 2 Columns

So far, we have summarised the data based on the categories of just one column. But, what if we'd like to summarise the data based on the combinations of the categories within different categorical columns?

Compute 1 Function

df_tips.groupby(by=['day', 'smoker']).sum(numeric_only=True)

Pivot Tables

We could have also used another function .pivot_table() to get the same numbers:

df_tips.pivot_table(index='day', columns='smoker', aggfunc='sum')

Which one is best?

I leave it up to you; I prefer .pivot_table() because its syntax is more accessible.
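The link between the two functions can be sketched on a hypothetical miniature table (`df_demo` is made up for the example): .groupby() returns a long result with a two-level index, and .unstack() moves the inner level into the columns, which is the layout .pivot_table() produces directly.

```python
import pandas as pd

# Hypothetical miniature version of the tips table
df_demo = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sat', 'Sun'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'total_bill': [10.0, 20.0, 30.0, 40.0, 5.0, 15.0],
})

# Long format: one row per (day, smoker) combination
long_result = df_demo.groupby(['day', 'smoker'])['total_bill'].sum()

# Wide format: smoker categories become the columns
wide_from_groupby = long_result.unstack()

# Same wide layout in a single call
wide_from_pivot = df_demo.pivot_table(
    index='day', columns='smoker', values='total_bill', aggfunc='sum'
)

print(wide_from_groupby)
print(wide_from_pivot)
```

Which one to use is mostly a matter of taste: the long format is convenient for further chaining, the wide one for reading and styling.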

Compute More than 1 Function

It doesn't stop here; we can even compute several aggregation functions at the same time:

Groupby

df_tips.groupby(by=['day', 'smoker'])\
    .total_bill\
    .agg(func=['sum', 'mean'])

Pivot Table

df_tips.pivot_table(index='day', columns='smoker',
                    values='total_bill', aggfunc=['sum', 'mean'])

dfres = df_tips.pivot_table(index='day', columns='smoker',
                    values='total_bill', aggfunc=['sum', 'mean'])

You could even style the output DataFrame:

dfres.style.background_gradient()

For me, it's nicer than styling the .groupby() returned DataFrame.

As we say in Spain:

Pa' gustos los colores!

df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])

dfres = df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])

dfres.style.background_gradient()

Pivot Tables in Depth

We can compute more than one mathematical operation:

df_tips.pivot_table(index='sex', columns='time',
                    aggfunc=['sum', 'mean'], values='total_bill')

And use more than one column in each of the parameters:

df_tips.pivot_table(index='sex', columns='time',
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

df_tips.pivot_table(index=['day', 'smoker'], columns='time',
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

df_tips.pivot_table(index=['day', 'smoker'], columns=['time', 'sex'],
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

The `.size()` Function

`.groupby()`

1 Variable to Group By

The .size() function counts the number of rows (observations) in each of the DataFrames generated by .groupby().

df_grouped.size()

2 Variables to Group By

df_tips.groupby(by=['sex', 'time']).size()

`.pivot_table()`

We can use .pivot_table() to represent the data more clearly:

df_tips.pivot_table(index='sex', columns='time', aggfunc='size')

Other Example 1

df_tips.pivot_table(index='smoker', columns=['day', 'sex'],aggfunc='size')

dfres = df_tips.pivot_table(index='smoker', columns=['day', 'sex'], aggfunc='size')

dfres.style.background_gradient()

Other Example 2

df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')

dfres = df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')

dfres.style.background_gradient()

We can even choose the direction in which the colour gradient is computed:

axis=1: compares the values across the columns of the same row
axis=0: compares the values across the rows of the same column

dfres.style.background_gradient(axis=1)

#02 | Load Data from APIs to a Pandas DataFrame in Python

Jesús López — Mon, 18 Jul 2022 19:19:20 GMT

Introduction

The following image is pretty self-explanatory to understand how APIs work:

The API is the waiter who
Takes the requests from the clients
Takes them to the kitchen
And later serves the "cooked" response back to the clients

The Uniform Resource Locator (URL)

The URL is an address we use to locate files on the Internet:

Documents: pdf, ppt, docx,...
Multimedia: mp4, mp3, mov, png, jpeg,...
Data Files: csv, json, db,...

Check out the following gif where we inspect the resources we download when locating https://economist.com.

URL - Watch Video

The API

An Application Programming Interface (API) is a communication tool between the client and the server that carries information through a URL.

The API defines the rules by which the URL will work. Like Python, the API contains:

Functions
Parameters
Accepted Values

The only extra knowledge we need to consider is the use of tokens.

A token is a code you use in the request to validate your identity, as most platforms charge money to use their API.

Get a token from AlphaVantage and store it into a Python variable.

token = 'PASTE_YOUR_TOKEN_HERE'

Look for an API Call Example

In the website documentation.

'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'

The API's Response

Every time you make a call to an API requesting some information, you later receive a response.

Check this JSON, a type of file that stores structured data returned by the API.

If you want to know more about the JSON file, see article.

The pattern:

Base API: https://www.alphavantage.co/query?
Parameters:
- function=TIME_SERIES_INTRADAY
- symbol=IBM
- interval=5min
- apikey=demo
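The pattern above (a base address followed by `name=value` pairs joined with `&`) can be assembled with Python's standard library. This is a minimal sketch that uses only the demo values shown in this tutorial; the variable names `base_api` and `parameters` are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_api = 'https://www.alphavantage.co/query?'
parameters = {
    'function': 'TIME_SERIES_INTRADAY',
    'symbol': 'IBM',
    'interval': '5min',
    'apikey': 'demo',
}

# urlencode joins the pairs as name=value&name=value...
api_call = base_api + urlencode(parameters)

# Going the other way: split a URL back into its parameters
recovered = parse_qs(urlsplit(api_call).query)

print(api_call)
print(recovered)
```

Keeping the parameters in a dictionary makes it easy to change one value, such as the symbol, without editing the text of the URL by hand. The requests library can also receive such a dictionary through the `params` argument of `requests.get()`.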

API's Data Response to Python

Could you request the file from Python?

import requests

api_call = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'
requests.get(url=api_call)

>>> <Response [200]>

res = requests.get(url=api_call)

The function returns an object containing all the information related to the API request and response.

res.apparent_encoding

>>> 'ascii'

res.headers

>>> {'Date': 'Mon, 18 Jul 2022 18:01:19 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Cookie', 'X-Frame-Options': 'SAMEORIGIN', 'Allow': 'GET, HEAD, OPTIONS', 'Via': '1.1 vegur', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '72cd1f3959323851-MAD', 'Content-Encoding': 'gzip'}

res.history

>>> []

To place the response object into a Python interpretable object, we need to use the function .json() to get a dictionary with the data.

res.json()

>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume',  '2. Symbol': 'IBM',  '3. Last Refreshed': '2022-06-29 19:25:00',  '4. Interval': '5min',  '5. Output Size': 'Compact',  '6. Time Zone': 'US/Eastern'}, 'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100',   '2. high': '140.7100',   '3. low': '140.7100',   '4. close': '140.7100',   '5. volume': '531'},   ...  '2022-06-28 17:25:00': {'1. open': '142.1500',   '2. high': '142.1500',   '3. low': '142.1500',   '4. close': '142.1500',   '5. volume': '100'}}}

data = res.json()

The data in the dictionary represents the symbol IBM in intervals of 5min for the TIME_SERIES_INTRADAY.

Check the dictionary above to confirm.

res.request.path_url

>>> '/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo'

What can we change to get the information about the Apple Stock (AAPL)?

We need to change the value of the parameter symbol within the URL we use to call the API:

stock = 'AAPL'
api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey=demo'
res = requests.get(url=api_call)
res.json()

>>> {'Information': 'The **demo** API key is for demo purposes only. Please claim your free API key at (https://www.alphavantage.co/support/#api-key) to explore our full API offerings. It takes fewer than 20 seconds.'}

Why isn't it displaying the information of the Apple stock? How can you solve the problem?

The API returns a JSON explaining that the *demo API key* is for demo purposes only: it worked for the symbol IBM, but it cannot retrieve the AAPL stock data.
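Since the API answers with a dictionary in both cases, it can help to check for the expected key before going further. The sketch below is an assumption-laden illustration: `extract_time_series` is a hypothetical helper, and the two dictionaries are shortened literals shaped like the responses shown above, so no request is sent.

```python
def extract_time_series(payload, key='Time Series (5min)'):
    """Return the time-series part of a response dictionary.

    Hypothetical helper: raises ValueError carrying the API's own
    message when the expected key is missing.
    """
    if key not in payload:
        message = payload.get('Information', 'unexpected response format')
        raise ValueError(f'No {key!r} in the response: {message}')
    return payload[key]

# Shortened literals shaped like the two responses above
good_payload = {
    'Meta Data': {'2. Symbol': 'IBM'},
    'Time Series (5min)': {'2022-06-29 19:25:00': {'4. close': '140.7100'}},
}
demo_payload = {'Information': 'The **demo** API key is for demo purposes only.'}

series = extract_time_series(good_payload)

try:
    extract_time_series(demo_payload)
    error_text = None
except ValueError as err:
    error_text = str(err)

print(series)
print(error_text)
```

Failing early with the API's message is usually easier to debug than a KeyError several lines later.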

We should include our token in the API call:

token

>>> 'YOUR_PASTED_TOKEN_ABOVE'

api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'
res = requests.get(url=api_call)
data = res.json()
data

>>> {'Meta Data': {'1. Information': 'Intraday (5min) open, high, low, close prices and volume',  '2. Symbol': 'AAPL',  '3. Last Refreshed': '2022-07-15 20:00:00',  '4. Interval': '5min',  '5. Output Size': 'Compact',  '6. Time Zone': 'US/Eastern'}, 'Time Series (5min)': {'2022-06-29 19:25:00': {'1. open': '140.7100',   '2. high': '140.7100',   '3. low': '140.7100',   '4. close': '140.7100',   '5. volume': '531'},   ...  '2022-06-28 17:25:00': {'1. open': '142.1500',   '2. high': '142.1500',   '3. low': '142.1500',   '4. close': '142.1500',   '5. volume': '100'}}}

Can we make plots and mathematical operations with the object `data`? Why?

data contains a dictionary, which is a very simple Python object.

data.sum()

>>>
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 data.sum()

AttributeError: 'dict' object has no attribute 'sum'

API's Data Response to a DataFrame

We need to create a DataFrame out of this dictionary to have a powerful object we could use to apply many functions.

import dataframe_image as dfi

import pandas as pd

pd.DataFrame(data=data)

Filter the Information in the Response

We'd like to have the open, high, close,... variables as the columns, not Meta Data and Time Series (5min). Why is this happening?

Meta Data and Time Series (5min) are the keys of the dictionary data.
The value of the key Time Series (5min) is the information we want in the DataFrame.

data['Time Series (5min)']

>>> {'2022-07-15 20:00:00': {'1. open': '150.0300',  '2. high': '150.0700',  '3. low': '150.0300',  '4. close': '150.0300',  '5. volume': '4752'},  ... '2022-06-28 17:25:00': {'1. open': '142.1500',  '2. high': '142.1500',  '3. low': '142.1500',  '4. close': '142.1500',  '5. volume': '100'}

pd.DataFrame(data['Time Series (5min)'])

df_apple = pd.DataFrame(data['Time Series (5min)'])

Preprocess the DataFrame

The DataFrame is not represented as we'd like because the Dates are in the columns and the variables are in the index. So which function can we use to transpose the DataFrame?

df_apple.transpose()

df_apple = df_apple.transpose()
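As an alternative sketch, pandas can read the outer keys of a nested dictionary as the index in one step with `DataFrame.from_dict(..., orient='index')`. The dictionary below is a shortened literal shaped like `data['Time Series (5min)']`, so the example runs without calling the API.

```python
import pandas as pd

# Shortened literal shaped like data['Time Series (5min)']
time_series = {
    '2022-07-15 20:00:00': {'1. open': '150.0300', '4. close': '150.0300'},
    '2022-07-15 19:55:00': {'1. open': '150.0500', '4. close': '150.0700'},
}

# Two steps, as in the tutorial: dates land in the columns, then transpose
df_two_steps = pd.DataFrame(time_series).transpose()

# One step: treat the outer keys (the dates) as the index
df_one_step = pd.DataFrame.from_dict(time_series, orient='index')

print(df_two_steps)
print(df_one_step)
```

Both routes lead to one row per timestamp and one column per variable; the values are still text at this point.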

Let's get the average value from the close price:

df_apple['4. close']

>>>
2022-07-15 20:00:00    150.0300
2022-07-15 19:55:00    150.0700
                         ...
2022-07-15 11:45:00    149.1500
2022-07-15 11:40:00    149.1100
Name: 4. close, Length: 100, dtype: object

df_apple['4. close'].mean()

>>>
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniforge3/lib/python3.9/site-packages/pandas/core/nanops.py:1622, in _ensure_numeric(x)
   1621 try:
-> 1622     x = float(x)
   1623 except (TypeError, ValueError):

ValueError: could not convert string to float: '150.0300150.0700150.0400...149.2000149.1500149.1100'

(...)

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Input In [38], in <cell line: 1>()
----> 1 df_apple['4. close'].mean()

(...)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/nanops.py:1629, in _ensure_numeric(x)
-> 1629             raise TypeError(f"Could not convert {x} to numeric") from err

TypeError: Could not convert 150.0300150.0700150.0400...149.2000149.1500149.1100 to numeric

Why are we getting this ugly error?

The values of the Series aren't numerical objects; they are strings, so adding them up glues the text together instead of computing a number.
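The long value in the error message is what you get when text is "added". A minimal sketch with three hypothetical prices stored as strings (object dtype, as the API delivers them):

```python
import pandas as pd

# Shortened, hypothetical slice of the '4. close' column, stored as text
close_as_text = pd.Series(['150.0300', '150.0700', '150.0400'], dtype=object)

# Adding strings concatenates them
glued = close_as_text.sum()

# Converting first gives numbers that can be averaged
close_as_numbers = pd.to_numeric(close_as_text)
average = close_as_numbers.mean()

print(glued)
print(close_as_numbers.dtype, average)
```

The mean needs that sum divided by the number of rows, and a piece of text cannot be divided, hence the TypeError.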

df_apple.dtypes

>>>
1. open      object
2. high      object
3. low       object
4. close     object
5. volume    object
dtype: object

Can you change the type of the values into numerical objects?

df_apple = df_apple.apply(pd.to_numeric)

Now that we have the Series values as numerical objects:

df_apple.dtypes

>>>
1. open      float64
2. high      float64
3. low       float64
4. close     float64
5. volume      int64
dtype: object

We should be able to get the average close price:

df_apple['4. close'].mean()

>>> 149.551566

What else could we do?

df_apple.hist();

df_apple.hist(layout=(2,3), figsize=(15,8));

Recap

import requests
import pandas as pd

token = 'PASTE_YOUR_TOKEN_HERE'
stock = 'AAPL'
api_call = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={stock}&interval=5min&apikey={token}'

# Request the data and turn the JSON response into a dictionary
res = requests.get(url=api_call)
data = res.json()

# Build the DataFrame, put the dates in the rows and convert the text to numbers
df_apple = pd.DataFrame(data=data['Time Series (5min)'])
df_apple = df_apple.transpose()
df_apple = df_apple.apply(pd.to_numeric)
df_apple.hist(layout=(2,3), figsize=(15,8));

Other Example

size = 'full'
info_type = 'TIME_SERIES_DAILY'
api_call = f'https://www.alphavantage.co/query?function={info_type}&symbol={stock}&outputsize={size}&apikey={token}'

res = requests.get(url=api_call)
data = res.json()

df_apple_daily = pd.DataFrame(data['Time Series (Daily)'])
df_apple_daily = df_apple_daily.transpose()
df_apple_daily = df_apple_daily.apply(pd.to_numeric)
df_apple_daily.index = pd.to_datetime(df_apple_daily.index)
df_apple_daily.plot.line(layout=(2,3), figsize=(15,8), subplots=True);

#01 | Getting Started with Pandas

Jesús López — Mon, 18 Jul 2022 08:04:14 GMT

Introduction

Programming is all about working with data.

We can work with many types of data structures. Nevertheless, the pandas DataFrame is the most useful because it contains functions that automate a lot of work with a single line of code.

This tutorial will teach you how to work with the pandas.DataFrame object.

First, we will demonstrate why working with simple arrays (what most people do) makes your life more difficult than it should be.

The Array

An array is any object that can store more than one object. For example, the list:

[100, 134, 87, 99]

Let's say we are talking about the revenue our e-commerce has had over the last 4 months:

list_revenue = [100, 134, 87, 99]

We want to calculate the total revenue (i.e., we sum up the objects within the list):

list_revenue.sum()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 list_revenue.sum()

AttributeError: 'list' object has no attribute 'sum'

The list is a basic object that doesn't come with powerful functions such as .sum().

What can we do then?

We convert the list to a more powerful object such as the Series, which comes from the pandas library.

import pandas

pandas.Series(list_revenue)

>>>
0    100
1    134
2     87
3     99
dtype: int64

series_revenue = pandas.Series(list_revenue)

Now we have a powerful object that can perform the .sum():

series_revenue.sum()

>>> 420
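For completeness, a plain list can still be totalled with Python's built-in `sum()` function; what the list lacks is the method and everything that comes with it. A short sketch using the same revenue figures:

```python
import pandas

list_revenue = [100, 134, 87, 99]

# The built-in function works on plain lists
total_builtin = sum(list_revenue)

# The Series offers the same total as a method, plus other summaries
series_revenue = pandas.Series(list_revenue)
total_series = series_revenue.sum()
average_series = series_revenue.mean()

print(total_builtin, total_series, average_series)
```

The advantage of the Series shows up as soon as you need more than a total, as the following sections illustrate.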

The Series

Within the Series, we can find more objects.

series_revenue

>>>
0    100
1    134
2     87
3     99
dtype: int64

The index

series_revenue.index

>>> RangeIndex(start=0, stop=4, step=1)

Let's change the elements of the index:

series_revenue.index = ['1st Month', '2nd Month', '3rd Month', '4th Month']

series_revenue

>>>
1st Month    100
2nd Month    134
3rd Month     87
4th Month     99
dtype: int64

The values

series_revenue.values

>>> array([100, 134,  87,  99])

The name

series_revenue.name

The Series doesn't contain a name. Let's define it:

series_revenue.name = 'Revenue'

series_revenue

>>>
1st Month    100
2nd Month    134
3rd Month     87
4th Month     99
Name: Revenue, dtype: int64

The dtype

The values of the Series (right-hand side) share one data type (alias dtype):

series_revenue.dtype

>>> dtype('int64')

Let's change the values' dtype to be float (decimal numbers)

series_revenue.astype(float)

>>>
1st Month    100.0
2nd Month    134.0
3rd Month     87.0
4th Month     99.0
Name: Revenue, dtype: float64

series_revenue = series_revenue.astype(float)

Awesome Functions 😎

What else could we do with the Series object?

series_revenue.describe()

>>>
count      4.000000
mean     105.000000
std       20.215506
min       87.000000
25%       96.000000
50%       99.500000
75%      108.500000
max      134.000000
Name: Revenue, dtype: float64
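To see where these numbers come from, here is a minimal sketch that reproduces the rows of the summary with Python's standard statistics module instead of pandas. It assumes the default behaviour of .describe(): the sample standard deviation and linearly interpolated quartiles.

```python
import statistics

revenue = [100.0, 134.0, 87.0, 99.0]

count = len(revenue)
mean = statistics.mean(revenue)
# Sample standard deviation (divides by n - 1)
std = statistics.stdev(revenue)
# Quartiles with linear interpolation between neighbouring values
q25, q50, q75 = statistics.quantiles(revenue, n=4, method='inclusive')

print(count, mean, round(std, 6), min(revenue), q25, q50, q75, max(revenue))
```

The Series gives us all of this in a single call, which is the point of using it.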

series_revenue.plot.bar();

series_revenue.plot.barh();

series_revenue.plot.pie();

The DataFrame

The DataFrame is a set of Series.

We will create another Series series_expenses to later put them together into a DataFrame.

pandas.Series(
    data=[20, 23, 21, 18],
    index=['1st Month','2nd Month','3rd Month','4th Month'],
    name='Expenses'
)

>>>
1st Month    20
2nd Month    23
3rd Month    21
4th Month    18
Name: Expenses, dtype: int64

series_expenses = pandas.Series(
    data=[20, 23, 21, 18],
    index=['1st Month','2nd Month','3rd Month','4th Month'],
    name='Expenses'
)

pandas.DataFrame(data=[series_revenue, series_expenses])

df_shop = pandas.DataFrame(data=[series_revenue, series_expenses])

Let's transpose the DataFrame to have the variables in columns:

df_shop.transpose()

df_shop = df_shop.transpose()

The index

df_shop.index

>>> Index(['1st Month', '2nd Month', '3rd Month', '4th Month'], dtype='object')

The columns

df_shop.columns

>>> Index(['Revenue', 'Expenses'], dtype='object')

The values

df_shop.values

>>>
array([[100.,  20.],
       [134.,  23.],
       [ 87.,  21.],
       [ 99.,  18.]])

The shape

df_shop.shape

>>> (4, 2)

Awesome Functions 😎

What else could we do with the DataFrame object?

df_shop.describe()

df_shop.plot.bar();

df_shop.plot.pie(subplots=True);

df_shop.plot.line();

df_shop.plot.area();

We could also export the DataFrame to formatted data files:

df_shop.to_excel('data.xlsx')

df_shop.to_csv('data.csv')

Reading Data Tables from Files

JSON

Football Players

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/football_players_stats.json'
pandas.read_json(url, orient='index')

df_football = pandas.read_json(url, orient='index')

df_football.Goals.plot.pie();

Tennis Players

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/best_tennis_players_stats.json'
pandas.read_json(path_or_buf=url, orient='index')

df_tennis = pandas.read_json(path_or_buf=url, orient='index')

df_tennis.style.background_gradient()

df_tennis.plot.pie(subplots=True, layout=(2,3), figsize=(10,6));

HTML Web Page

pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]

df_laliga = pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]

df_laliga.Pts.plot.barh();

df_laliga.Pts.sort_values().plot.barh();

CSV

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/internet_usage_spain.csv'
pandas.read_csv(filepath_or_buffer=url)

df_internet = pandas.read_csv(filepath_or_buffer=url)

df_internet.hist();

df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')

dfres = df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')

dfres.style.background_gradient('Greens', axis=1)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

#02 | The Decision Tree Classifier & Supervised Classification Models

Jesús López — Fri, 06 May 2022 14:29:08 GMT

Jesús López 2022

Don't miss out on his posts on LinkedIn to become a more efficient Python developer.

Introduction to Supervised Classification Models

Machine Learning is a field that focuses on getting a mathematical equation to make predictions, although not all Machine Learning models work the same way.

Which types of Machine Learning models can we distinguish so far?

Classifiers to predict Categorical Variables
Regressors to predict Numerical Variables

The previous chapter covered the explanation of a Regressor model: Linear Regression.

This chapter covers the explanation of a Classification model: the Decision Tree.

Why do they belong to Machine Learning?

The Machine wants to get the best numbers of a mathematical equation such that the difference between reality and predictions is minimal:
- A Classifier evaluates the model based on the prediction success rate $y \stackrel{?}{=} \hat{y}$
- A Regressor evaluates the model based on the distance between real data and predictions (residuals) $y - \hat{y}$

There are many Machine Learning Models of each type.

You don't need to know the process behind each model because they all work the same way (see article). In the end, you will choose the one that makes better predictions.

This tutorial will show you how to develop a Decision Tree to calculate the probability of a person surviving the Titanic and the different evaluation metrics we can calculate on Classification Models.

Table of Important Content

🛀 How to preprocess/clean the data to fit a Machine Learning model?
- Dummy Variables
- Missing Data
🤩 How to visualize a Decision Tree model in Python step by step?
🤔 How to interpret the nodes and leaf's values of a Decision Tree plot?
How to evaluate Classification models?
- Accuracy
- Confusion Matrix
  - Sensitivity
  - Specificity
  - ROC Curve
🏁 How to compare Classification models to choose the best one?

Load the Data

This dataset represents people (rows) aboard the Titanic
And their sociological characteristics (columns)

import seaborn as sns #!
import pandas as pd

df_titanic = sns.load_dataset(name='titanic')[['survived', 'sex', 'age', 'embarked', 'class']]
df_titanic

How do we compute a Decision Tree Model in Python?

We should know from the previous chapter that we need a function accessible from a Class in the library sklearn.

Import the Class

from sklearn.tree import DecisionTreeClassifier

Instantiate the Class

We create an instance, a working copy built from the Class blueprint, so that we never "modify" the source code.

model_dt = DecisionTreeClassifier()

Access the Function

The theoretical action we'd like to perform is the same as we executed in the previous chapter. Therefore, the function should be called the same way:

model_dt.fit()

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

/var/folders/24/tg28vxls25l9mjvqrnh0plc80000gn/T/ipykernel_3553/3699705032.py in ----> 1 model_dt.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

Why is it asking for two parameters: y and X?

y: target ~ dependent ~ label ~ class variable
X: explanatory ~ independent ~ feature variables

Separate the Variables

target = df_titanic['survived']
explanatory = df_titanic.drop(columns='survived')

Fit the Model

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: could not convert string to float: 'male'

Most of the time, the data isn't prepared to fit the model. So let's dig into why we got the previous error in the following sections.

Data Preprocessing

The error says:

ValueError: could not convert string to float: 'male'

From which we can interpret that the function .fit() does not accept values of string type like the ones in sex column:

df_titanic

Dummy Variables

Therefore, we need to convert the categorical columns to dummies (0s & 1s):

pd.get_dummies(df_titanic, drop_first=True)

df_titanic = pd.get_dummies(df_titanic, drop_first=True)
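To make the idea of dummy variables concrete, here is a minimal plain-Python sketch of what happens to one categorical column. The helper dummies() and the toy values are hypothetical; pd.get_dummies() handles many more cases (missing values, categorical dtypes, whole DataFrames).

```python
def dummies(name, values, drop_first=True):
    """Return one 0/1 column per category of `values`."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category is implied when all remaining columns are 0
        categories = categories[1:]
    return {
        f'{name}_{cat}': [1 if value == cat else 0 for value in values]
        for cat in categories
    }

# Hypothetical toy values
sex = ['male', 'female', 'female', 'male']
embarked = ['S', 'C', 'S', 'Q']

print(dummies('sex', sex))
print(dummies('embarked', embarked))
```

With drop_first=True, a column with two categories becomes a single 0/1 column (sex_male), and a column with three categories becomes two (embarked_Q and embarked_S).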

We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')
target = df_titanic[['survived']]

Fit the Model Again

Now we should be able to fit the model:

model_dt.fit(X=explanatory, y=target)

---------------------------------------------------------------------------

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Missing Data

The data passed to the function contains missing data (NaN). Precisely, there are 177 people whose age we don't know:

df_titanic.isna()

df_titanic.isna().sum()

survived          0
age             177
sex_male          0
embarked_Q        0
embarked_S        0
class_Second      0
class_Third       0
dtype: int64

Who are the people who lack the information?

mask_na = df_titanic.isna().sum(axis=1) > 0

df_titanic[mask_na]

What could we do with them?

1. Drop the people (rows) whose age is missing from the dataset.
2. Fill the age with the average age of similar people (like males who survived).
3. Apply an algorithm to fill them.

We'll choose option 1 to simplify the tutorial.
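For reference, filling the age with a group average could look like the following minimal sketch. It uses plain Python and hypothetical toy rows grouped by sex; it is not the approach followed in the rest of this tutorial.

```python
from statistics import mean

# Hypothetical toy rows; None marks a missing age
people = [
    {'sex': 'male', 'age': 22.0},
    {'sex': 'female', 'age': 38.0},
    {'sex': 'male', 'age': None},
    {'sex': 'female', 'age': 26.0},
    {'sex': 'male', 'age': 30.0},
]

# Average known age per group (here, grouped by sex)
known = {}
for person in people:
    if person['age'] is not None:
        known.setdefault(person['sex'], []).append(person['age'])
group_mean = {sex: mean(ages) for sex, ages in known.items()}

# Replace each missing age with the average of its group
filled = [
    {**person, 'age': group_mean[person['sex']]} if person['age'] is None else person
    for person in people
]
print(filled[2])
```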

Therefore, we go from 891 people:

df_titanic

To 714 people:

df_titanic.dropna()

df_titanic = df_titanic.dropna()

We separate the variables again to take into account the latest modification:

explanatory = df_titanic.drop(columns='survived')target = df_titanic['survived']

Now we shouldn't have any more trouble with the data to fit the model.

Fit the Model Again

We don't get any errors because we have correctly preprocessed the data for the model.

Once the model is fitted, we may observe that the object contains more attributes because it has calculated the best numbers for the mathematical equation.

model_dt.fit(X=explanatory, y=target)
model_dt.__dict__

{'criterion': 'gini', 'splitter': 'best', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': None, 'max_leaf_nodes': None, 'random_state': None, 'min_impurity_decrease': 0.0, 'class_weight': None, 'ccp_alpha': 0.0, 'feature_names_in_': array(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype=object), 'n_features_in_': 6, 'n_outputs_': 1, 'classes_': array([0, 1]), 'n_classes_': 2, 'max_features_': 6, 'tree_': }


Predictions

Calculate Predictions

We have a fitted DecisionTreeClassifier. Therefore, we should be able to apply the mathematical equation to the original data to get the predictions:

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718], [0.05660377, 0.94339623], [0.53921569, 0.46078431], [0.05660377, 0.94339623], [0.82051282, 0.17948718]])

Add a New Column with the Predictions

Let's create a new DataFrame to keep the information of the target and predictions to understand the topic better:

df_pred = df_titanic[['survived']].copy()

And add the predictions:

df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:,1]
df_pred

How have we calculated those predictions?

Model Visualization

The Decision Tree model doesn't have a mathematical equation as such. Instead, it is a set of conditions represented as a tree:

from sklearn.tree import plot_tree

plot_tree(decision_tree=model_dt);

There are many conditions; let's recreate a shorter tree to explain the Mathematical Equation of the Decision Tree:

model_dt = DecisionTreeClassifier(max_depth=2)
model_dt.fit(X=explanatory, y=target)
plot_tree(decision_tree=model_dt);

Let's make the image bigger:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt);

The conditions look like X[1]<=0.5. The X[1] means the 2nd variable (Python starts counting at 0) of the explanatory ones. If we'd like to see the names of the columns, we need to add the feature_names parameter:

explanatory.columns

Index(['age', 'sex_male', 'embarked_Q', 'embarked_S', 'class_Second', 'class_Third'], dtype='object')

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns);

Let's add some colours to see how the predictions will go based on the fulfilled conditions:

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);

How does the Decision Tree Algorithm compute the Mathematical Equation?

The Decision Tree and the Linear Regression algorithms look for the best numbers in a mathematical equation. The following video explains how the Decision Tree configures the equation:

https://www.youtube.com/watch?v=_L39rN6gz7Y

Model Interpretation

Let's take a person from the data to explain how the model makes a prediction. For storytelling, let's say the person's name is John.

John is a 22-year-old man who boarded the Titanic in 3rd class but didn't survive:

df_titanic[:1]

To calculate the chances of survival in a person like John, we pass the explanatory variables of John:

explanatory[:1]

To the function .predict_proba() and get a probability of 17.95%:

model_dt.predict_proba(X=explanatory[:1])

array([[0.82051282, 0.17948718]])

But wait, how did we get to the probability of survival of 17.95%?

Let's explain it step-by-step with the Decision Tree visualization:

plt.figure(figsize=(10,6))
plot_tree(decision_tree=model_dt, feature_names=explanatory.columns, filled=True);

Based on the tree, the conditions are:

1st condition

sex_male (John=1) <= 0.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

2nd condition

age (John=22.0) <= 6.5 ~ False

John doesn't fulfil the condition; we move to the right side of the tree.

Leaf

The final node, the leaf, tells us that the training dataset contained 429 males older than 6.5 years.

Out of the 429, 77 survived, but 352 didn't make it.

Therefore, the chances of John surviving according to our model are 77 divided by 429:

77/429

0.1794871794871795

We get the same probability; John had a 17.95% chance of surviving the Titanic accident.
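Each node of the plotted tree also shows a gini value, since the model was fitted with the default criterion='gini'. As a minimal sketch, both numbers for John's leaf can be recomputed from the two counts alone:

```python
# Counts read from the leaf that John reaches
n_died, n_survived = 352, 77
n_total = n_died + n_survived

p_died = n_died / n_total
p_survived = n_survived / n_total

# Gini impurity: 0 for a pure leaf, 0.5 for a 50/50 leaf with two classes
gini = 1 - (p_died**2 + p_survived**2)

print(round(p_survived, 4), round(gini, 4))
```

The lower the gini value, the more a leaf is dominated by one class, and the more confident the prediction for people who land there.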

Model's Score

Calculate the Score

As always, we should have a function to calculate the goodness of the model:

model_dt.score(X=explanatory, y=target)

0.8025210084033614

The model can correctly predict 80.25% of the people in the dataset.

What's the reasoning behind the model's evaluation?

The Score Step-by-step

As we saw earlier, the classification model calculates the probability for an event to occur. The function .predict_proba() gives us two probabilities in the columns: people who didn't survive (0) and people who survived (1).

model_dt.predict_proba(X=explanatory)[:5]

array([[0.82051282, 0.17948718], [0.05660377, 0.94339623], [0.53921569, 0.46078431], [0.05660377, 0.94339623], [0.82051282, 0.17948718]])

We take the positive probabilities in the second column:

df_pred['pred_proba_dt'] = model_dt.predict_proba(X=explanatory)[:, 1]

To compare reality (0s and 1s) with the predictions (probabilities), we need to turn probabilities higher than 0.5 into 1, and the rest into 0.

import numpy as np

df_pred['pred_dt'] = np.where(df_pred.pred_proba_dt > 0.5, 1, 0)
df_pred

The simple idea of the accuracy is to get the success rate on the classification: how many people do we get right?

We compare if the reality is equal to the prediction:

comp = df_pred.survived == df_pred.pred_dt
comp

0       True
1       True
       ...
889    False
890     True
Length: 714, dtype: bool

If we sum the boolean Series, Python will take True as 1 and False as 0 to compute the number of correct classifications:

comp.sum()

573

We get the score by dividing the successes by all possibilities (the total number of people):

comp.sum()/len(comp)

0.8025210084033614

It is also correct to do the mean on the comparisons because it's the sum divided by the total. Observe how you get the same number:

comp.mean()

0.8025210084033614

But it's more efficient to calculate this metric with the function .score():

model_dt.score(X=explanatory, y=target)

0.8025210084033614

The Confusion Matrix to Compute Other Classification Metrics

Can we conclude that our model is 80.25% good and be happy with it?

We should not, because we might be interested in the accuracy of each class (survived or not) separately. But first, we need to compute the confusion matrix:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)
CM = ConfusionMatrixDisplay(cm)
CM.plot();

Looking at the first number of the confusion matrix, we have 407 people who didn't survive the Titanic, both in reality and in the predictions.
That is not the case with the number 17: our model classified 17 people as survivors when they didn't survive.
The success rate of the negative class, people who didn't survive, is called the specificity: $407/(407+17)$.
Whereas the success rate of the positive class, people who did survive, is called the sensitivity: $166/(166+124)$.

Specificity (Recall=0)

cm[0,0]

407

cm[0,:]

array([407, 17])

cm[0,0]/cm[0,:].sum()

0.9599056603773585

specificity = cm[0,0]/cm[0,:].sum()

Sensitivity (Recall=1)

cm[1,1]

166

cm[1,:]

array([124, 166])

cm[1,1]/cm[1,:].sum()

0.5724137931034483

sensitivity = cm[1,1]/cm[1,:].sum()
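The same four numbers of the confusion matrix are enough to derive the other usual metrics. The following minimal sketch uses a plain nested list holding the values obtained above, so it runs without scikit-learn:

```python
# Confusion matrix from the tutorial: rows are reality, columns are predictions
cm = [[407, 17],
      [124, 166]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
specificity = tn / (tn + fp)   # recall of class 0
sensitivity = tp / (tp + fn)   # recall of class 1
precision_0 = tn / (tn + fn)   # of those predicted 0, how many were 0
precision_1 = tp / (tp + fp)   # of those predicted 1, how many were 1

for name, value in [('accuracy', accuracy), ('specificity', specificity),
                    ('sensitivity', sensitivity), ('precision_0', precision_0),
                    ('precision_1', precision_1)]:
    print(f'{name}: {value:.4f}')
```

Recall looks at the rows of the matrix (what happened in reality), whereas precision looks at the columns (what the model predicted).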

Classification Report

We could have gotten the same metrics using the function classification_report(). Look at the recall (column) of rows 0 and 1: specificity and sensitivity, respectively:

from sklearn.metrics import classification_report

report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt
)
print(report)

              precision    recall  f1-score   support

           0       0.77      0.96      0.85       424
           1       0.91      0.57      0.70       290

    accuracy                           0.80       714
   macro avg       0.84      0.77      0.78       714
weighted avg       0.82      0.80      0.79       714

We can also create a nice DataFrame to later use the data for simulations:

report = classification_report(
    y_true=df_pred.survived,
    y_pred=df_pred.pred_dt,
    output_dict=True
)
pd.DataFrame(report)

Our model is not as good as we thought when it comes to predicting the people who survived: we only get 57.24% of the survivors right.

How can we then assess a reasonable rate for our model?

ROC Curve

Watch the following video to understand why the Area Under the Curve (AUC) is a good metric: it summarises the trade-off between sensitivity and specificity across all classification thresholds, not just 0.5:

https://www.youtube.com/watch?v=4jRBRDbJemM

We compute this metric in Python as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

y = df_pred.survived
pred = model_dt.predict_proba(X=explanatory)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y, pred)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()
plt.show()

roc_auc

0.8205066688353937
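To demystify this number, here is a minimal plain-Python sketch of the idea behind it: we lower the threshold step by step, record the false positive and true positive rates at each step, and measure the area under the resulting curve with the trapezoid rule. The function roc_auc() and the toy data are hypothetical, and this is not how scikit-learn implements it internally.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoid rule."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit observations from the highest to the lowest score
    ranked = sorted(zip(scores, y_true), reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only record a point once all observations tied at this score are counted
        is_last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / n_neg, tp / n_pos))

    # Trapezoid rule over consecutive points
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy data: real classes and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

An AUC of 1 means the model ranks every survivor above every non-survivor, while 0.5 is what random guessing would achieve.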

Other Classification Models

Let's build other classification models by applying the same functions. In the end, computing Machine Learning models is the same thing all the time.

`RandomForestClassifier()` in Python

Fit the Model

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X=explanatory, y=target)

RandomForestClassifier()

Calculate Predictions

df_pred['pred_rf'] = model_rf.predict(X=explanatory)
df_pred

Model's Score

model_rf.score(X=explanatory, y=target)

0.9117647058823529

`SVC()` in Python

Fit the Model

from sklearn.svm import SVC

model_sv = SVC()
model_sv.fit(X=explanatory, y=target)

SVC()

Calculate Predictions

df_pred['pred_sv'] = model_sv.predict(X=explanatory)
df_pred

Model's Score

model_sv.score(X=explanatory, y=target)

0.6190476190476191

Which One Is the Best Model? Why?

To simplify the explanation, we use accuracy as the metric to compare the models. The Random Forest is the best model, with an accuracy of 91.18%. Bear in mind that these scores are measured on the same data we used to fit the models.

model_dt.score(X=explanatory, y=target)

0.8025210084033614

model_rf.score(X=explanatory, y=target)

0.9117647058823529

model_sv.score(X=explanatory, y=target)

0.6190476190476191

df_pred.head(10)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Django Tutorial | Not a Movie, but a Python Framework to Create Websites

Jesús López — Thu, 06 Jan 2022 17:30:08 GMT

See the code of the tutorial in this GitHub repo.

Set up your Machine

What do I need to start a Django project?

It is recommended that you create a new environment
And that you have Anaconda installed. If not, click here to download & install
You need to install the library in your terminal (use Anaconda Prompt for Windows Users):
```
  conda create -n django_env django
  conda activate django_env
```

Start the Django Project

Ok, you got it. What's next?

Open a Code Editor application to work more comfortably with the project
I use Visual Studio Code (aka VSCode), you may download & install it here

What should I do within VSCode?

You will use the Django CLI installed with the Django package already
To create the standard folders and files you need for the application
Type the following line within the terminal:
```
  django-admin startproject shop
```

What should I see on my computer after this?

If you open your user folder, you will see that
A folder shop has been created
Drag & drop it into VSCode
Now check the folder structure and familiarize yourself with the files & folders
- The folder structure should look like this

- shop/
    - manage.py
    - shop/
        - __init__.py
        - settings.py
        - urls.py
        - asgi.py
        - wsgi.py

Do I need to study all of them?

No, just go with the flow, and you'll get to understand everything at the end

See the Default Django Website

Ok, what's the next step?

You'll probably want to see your Django App up and running, right?
Then, go over the terminal and write the following

cd shop
python manage.py runserver

A local server has started at http://127.0.0.1:8000/, which references the localhost
Open it in a web browser, and you should see something like this

What if I try another URL like http://127.0.0.1:8000/products?

You will receive an error because
You didn't tell Django what to do when you go to http://127.0.0.1:8000/products

How can I tell that to Django?

Create an App within the `shop` Django Project

With the following line of code
```
  python manage.py startapp products
```

The URL

Create a URL within the file shop > urls.py

from django.contrib import admin
from django.urls import path, include # modified

urlpatterns = [
    path('products/', include('products.urls')), # added
    path('admin/', admin.site.urls),
]

The View

Create a View (HTML Code) to be recognised when you go to the URL http://127.0.0.1:8000/products
Within the file shop > products > views.py

from django.http import HttpResponse

def view_for_products(request):
    return HttpResponse("This function will render `HTML` code that makes you see this text in red.")

See this tutorial if you want to know a bit more about HTML
Call the function view_for_products when you click on http://127.0.0.1:8000/products
You need to create the file urls.py within the products folder: shop > products > urls.py

from django.urls import path
from . import views

urlpatterns = [
    path('', views.view_for_products, name='index'),
]

Connecting Project URLs with App URLs

Why do we reference the URLs in two files? One in shop/urls.py and the other in products/urls.py?

It is a best practice to have a Django project separated by different Apps
In this case, we created the products App
In the file shop/urls.py, you reference the URLs of products/urls.py here

urlpatterns = [
    path('products/', include('products.urls')), #here
    path('admin/', admin.site.urls),
]

So that at the time you navigate to http://127.0.0.1:8000/products
You will have access to the URLs defined in shop/products/urls.py
For example, let's create another View in shop/products/views.py

def new_view(request):
    return HttpResponse('This is the new view')

And reference it in the file shop/products/urls.py

from django.urls import path
from . import views

urlpatterns = [
    path('', views.view_for_products, name='index'),
    path('pepa', views.new_view, name='pepa'), # new url
]

We don't need to reference the View in shop/urls.py since
we can access all URLs in shop/products/urls.py at the time we wrote
include('products.urls') in the file shop/urls.py
Try to go to http://127.0.0.1:8000/products/pepa

Summary

So, each time I want to create a different HTML, do I need to create a View?

Yes, it's how the Model View Template (MVT) works
You introduce a URL
The URL activates a View
And HTML code gets rendered in the website
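The flow from URL to View can be pictured with a small plain-Python model. This is only a toy sketch to illustrate how the project-level and app-level routes chain together; the dictionaries and the resolve() function are hypothetical and are not how Django implements its URL dispatcher.

```python
# Toy stand-ins for views: each one takes a request and returns the response body
def view_for_products(request):
    return 'Products index'

def new_view(request):
    return 'This is the new view'

# App-level routes, playing the role of products/urls.py
products_urls = {'': view_for_products, 'pepa': new_view}

# Project-level routes, playing the role of shop/urls.py with include()
project_urls = {'products/': products_urls}

def resolve(path):
    """Find the view for a path such as 'products/pepa', or None if nothing matches."""
    for prefix, app_urls in project_urls.items():
        if path.startswith(prefix):
            return app_urls.get(path[len(prefix):])
    return None

view = resolve('products/pepa')
print(view(request=None))
```

The project only knows the prefix products/, and everything after it is looked up in the routes of the app, which is why new Views only need to be registered in products/urls.py.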

Why don't you mention anything about the model?

Well, that's something to cover in the following article 🔥 COMING SOON!

Any doubts?

Let me know in the comments; I'd be happy to help!

#06 | The Principal Component Analysis (PCA) & Dimensionality Reduction Techniques

Jesús López — Mon, 08 Nov 2021 22:13:21 GMT

Jesús López 2022

Ask him any questions on Twitter or LinkedIn

Chapter Importance

We used just two variables out of the seven we had in the whole DataFrame.

We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been harder to plot seven variables with seven axes in a graph.

Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?

Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).

We'll develop the code to show you the need for dimensionality reduction techniques. Specifically, the Principal Component Analysis (PCA).

Load the Data

Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce car accidents.

You won't create 51 TV campaigns, one for each of the States of the USA (rows). Instead, you will see which States behave similarly to cluster them into 3 groups based on the variation across their features (columns).

import seaborn as sns #!

df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes

Check this website to understand the measures of the following data.

Data Preprocessing

From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.

For example, gaining 1kg of weight is not the same as gaining 1m of height.

We will use the StandardScaler() algorithm:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]

array([[ 0.73744574,  1.1681476 ,  0.43993758,  1.00230055,  0.27769155,
        -0.58008306,  0.4305138 ],
       [ 0.56593556,  1.2126951 , -0.21131068,  0.60853209,  0.80725756,
         0.94325764, -0.02289992],
       [ 0.68844283,  0.75670887,  0.18761539,  0.45935701,  1.03314134,
         0.0708756 , -0.98177845],
       [ 1.61949811, -0.48361373,  0.54740815,  1.67605228,  1.95169961,
        -0.33770122,  0.32112519],
       [-0.92865317, -0.39952407, -0.8917629 , -0.594276  , -0.89196792,
        -0.04841772,  1.26617765]])

Let's turn the array into a DataFrame for better understanding:

import pandas as pd

df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled

Now we see all the variables having the same scale (i.e., around the same limits):

df_scaled.agg(['min', 'max'])
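With its default settings, StandardScaler() subtracts the mean of each column and divides by its standard deviation. A minimal plain-Python sketch of that transformation, using a hypothetical standardize() helper and toy columns, could be:

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(value - mu) / sigma for value in values]

# Hypothetical toy columns measured on very different scales
weight_kg = [60.0, 70.0, 80.0]
height_m = [1.60, 1.70, 1.80]

print(standardize(weight_kg))
print(standardize(height_m))
```

Both columns end up with practically the same values, although one was measured in kilograms and the other in metres. That is what allows the model to compare them.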

k-Means Model in Python

We follow the usual Scikit-Learn procedure to develop Machine Learning models.

Import the Class

from sklearn.cluster import KMeans

Instantiate the Class

model_km = KMeans(n_clusters=3)

Fit the Model

model_km.fit(X=df_scaled)

KMeans(n_clusters=3)

Calculate Predictions

model_km.predict(X=df_scaled)

array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
       2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1], dtype=int32)
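Each number is the cluster whose centre lies closest to that State. As a minimal plain-Python sketch of this assignment step, with hypothetical centres and points in two dimensions instead of seven:

```python
import math

def nearest_centroid(point, centroids):
    """Return the position of the centroid closest to the point."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))

# Hypothetical cluster centres and observations
centroids = [(0.0, 0.0), (2.0, 2.0), (-2.0, 1.0)]
points = [(0.1, -0.2), (1.8, 2.5), (-1.5, 0.9)]

labels = [nearest_centroid(point, centroids) for point in points]
print(labels)
```

The cluster numbers themselves carry no meaning; they only say which points belong together.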

Create a New DataFrame for the Predictions

df_pred = df_scaled.copy()

Create a New Column for the Predictions

df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred

Visualize the Model

Now let's visualize the clusters with a 2-axis plot:

sns.scatterplot(x='total', y='speeding', hue='pred',
                data=df_pred, palette='Set1');

Model Interpretation

Does the visualization make sense?

No, because the clusters should separate their points from others. Nevertheless, we see some green points in the middle of the blue cluster.

Why is this happening?

We are just representing 2 variables where the model was fitted with 7 variables. We can't see the points separated as we miss 5 variables in the plot.

Why don't we add 5 variables to the plot then?

We could, but it'd be way too hard to interpret.

Then, what could we do?

We can apply PCA, a dimensionality reduction technique. Take a look at the following video to understand this concept:

https://www.youtube.com/watch?v=HMOI_lkzW08

Grouping Variables with `PCA()`

Transform Data to Components

PCA() is another technique used to transform data.

How has the data been manipulated so far?

Original Data df_crashes

df_crashes

Normalized Data df_scaled

df_scaled

Principal Components Data df_pca (now)

from sklearn.decomposition import PCA

pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]

array([[ 1.60367129,  0.13344927,  0.31788093, -0.79529296, -0.57971878,
         0.04622256,  0.21018495],
       [ 1.14421188,  0.85823399,  0.73662642,  0.31898763, -0.22870123,
        -1.00262531,  0.00896585],
       [ 1.43217197, -0.42050562,  0.3381364 ,  0.55251314,  0.16871805,
        -0.80452278, -0.07610742],
       [ 2.49158352,  0.34896812, -1.78874742,  0.26406388, -0.37238226,
        -0.48184939, -0.14763646],
       [-1.75063825,  0.63362517, -0.1361758 , -0.97491605, -0.31581147,
         0.17850962, -0.06895829]])

df_pca = pd.DataFrame(data_pca)
df_pca

cols_pca = [f'PC{i}' for i in range(1, pca.n_components_+1)]
cols_pca

['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']

df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca

Visualize Components & Clusters

Let's visualize a scatterplot with PC1 & PC2 and colour points by cluster:

import plotly.express as px

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)

Are they mixed now?

No, they aren't.

That's because PC1 and PC2 together represent almost 80% of the variability of the original seven variables.

You can see the following array, where every element represents the amount of variability explained by every component:

pca.explained_variance_ratio_

array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011   ,
       0.02837999, 0.00392491])

And the accumulated variability (79.88% until PC2):

pca.explained_variance_ratio_.cumsum()

array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 ,
       0.99607509, 1.        ])
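The accumulated variability is also a handy way to decide how many components to keep. The following minimal sketch reuses the ratios printed above and assumes a hypothetical target of 90% of the variability:

```python
from itertools import accumulate

# Explained variance ratios reported by the fitted PCA in this tutorial
ratios = [0.57342168, 0.22543042, 0.07865743, 0.05007557,
          0.04011, 0.02837999, 0.00392491]

cumulative = list(accumulate(ratios))

# Hypothetical target: keep at least 90% of the variability
target = 0.90
n_components = next(i for i, total in enumerate(cumulative, start=1)
                    if total >= target)

print(round(cumulative[1], 4), n_components)
```

Two components are enough for a plot that keeps about 80% of the information, whereas reaching the hypothetical 90% target would take four.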

Which variables represent these two components?

Relationship between Original Variables & Components

Loading Vectors

The Principal Components are produced by a mathematical equation (once again), which is composed of the following weights:

df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights

We can observe that:

Socio-demographical features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (higher influence) in PC1.
Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2.

Principal Components is a technique that gathers the maximum variability of a set of features (variables) into Components.

Therefore, the first two Principal Components capture a good amount of the common variability because we see two sets of variables that are correlated with each other:

Correlation Matrix

df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);

I hope that everything is making sense so far.

To round off the explanation, you can see below how the df_pca values are computed:

Calculating One PCA Value

For example, we can multiply the weights of PC1 by the scaled variables for Alabama (AL) and sum the results:

(df_weights['PC1']*df_scaled.loc['AL']).sum()

1.6036712920638672

This gives us the transformed value of the Principal Component 1 for the State of Alabama:

df_pca.head()

The same operation applies to any value of df_pca.

PCA & Cluster Interpretation

Now, let's go back to the PCA plot:

px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))

How can we interpret the clusters with the components?

Let's add information to the points thanks to the interactive plots from the plotly library:

hover = '''%{customdata[0]}

PC1: %{x}
Total: %{customdata[1]}
Alcohol: %{customdata[2]}

PC2: %{y}
Ins Losses: %{customdata[3]}
Ins Premium: %{customdata[4]}'''

fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
                 color=df_pred.pred.astype(str),
                 hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,
                             df_crashes.ins_losses, df_crashes.ins_premium])
fig.update_traces(hovertemplate=hover)

If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the min and max values across socio-demographical features:

df_crashes.agg(['min', 'max'])

df_crashes.loc[['DC', 'SC'],:]

Apply the same reasoning over the two most extreme points along the y-axis. You will see the same for the insurance variables because they determine the positioning of the PC2 (y-axis).

df_crashes.agg(['min', 'max'])

df_crashes.loc[['ID', 'LA'],:]

Is there a way to represent the weights of the original data for the Principal Components and the points?

That's called a Biplot, which we will see next.

Biplot

We can observe how the points are positioned along the loading vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.

import numpy as np

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Explained variance of each component as a percentage, for the axis labels
evr = (pca.explained_variance_ratio_ * 100).round(1)

fig = px.scatter(df_pca, x='PC1', y='PC2',
                 color=model_km.labels_.astype(str),
                 hover_name=df_pca.index,
                 labels={
                     'PC1': f'PC1 ~ {evr[0]}%',
                     'PC2': f'PC2 ~ {evr[1]}%'
                 })

# Draw one red line and one label per original variable
for i, feature in enumerate(df_scaled.columns):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1],
        line=dict(color="red", width=3)
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )

fig.show()
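To see what the scaled loadings in the biplot mean, here is a minimal sketch on invented data. It assumes the variables are standardised (mean 0, sample standard deviation 1) and uses NumPy's SVD as a stand-in for the fitted pca object. Under that assumption, each scaled loading equals the correlation between an original variable and a component.

```python
import numpy as np

# Made-up data: 50 observations of 4 variables, two of them related
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]

# Standardise: mean 0 and sample standard deviation 1 per column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA through SVD: rows of Vt are the component weights
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_variance = S**2 / (len(Z) - 1)

# Same scaling as in the biplot: weights times the component's standard deviation
loadings = Vt.T * np.sqrt(explained_variance)

# Correlation between every variable (rows) and every component (columns)
scores = Z @ Vt.T
n_vars = Z.shape[1]
corr = np.array([[np.corrcoef(Z[:, i], scores[:, j])[0, 1]
                  for j in range(n_vars)]
                 for i in range(n_vars)])

print(np.allclose(loadings, corr))
```

This is why long arrows pointing along an axis can be read as variables strongly correlated with that component.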

Conclusion

Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great for grouping variables that behave similarly and later visualising many variables in just one component.

In short, you are simplifying the information in the data. In this example, we go from seven dimensions to only two for plotting. This doesn't come for free, though: the two components explain only around 80% of the data's original variability.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Tutorial | Machine Learning Model Deployment

Jesús López — Wed, 03 Nov 2021 15:28:22 GMT

We already know that a Machine Learning Model is a mathematical formula to calculate something

https://twitter.com/sotastica/status/1449735653328031745

Machine Learning Models are deployed to, for example:

Detect objects within an image (Tesla) so that the car can take action
Recommend songs to a user (Spotify) so that they fall in love with the service
Rank the posts you are most likely to interact with (Facebook or Twitter) so that you spend more time on the app

If you just care about getting the code to make this happen, you can forget the storytelling and get right into those lines in GitHub

If you want to follow the tutorial and understand the topic in depth, let's get started

Let's say that we are a car sales company and we want to make things easier for clients when they decide which car to buy.

They usually don't want a car that consumes lots of fuel (a low mpg, miles per gallon).

Nevertheless, they won't know this until they use the car.

Is there a way to predict the consumption based on other characteristics of the car?

Yes, with a mathematical formula, for example:

consumption = 2 + 3 * acceleration + 2.1 * horsepower

We have historical data from all the car models we have sold over the past few years.

We could use this data to calculate the BEST mathematical formula.

And deploy it to a website with a form so that clients can answer the consumption question by themselves.

To make this happen, we will follow the structure:

Create ML Model Object in Python
Create an HTML Form
Create Flask App
Deploy to Heroku
Visit Website and Make a Prediction

Create ML Model Object in Python

Import Data to Python

This dataset contains information about car models (rows)
For which we have some characteristics (columns)

import seaborn as sns

df = sns.load_dataset(name='mpg', index_col='name')[['acceleration', 'weight', 'mpg']]
df.sample(5)

	acceleration	weight	mpg
name
subaru	17.8	2065	32.3
bmw 2002	12.5	2234	26.0
audi 5000	15.9	2830	20.3
toyota corolla 1200	21.0	1836	32.0
ford gran torino (sw)	16.0	4638	14.0

Linear Regression Model from Historical Data

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X=df[['acceleration', 'weight']], y=df['mpg'])
model.__dict__

{'fit_intercept': True, 'normalize': False, 'copy_X': True, 'n_jobs': None, 'positive': False, 'n_features_in_': 2, 'coef_': array([ 0.25081589, -0.00733564]), '_residues': 7317.984100916719, 'rank_': 2, 'singular_': array([16873.21840634,    49.92970477]), 'intercept_': 41.39982830200016}

And the BEST mathematical formula is:

consumption = 41.39 + 0.25 * acceleration - 0.0073 * weight
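To make the formula concrete, here is a minimal sketch that applies it by hand, with the coefficients copied from the model output above. The function name predict_mpg is just an illustrative helper, and the inputs are the acceleration and weight of the bmw 2002 row in the sample table.

```python
# Coefficients copied from the fitted model's output above
INTERCEPT = 41.39982830200016
COEF_ACCELERATION = 0.25081589
COEF_WEIGHT = -0.00733564


def predict_mpg(acceleration, weight):
    """Apply the linear formula: intercept + one term per variable."""
    return INTERCEPT + COEF_ACCELERATION * acceleration + COEF_WEIGHT * weight


# bmw 2002 from the sample table: acceleration 12.5, weight 2234
prediction = predict_mpg(acceleration=12.5, weight=2234)
print(round(prediction, 2))
```

The result is close to, but not exactly, the 26.0 mpg recorded for that car, which is what we should expect from an approximate formula.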

Save `LinearRegression()` into a File

The object LinearRegression() contains the Mathematical Formula
That we will use in the website to make the prediction

import pickle

with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(model, f)

Now a file called linear_regression_model.pkl should appear in the same folder as your script.
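Loading the file back is the mirror operation. The sketch below shows the full round trip with a plain dictionary standing in for the fitted model (an assumption made so the example runs without scikit-learn) and a temporary folder instead of your project folder.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model: any picklable Python object works the same way
model_stub = {'intercept': 41.4, 'coefficients': [0.25, -0.0073]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'linear_regression_model.pkl')

    # Save the object to disk as bytes ('wb' = write binary)
    with open(path, 'wb') as f:
        pickle.dump(model_stub, f)

    # Read it back ('rb' = read binary)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model_stub)
```

Keep in mind that pickle files can run code when loaded, so only open files that you created or trust.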

Create HTML form

All websites that you see online are displayed through an HTML file.

Therefore, we need to create an HTML file that contains a form for the user to input the data.

And calculate the prediction for the fuel consumption.

Website example here

Let's head over to a code editor (VSCode in my case) and create a new file called index.html

You may download Visual Studio Code (VSCode) here

That should contain the following lines:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form>
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input
        type="number"
        id="acceleration"
        name="acceleration"
        value="34"
      /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" value="12" /><br /><br />
      <input type="submit" value="Submit" />
    </form>
  </body>
</html>

If you open the file index.html in a browser, you will see the form and the submit button that is supposed to calculate the prediction.
Nevertheless, if you click it, nothing will happen, because we still need to develop the Flask application that sends the user input to the mathematical formula, calculates the prediction and returns it to the website.

Create Flask App

As we are going to deploy a whole application to a web server (Heroku), we need to create a dedicated environment with just the necessary packages.

Let's head over to the terminal and type the following commands:

python -m venv car_consumption_prediction
source car_consumption_prediction/bin/activate

Now let's install the required packages:

pip install flask
pip install scikit-learn

Now open the folder car_consumption_prediction in a code editor and create a new folder app with two other folders inside:

- app
    - model
    - templates

Then move the files we created before to their corresponding folders:

- app
    - model
        - linear_regression_model.pkl
    - templates
        - index.html

Now that we have the project structure, let's continue with the core functionality.

We will build a Python script that handles the user input and makes the prediction for fuel consumption.

So, create a new file within the app folder called app.py

PS: This is the most important file in a Flask app because it manages everything.

- app
    - model
        - linear_regression_model.pkl
    - templates
        - index.html
    - app.py

And add the following lines of code:

import pickle

import flask

# Load the trained model once, when the app starts
with open('model/linear_regression_model.pkl', 'rb') as f:
    model = pickle.load(f)

app = flask.Flask(__name__, template_folder='templates')


@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        # First visit: show the empty form
        return flask.render_template('index.html')

    # POST: form values arrive as text, so convert them to numbers
    acceleration = float(flask.request.form['acceleration'])
    weight = float(flask.request.form['weight'])

    input_variables = [[acceleration, weight]]
    prediction = model.predict(input_variables)[0]

    return flask.render_template('index.html',
                                 original_input={'Acceleration': acceleration,
                                                 'Weight': weight},
                                 result=prediction)


if __name__ == '__main__':
    app.run()

We need to pay attention to what's going on in the last return ...:
The function render_template() passes the objects in the parameters original_input and result to index.html.
Then, how can we use these variables in the file index.html?
Copy-paste the following lines of code into index.html:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <form action="{{ url_for('main') }}" method="POST">
      <label for="acceleration">Acceleration (m/s^2):</label><br />
      <input type="number" id="acceleration" name="acceleration" required /><br />
      <label for="weight">Weight (kg):</label><br />
      <input type="number" id="weight" name="weight" required /><br /><br />
      <input type="submit" value="Submit" />
    </form>
    <br />
    {% if result %}
    <p>
      The calculated fuel consumption is
      <span style="color: orange">{{result}}</span>
    </p>
    {% endif %}
  </body>
</html>

We made two changes to the file:

Specify the action to take when form is submitted:

 <form action="{{ url_for('main') }}" method="POST">

Show the prediction below the form

 {% if result %}
 <p>
   The calculated fuel consumption is
   <span style="color: orange">{{result}}</span>
 </p>
 {% endif %}

In this case, we had to use the conditional if to display result only when it exists, as result won't exist until the form is submitted and the server computes the prediction in app.py.
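One detail of app.py is worth a small sketch: the values in flask.request.form arrive as text, so they need converting to numbers before they go into the model. The helper below is hypothetical (the name parse_inputs is not part of Flask) and uses a plain dictionary in place of the real form object so that it runs on its own.

```python
def parse_inputs(form):
    """Turn the submitted form fields into one row of floats for model.predict()."""
    acceleration = float(form['acceleration'])
    weight = float(form['weight'])
    # scikit-learn expects a 2D structure: a list of rows
    return [[acceleration, weight]]


# A plain dict standing in for flask.request.form (values are strings)
submitted = {'acceleration': '12.5', 'weight': '2234'}
input_variables = parse_inputs(submitted)
print(input_variables)
```

If a user submits something that is not a number, float() raises a ValueError, so a real app would probably want to catch that and show a friendly message.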

Add the Procfile

I did some research about an error in which Heroku wasn't working the way I expected

And found that I needed to add a Procfile

Create a file in the folder app called Procfile (capital P, no file extension, which is the name Heroku looks for)

Write the following line and save the file:

 web: gunicorn app:app

The folder structure will now be:

 - app
     - model
         - linear_regression_model.pkl
     - templates
         - index.html
     - app.py
     - Procfile

Install the gunicorn package in the virtual environment. In terminal:
```
 pip install gunicorn
```

Deploy to Heroku

Now it's time to upload the application to Heroku so that anyone can get a prediction of fuel consumption given a car's acceleration and weight.

Create an Account in Heroku.
Download Heroku CLI
Create the Heroku App within the Terminal:

heroku create ml-model-deployment-car-mpg

This will be translated into a website called https://ml-model-deployment-car-mpg.herokuapp.com/

PS: You should use a different name instead of ml-model-deployment-car-mpg, because Heroku turns the app name into the URL and it has to be unique.

Commit the app files to your heroku hosting.

git init within car_consumption_prediction folder

Create a requirements.txt file with the instruction for required packages. You could automatically create this by:

 pip freeze > requirements.txt

The folder structure will now be:

 - app
     - model
         - linear_regression_model.pkl
     - templates
         - index.html
     - app.py
     - Procfile
     - requirements.txt

Add the files for commit.
```
 git add .
```

Commit the files and push them to the Heroku remote:

 git commit -m 'some random message'
 git push heroku master

That's all the technical aspect.

Now if some user would like to use the app...

Visit Website and Make a Prediction

Visit https://ml-model-deployment-car-mpg.herokuapp.com/
Enter some numbers in the form
Submit and watch the prediction

References

https://blog.cambridgespark.com/deploying-a-machine-learning-model-to-the-web-725688b851c7

Pepa, a Very Common Story

Jesús López — Sun, 03 Oct 2021 21:32:36 GMT

Your name is Pepa, and you are overwhelmed by expectations, having always believed that a degree would lead you to a job in which you could fulfil yourself. However, just when you are about to get there, you go back to training: training with career prospects, which challenges your knowledge and makes you grow as a professional.

That is when you start looking for the most prestigious master's degrees related to today's most in-demand jobs, and you find some related to Artificial Intelligence and Python, and others on Big Data or Machine Learning. With so much on offer, your curiosity is piqued and you weigh up the different opportunities. You discover an interest in programming that used to be hidden, since computing has never been your strong suit, and you choose a master's suited to your new interests. Of course, this master's will make you stand out in the job market, easing your way to the position you want.

Once you have chosen the course, you apply for enrolment and are accepted, which makes you very happy and gives you extra motivation to start your studies. At first you keep up with everything, but at some point things get harder than you expected and you find yourself truly lost. At this point, you look for help on Google and in the bibliography that your teachers keep recommending, and even so you don't overcome the problems.

The master's is worth it and you have to finish it, so you decide to find a private tutor to help you with the assignments and get through. At first you feel able to pass the assignments while learning the solutions; however, as the days go by, you realise that you don't have enough time to do both: the work piles up. This is where you get nervous and start to lose sight of learning, the reason you started this course. The certificate (titulitis, the obsession with credentials) takes centre stage in your head, and all you want at this point is access to the job prospects it offers, so, on top of the tuition fees, you spend a large amount of money on the tutor who helps you do the assignments. Meanwhile, you convince yourself that in the future you will study the content you are now leaving in other hands for lack of time, but your mantra becomes a vicious circle you cannot escape, since every job interview demands exercises before you get the position, and you find them impossible to complete.

Your hopes and dreams fade little by little.

After serious reflection, Pepa doesn't understand how she got into this situation; how she could have lost her confidence in so little time. What she doesn't know is that her case is very common, that she is not the only person who has ended up like this. We, on the other hand, do know, because it is the story of many of our clients; it is the problem of several people who have contacted Sotástica. For this, we have the teacher Jesús López, one of the best-rated private tutors in Spain. Jesús can help you better understand the Python programming language with a direct and dynamic application to Data Science. After teaching more than 300 people and with more than 3,000 hours of teaching over the last two years, he has developed a methodology that connects all the topics of Data Science and makes sure you understand them so that you can write code by yourself. The programme lasts around 25 hours, which combine explanatory sessions, tasks for the student and review sessions.

Click here to see his students' reviews.

On our website you can see the content of the programme to become a creative Data Scientist.

How to Analyze Data through Visualization

Jesús López — Sun, 03 Oct 2021 12:43:23 GMT

What is a plot?

A visual representation of the data

Which data? How is it usually structured?

In a table. For example:

import seaborn as sns

df = sns.load_dataset('mpg', index_col='name')
df.head()

How can you visualise this DataFrame?

We could make a point for every car based on
1. weight
2. mpg

sns.scatterplot(x='weight', y='mpg', data=df);

Which conclusions can you make out of this plot?

Well, you may observe that the points descend as we move to the right.
This means that heavier cars tend to travel fewer miles per gallon (mpg).

How can you measure this relationship?

Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X=df[['weight']], y=df.mpg)
model.__dict__

Resulting in

{'fit_intercept': True, 'normalize': False, 'copy_X': True, 'n_jobs': None, 'n_features_in_': 1, 'coef_': array([-0.00767661]), '_residues': 7474.8140143821, 'rank_': 1, 'singular_': array([16873.20281508]), 'intercept_': 46.31736442026565}

Which is the mathematical formula for this relationship?

$$mpg = 46.31 - 0.00767 \cdot weight$$

This equation means that the mpg gets 0.00767 units lower for every unit that weight increases.
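As a worked example of that sentence, the sketch below applies the equation by hand with the coefficients copied from the model output above. The function name is only illustrative, and 3449 is the weight of the Ford Torino in this dataset.

```python
# Coefficients copied from the fitted model's output above
INTERCEPT = 46.31736442026565
COEF_WEIGHT = -0.00767661


def predict_mpg(weight):
    """mpg = intercept + coefficient * weight."""
    return INTERCEPT + COEF_WEIGHT * weight


ford_torino = predict_mpg(3449)

# Same car with 100 extra units of weight
heavier = predict_mpg(3449 + 100)

print(round(ford_torino, 2))
print(round(ford_torino - heavier, 4))
```

Adding 100 units of weight lowers the predicted mpg by 100 times the coefficient, which is the meaning of the slope in a linear equation.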

Could you visualise this equation in a plot?

Absolutely, we could make the predictions from the original data and plot them.

Predictions

y_pred = model.predict(X=df[['weight']])

dfsel = df[['weight', 'mpg']].copy()
dfsel['prediction'] = y_pred
dfsel.head()

	weight	mpg	prediction
name
chevrolet chevelle malibu	3504	18.0	19.418523
buick skylark 320	3693	15.0	17.967643
plymouth satellite	3436	18.0	19.940532
amc rebel sst	3433	16.0	19.963562
ford torino	3449	17.0	19.840736

From this table, you can see that the predictions don't exactly match reality, but they approximate it.
For example, Ford Torino's mpg is 17.0, but our model predicts 19.84.

Model Visualization

sns.scatterplot(x='weight', y='mpg', data=dfsel)
sns.scatterplot(x='weight', y='prediction', data=dfsel);

The blue points represent the actual data.
The orange points represent the predictions of the model.

I teach Python, R, Statistics & Data Science. I like to produce content that helps people to understand these topics better.
Feel free and welcomed to give me feedback as I would like to make my tutorials clearer and generate content that interests you 🤗
You can see my Tutor Profile here if you need Private Tutoring lessons.

Why do all Machine Learning models follow the same steps?

Jesús López — Fri, 01 Oct 2021 12:47:17 GMT

Introduction

It's tough to find things that always work the same way in programming.

The steps of a Machine Learning (ML) model can be an exception.

Each time we want to compute a model (mathematical equation) and make predictions with it, we always follow the same steps:

model.fit() to compute the numbers of the mathematical equation.
model.predict() to calculate predictions through the mathematical equation.
model.score() to measure how good the model's predictions are.

And I am going to show you this with 3 different ML models:

DecisionTreeClassifier()
SVC()
KNeighborsClassifier()

Load the Data

But first, let's load a dataset from CIS executing the lines of code below:

The goal of this dataset is
To predict internet_usage of people (rows)
Based on their socio-demographical characteristics (columns)

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jsulopz/data/main/internet_usage_spain.csv')
df.head()

	internet_usage	sex	age	education
0	0	Female	66	Elementary
1	1	Male	72	Elementary
2	1	Male	48	University
3	0	Male	59	PhD
4	1	Female	44	PhD

Data Preprocessing

We need to transform the categorical variables to dummy variables before computing the models:

df = pd.get_dummies(df, drop_first=True)
df.head()

Feature Selection

Now we separate the variables on their respective role within the model:

target = df.internet_usage
explanatory = df.drop(columns='internet_usage')

ML Models

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X=explanatory, y=target)
pred_dt = model.predict(X=explanatory)
accuracy_dt = model.score(X=explanatory, y=target)

Support Vector Machines

from sklearn.svm import SVC

model = SVC()
model.fit(X=explanatory, y=target)
pred_sv = model.predict(X=explanatory)
accuracy_sv = model.score(X=explanatory, y=target)

K Nearest Neighbour

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X=explanatory, y=target)
pred_kn = model.predict(X=explanatory)
accuracy_kn = model.score(X=explanatory, y=target)

The only thing that changes is the result of the predictions, because the models are different. But they all follow the same steps that we described at the beginning:

model.fit() to compute the mathematical formula of the model
model.predict() to calculate predictions through the mathematical formula
model.score() to get the success ratio of the model
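One way to see why the steps never change is to write a tiny model by hand with the same three functions. The class below is a toy made up for this sketch (it is not part of scikit-learn): it always predicts the most common class. The ages and internet usage values are taken from the first five rows of the table above.

```python
class MajorityClassifier:
    """Toy model: always predicts the most common class seen in fit()."""

    def fit(self, X, y):
        # "Training" here just means remembering the most frequent target value
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        # One prediction per row, always the same value
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Share of predictions that match the real values (accuracy)
        predictions = self.predict(X)
        hits = sum(p == real for p, real in zip(predictions, y))
        return hits / len(y)


X = [[66], [72], [48], [59], [44]]  # age
y = [0, 1, 1, 0, 1]                 # internet_usage

model = MajorityClassifier()
model.fit(X, y)
predictions = model.predict(X)
accuracy = model.score(X, y)
print(predictions, accuracy)
```

Any object that offers fit(), predict() and score() can be swapped in without touching the rest of the code, which is the design idea the scikit-learn models above share.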

Comparing Predictions

You may observe in the following table how the different models make different predictions, which sometimes don't coincide with reality (misclassification).

For example, the SVC model doesn't correctly predict row 214: it predicts that this person used the internet (pred_svm=1), but they didn't (internet_usage is 0 for row 214).

df_pred = pd.DataFrame({'internet_usage': df.internet_usage,
                        'pred_dt': pred_dt,
                        'pred_svm': pred_sv,
                        'pred_knn': pred_kn})
df_pred.sample(10, random_state=7)

	internet_usage	pred_dt	pred_svm	pred_knn
214	0	0	1	0
2142	1	1	1	1
1680	1	0	0	0
1522	1	1	1	1
325	1	1	1	1
2283	1	1	1	1
1263	0	0	0	0
993	0	0	0	0
26	1	1	1	1
2190	0	0	0	0
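Instead of scanning the table by eye, the misclassified rows can be picked out with a boolean mask. This is a minimal sketch that retypes the ten sampled rows shown above (three of the columns) rather than recomputing the models.

```python
import pandas as pd

# Values copied from the sample of predictions shown above
df_sample = pd.DataFrame(
    {'internet_usage': [0, 1, 1, 1, 1, 1, 0, 0, 1, 0],
     'pred_dt':        [0, 1, 0, 1, 1, 1, 0, 0, 1, 0],
     'pred_svm':       [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]},
    index=[214, 2142, 1680, 1522, 325, 2283, 1263, 993, 26, 2190])

# Boolean mask: True where the SVM prediction differs from reality
wrong_svm = df_sample['pred_svm'] != df_sample['internet_usage']

# Row labels of the misclassified people
misclassified = df_sample[wrong_svm].index.tolist()
print(misclassified)
```

The same comparison works for any other prediction column by changing the column name in the mask.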

Choose Best Model

Then, we could choose the model with the highest share of correct predictions (accuracy).

df_accuracy = pd.DataFrame({'accuracy': [accuracy_dt, accuracy_sv, accuracy_kn]},
                           index=['DecisionTreeClassifier()', 'SVC()', 'KNeighborsClassifier()'])
df_accuracy

	accuracy
DecisionTreeClassifier()	0.859878
SVC()	0.783707
KNeighborsClassifier()	0.827291

Which is the best model here?

Let me know in the comments below

You Need to Have Flexible Thinking when Programming

Jesús López — Wed, 29 Sep 2021 09:50:02 GMT

If you are trying to solve a problem with programming, you may have several solutions to get the same result.

It's a basic idea that we don't grasp at the beginning because we look for the perfect solution.

It doesn't exist.

It would be best to start thinking about choosing "one" option, not "the" option.

Let's say that we are facing the following problem: visualise two variables with a scatterplot.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df.head()

In Python, you've got 3 libraries that can make a scatterplot:

matplotlib
seaborn
plotly

Let's observe the differences:

Matplotlib

import matplotlib.pyplot as plt

plt.scatter(x='total_bill', y='tip', data=df)

Seaborn

import seaborn as sns

sns.scatterplot(x='total_bill', y='tip', data=df)

Plotly

import plotly.express as px

px.scatter(data_frame=df, x='total_bill', y='tip')

Takeaways

matplotlib allows you to create custom plots, but you need to write more code.
seaborn automates the plot so that you don't need to write more lines. For example, seaborn added the x & y axis labels by default. matplotlib didn't.
plotly allows you to interact with the plot. Give it a try and hover the mouse over the points.

If you are to make a plot for an online post, you may like to use plotly due to its interactivity. Nevertheless, you wouldn't use it if you were writing a paper article.


Your First Lines of Code in Python

Jesús López — Tue, 28 Sep 2021 21:16:58 GMT

Programming is hard, especially at the beginning.

Don't make it any harder for yourself!

Start with Data Visualization.

It's easier to understand programming with visual changes than abstract coding ("make a program that prints even numbers").

Get on Jupyter, a code editor. Here is the link to [download the program](https://www.anaconda.com/products/individual).

Your first lines of code should be as follows:

import seaborn as sns

df = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', y='tip', data=df)

You would get a plot that should look like this one

To configure the behaviour of the function, you can change the code as follows:

sns.scatterplot(x='total_bill', y='tip', data=df, color='red')

This simple change helps you to understand a couple of core concepts in programming:

Functions sns.scatterplot() are used to make things in programming (a plot in this case).
You use parameters color='red' to configure the function's behaviour.
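The same two concepts can be seen in a function written from scratch. The helper below is invented for this sketch: instead of drawing anything, it returns a sentence describing the plot, so you can see how the color parameter changes the result and what happens when you leave it out.

```python
def describe_plot(x, y, color='blue'):
    """Made-up function: returns a sentence instead of drawing a plot."""
    return f'scatterplot of {y} against {x} in {color}'


# Without the parameter, the function uses its default value
default_plot = describe_plot('total_bill', 'tip')

# With the parameter, we configure the function's behaviour
red_plot = describe_plot('total_bill', 'tip', color='red')

print(default_plot)
print(red_plot)
```

Library functions work the same way: they come with default values, and each parameter you pass overrides one of them.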

Feel free and welcome to ask me anything in the comments below, it will be my pleasure to help you out 🤗

How to Be More Productive in Python

Jesús López — Mon, 27 Sep 2021 10:58:44 GMT

Do you spend a lot of time switching between the keyboard and the mouse?
Do you want to build skills to work more efficiently with Python?
Do you want to look like a hacker who does everything with shortcuts?

This is your tutorial!

1 Install Jupyter Lab

Here are a couple of tutorials to install the tool I use to work with Python, and the most recommended one: Jupyter Lab.

If you click on the video, it will take you to a playlist with two videos: one to install it on macOS (Apple) and another for Windows.

https://www.youtube.com/watch?v=gUE2vcA_qfw&list=PL8HtbO24Pl3jOg_sW_OyC09rwsYyasgfk

2 Suggestions `tab`

We often don't know the exact spelling of function names.

Upper- and lower-case letters can vary, or we drop an "s" because the words are English.

These doubts are over with the following trick:

3 Function Instruction Manual `shift + tab`

We often turn to Google for help because we don't know what to put inside a function.

Well, if you press shift + tab with the cursor on any letter of the function, you will see a help panel.

This panel is an instruction manual on how to use the function.
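Outside Jupyter, roughly the same information can be pulled up in code. The sketch below uses Python's standard inspect module on json.dumps, a function chosen here only as an example; it is an alternative to the shift + tab panel, not how Jupyter builds it.

```python
import inspect
import json

# The parameters the function accepts, with their default values
signature = inspect.signature(json.dumps)

# First line of the function's documentation
summary = inspect.getdoc(json.dumps).splitlines()[0]

print(signature)
print(summary)
```

The built-in help(json.dumps) prints the full manual in one go.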

4 Adopt this Discipline for Good

Curiously, we carry out routine actions the same way we first learned them.

It will cost us more to undo bad habits than to acquire good ones.

The two tips I'm sharing will make the machine work for you, because you will know what can and cannot be done.

You will avoid getting lost when you use functions, or when you want to import an object or a library.

So it's worth using them every time you get the chance, to adopt them as a habit.

Remember:

tab for suggestions.
shift + tab for help.

Use them the next time you get the chance.

For a more detailed and dynamic explanation, I recommend watching this video

https://www.youtube.com/watch?v=QmkwHzGztZ8
