#03 | Grouping & Pivot Tables

Learn how to summarise the information using the pandas DataFrame in order to become an efficient Data Analyst

Jesús López's photo
Jesús López
·Jul 22, 2022·

5 min read

#03 | Grouping & Pivot Tables

Subscribe to my newsletter and never miss my upcoming articles

Table of contents

© Jesús López 2022

Ask him any doubt on Twitter or LinkedIn

Possibilities

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information from transactions in tables (rows) at a restaurant considering socio-demographic and economic variables (columns).

import seaborn as sns

df_tips = sns.load_dataset('tips')
df_tips

df1.jpeg

Grouping data to summarise the information helps you identify conclusions. For example, the summary below shows that Dinners on Sundays come to the best customers because they:

  1. Spend more on average (\$21.41)
  2. Give more tips on average (\$3.25)
  3. Come more people at the same table on average (\$2.84)
df_tips.groupby(by=['day', 'time'])\
    .mean()\
    .fillna(0)\
    .style.format('{:.2f}').background_gradient(axis=0)

df2.jpeg

df_tips.groupby(by=['day', 'time'])\
    .mean()\
    .fillna(0)\
    .style.format('{:.2f}').bar(axis=0, width=50, align='zero')

df3.jpeg

Let's dig into the details of the .groupby() function from the basics in the following sections.

Grouping by 1 Column

We use the .groupby() function to generate an object that contains as many DataFrames as categories are in the column.

df_tips.groupby('sex')

As we have two groups in sex (Female and Male), the length of the DataFrameGroupBy object returned by the groupby() function is 2:

len(df_tips.groupby('sex'))

How can we work with the object DataFrameGroupBy?

Calculate the Average for All Columns

We use the .mean() function to get the average of the numerical columns for the two groups:

df_tips.groupby('sex').mean()

df4.jpeg

A pretty and simple syntax to summarise the information, right?

  • But what's going on inside the DataFrameGroupBy object?
df_tips.groupby('sex')
df_grouped = df_tips.groupby('sex')

The DataFrameGroupBy object contains 2 DataFrames. To see one of them DataFrame you need to use the function .get_group() and pass the group whose DataFrame you'd like to return:

df_grouped.get_group('Male')

df5.jpeg

df_grouped.get_group('Female')

df6.jpeg

As the DataFrameGroupBy distinguish the categories, at the moment we apply an aggregation function (click here to see a list of them), we will get the mathematical operations for those groups separately:

df_grouped.mean()

df7.jpeg

We could apply the function to each DataFrame separately. Although it is not the point of the .groupby() function.

df_grouped.get_group('Male').mean(numeric_only=True)
df_grouped.get_group('Female').mean(numeric_only=True)

Compute Functions to 1 Column

To get the results for just 1 column of interest, we access the column:

df_grouped.total_bill

And use the aggregation function we wish, .sum() in this case:

df_grouped.total_bill.sum()

We get the result for just 1 column (total_bill) because the DataFrames generated at .groupby() are accessed as if they were simple DataFrames:

df_grouped.get_group('Female')

df8.jpeg

df_grouped.get_group('Female').total_bill
df_grouped.get_group('Female').total_bill.sum()
df_grouped.get_group('Male').total_bill.sum()
df_grouped.total_bill.sum()

Grouping by 2 Columns

So far, we have summarised the data based on the categories of just one column. But, what if we'd like to summarise the data based on the combinations of the categories within different categorical columns?

Compute 1 Function

df_tips.groupby(by=['day', 'smoker']).sum()

df9.jpeg

Pivot Tables

We could have also used another function .pivot_table() to get the same numbers:

df_tips.pivot_table(index='day', columns='smoker', aggfunc='sum')

df10.jpeg

Which one is best?

  • I leave it up to your choice; I'd prefer to use the .pivot_table() because the syntax makes it more accessible.

Compute More than 1 Function

The thing doesn't stop here; we can even compute different aggregation functions at the same time:

Groupby

df_tips.groupby(by=['day', 'smoker'])\
    .total_bill\
    .agg(func=['sum', 'mean'])

df11.jpeg

Pivot Table

df_tips.pivot_table(index='day', columns='smoker',
                    values='total_bill', aggfunc=['sum', 'mean'])

df12.jpeg

dfres = df_tips.pivot_table(index='day', columns='smoker',
                    values='total_bill', aggfunc=['sum', 'mean'])

You could even style the output DataFrame:

dfres.style.background_gradient()

df13.jpeg

For me, it's nicer than styling the .groupby() returned DataFrame.

As we say in Spain:

Pa' gustos los colores!

df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])

df14.jpeg

dfres = df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])
dfres.style.background_gradient()

df15.jpeg

Pivot Tables in Depth

We can compute more than one mathematical operation:

df_tips.pivot_table(index='sex', columns='time',
                    aggfunc=['sum', 'mean'], values='total_bill')

df16.jpeg

And use more than one column in each of the parameters:

df_tips.pivot_table(index='sex', columns='time',
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

df17.jpeg

df_tips.pivot_table(index=['day', 'smoker'], columns='time',
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

df18.jpeg

df_tips.pivot_table(index=['day', 'smoker'], columns=['time', 'sex'],
                    aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])

df19.jpeg

The .size() Function

.groupby()

1 Variable to Group By

The .size() is a function used to count the number of rows (observations) in each of the DataFrames generated by .groupby().

df_grouped.size()

2 Variables to Group By

df_tips.groupby(by=['sex', 'time']).size()

.pivot_table()

We can use .pivot_table() to represent the data clearer:

df_tips.pivot_table(index='sex', columns='time', aggfunc='size')

df20.jpeg

Other Example 1

df_tips.pivot_table(index='smoker', columns=['day', 'sex'],aggfunc='size')

df21.jpeg

dfres = df_tips.pivot_table(index='smoker', columns=['day', 'sex'], aggfunc='size')
dfres.style.background_gradient()

df22.jpeg

Other Example 2

df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')

df23.jpeg

dfres = df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')
dfres.style.background_gradient()

df24.jpeg

We can even choose the way we'd like to gradient colour the cells:

  • axis=1: the upper value between the columns of the same row
  • axis=2: the upper value between the rows of the same column
dfres.style.background_gradient(axis=1)

df25.jpeg

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Did you find this article valuable?

Support Jesús López by becoming a sponsor. Any amount is appreciated!

See recent sponsors Learn more about Hashnode Sponsors
 
Share this