# #04 | Data Visualization in Python ## Possibilities Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data. Let's load a dataset that contains information from countries (rows) considering socio-demographic and economic variables (columns). ```python import plotly.express as px df_countries = px.data.gapminder() df_countries ``` ![df1.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450030628/93RuebFN-.jpeg align="center") Python contains 3 main libraries for Data Visualization: 1. **Matplotlib** (Mathematical Plotting) 2. **Seaborn** (High-Level based on Matplotlib) 3. **Plotly** (Animated Plots) I love `plotly` because the Visualizations are interactive; you may hover the mouse over the points to get information from them: ```python df_countries_2007 = df_countries.query('year == 2007') px.scatter(data_frame=df_countries_2007, x='gdpPercap', y='lifeExp', color='continent', hover_name='country', size='pop') ``` You can even animate the plots with a simple parameter. Click on play ↓ PS: The following example is taken from the [official plotly library website](https://plotly.com/python/animations/): ```python px.scatter(df_countries, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country", size="pop", color="continent", hover_name="country", log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90]) ``` In this article, we'll dig into the details of Data Visualization in Python to build up the required knowledge and develop awesome visualizations like the ones we saw before. ## Matplotlib Matplotlib is a library used for Data Visualization. We use the **sublibrary** (module) `pyplot` from `matplotlib` library to access the functions. ```python import matplotlib.pyplot as plt ``` Let's make a bar plot: ```python plt.bar(x=['Real Madrid', 'Barcelona', 'Bayern Munich'], height=[14, 5, 6]); ``` ![plot3.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450042002/-BHmEW0R1.jpeg align="center") We could have also done a point plot: ```python plt.scatter(x=['Real Madrid', 'Barcelona', 'Bayern Munich'], y=[14, 5, 6]); ``` ![plot4.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450049027/K9r0_sItv.jpeg align="center") But it doesn't make sense with the data we have represented. ## Visualize DataFrame Let's create a DataFrame: ```python teams = ['Real Madrid', 'Barcelona', 'Bayern Munich'] uefa_champions = [14, 5, 6] import pandas as pd df_champions = pd.DataFrame(data={'Team': teams, 'UEFA Champions': uefa_champions}) df_champions ``` ![df2.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450056486/ZZrHv6YS9.jpeg align="center") And visualize it using: ### Matplotlib functions ```python plt.bar(x=df_champions['Team'], height=df_champions['UEFA Champions']); ``` ![plot5.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450062363/H8tIp55VX.jpeg align="center") ### DataFrame functions ```python df_champions.plot.bar(x='Team', y='UEFA Champions'); ``` ![plot6.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450067540/yaWpEAiBo.jpeg align="center") ## Seaborn Let's read another dataset: the Football Premier League classification for 2021/2022. ```python df_premier = pd.read_excel(io='../data/premier_league.xlsx') df_premier ``` ![df3.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450074319/K13-WzKIn.jpeg align="center") We will visualize a point plot, from now own **scatter plot** to check if there is a relationship between the number of goals scored `F` versus the Points `Pts`. ```python import seaborn as sns sns.scatterplot(x='F', y='Pts', data=df_premier); ``` ![plot7.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450084414/6pEO0Tnz8.jpeg align="center") Can we do the same plot with matplotlib `plt` library? ```python plt.scatter(x='F', y='Pts', data=df_premier); ``` ![plot8.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450090274/CTVwvSXeW.jpeg align="center") Which are the differences between them? 1. The points: `matplotlib` points are bigger than `seaborn` ones 2. The axis labels: `matplotlib` axis labels are non-existent, whereas `seaborn` places the names of the columns From which library do the previous functions return the objects? ```python seaborn_plot = sns.scatterplot(x='F', y='Pts', data=df_premier); ``` ![plot9.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450095630/8TK71W4WW.jpeg align="center") ```python matplotlib_plot = plt.scatter(x='F', y='Pts', data=df_premier); ``` ![plot10.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450100318/72r2ihwNA.jpeg align="center") ```python type(seaborn_plot) ``` matplotlib.axes.\_subplots.AxesSubplot ```python type(matplotlib_plot) ``` matplotlib.collections.PathCollection Why does `seaborn` returns a `matplotlib` object? Quoted from the [seaborn](https://seaborn.pydata.org/) official website: > Seaborn is a Python data visualization library **based on matplotlib**. It provides a **high-level\* interface** for drawing attractive and informative statistical graphics. \*High-level means the communication between humans and the computer is easier to understand than low-level communication, which goes through 0s and 1s. Could you place the names of the teams in the points? ```python plt.scatter(x='F', y='Pts', data=df_premier) for idx, data in df_premier.iterrows(): plt.text(x=data['F'], y=data['Pts'], s=data['Team']) ``` ![plot11.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450109125/lhXGDo_gn.jpeg align="center") It isn't straightforward. Is there an easier way? Yes, you may use an interactive plot with `plotly` library and display the name of the Team as you hover the mouse on a point. ## Plotly We use the `express` module within `plotly` library to access the functions of the plots: ```python import plotly.express as px px.scatter(data_frame=df_premier, x='F', y='Pts', hover_name='Team') ``` > Learn how to become an independent Data Analyist programmer who knows [how to extract meaningful insights from Data Visualizations](https://courses.resolvingpython.com/data-analysis-visualization). ## Types of Plots Let's read another dataset: the sociological data of clients in a restaurant. ```python df_tips = sns.load_dataset(name='tips') df_tips ``` ![df4.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450118522/mOoblL0AU.jpeg align="center") ### One Column #### Categorical Column ```python df_tips.sex ``` 0 Female 1 Male ... 242 Male 243 Female Name: sex, Length: 244, dtype: category Categories (2, object): \['Male', 'Female'\] We need to summarise the data first; we count how many `Female` and `Male` people are in the dataset. ```python df_tips.sex.value_counts() ``` Male 157 Female 87 Name: sex, dtype: int64 ```python sr_sex = df_tips.sex.value_counts() ``` ##### Barplot Let's place bars equal to the number of people from each gender: ```python px.bar(x=sr_sex.index, y=sr_sex.values) ``` We can also colour the bars based on the category: ```python px.bar(x=sr_sex.index, y=sr_sex.values, color=sr_sex.index) ``` ##### Pie plot Let's put the same data into a pie plot: ```python px.pie(names=sr_sex.index, values=sr_sex.values, color=sr_sex.index) ``` #### Numerical Column ```python df_tips.total_bill ``` 0 16.99 1 10.34 ... 242 17.82 243 18.78 Name: total\_bill, Length: 244, dtype: float64 ##### Histogram Instead of observing the numbers, we can visualize the distribution of the bills in a **histogram**. We can observe that most people pay between 10 and 20 dollars. Whereas a few are between 40 and 50. ```python px.histogram(x=df_tips.total_bill) ``` We can also create a **boxplot** where the limits of the boxes indicate the 1st and 3rd quartiles. The 1st quartile is 13.325, and the 3rd quartile is 24.175. Therefore, 50% of people were billed an amount between these limits. ##### Boxplot ```python px.box(x=df_tips.total_bill) ``` ### Two Columns ```python df_tips[['total_bill', 'tip']] ``` ![df5.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450139175/3HOR3PCVa.jpeg align="center") #### Numerical & Numerical We use a scatter plot to see if a relationship exists between two numerical variables. Do the points go up as you move the eyes from left to right? As you may observe in the following plot: the higher the amount of the bill, the higher the tip the clients leave for the staff. ```python px.scatter(x='total_bill', y='tip', data_frame=df_tips) ``` Another type of visualization for 2 continuous variables: ```python px.density_contour(x='total_bill', y='tip', data_frame=df_tips) ``` #### Numerical & Categorical ```python df_tips[['day', 'total_bill']] ``` ![df6.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450147573/Jlfy9YY8T.jpeg align="center") We can summarise the data around how much revenue was generated in each day of the week. ```python df_tips.groupby('day').total_bill.sum() ``` day Thur 1096.33 Fri 325.88 Sat 1778.40 Sun 1627.16 Name: total\_bill, dtype: float64 ```python sr_days = df_tips.groupby('day').total_bill.sum() ``` We can observe that Saturday is the most profitable day as people have spent more money. ```python px.bar(x=sr_days.index, y=sr_days.values) ``` ```python px.bar(x=sr_days.index, y=sr_days.values, color=sr_days.index) ``` #### Categorical & Categorical ```python df_tips[['day', 'size']] ``` ![df7.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1659450157108/3O2gIAOJf.jpeg align="center") Which combination of day-size is the most frequent table you can observe in the restaurant? The following plot shows that Saturdays with 2 people at the table is the most common phenomenon at the restaurant. They could create an advertisement that targets couples to have dinner on Saturdays and make more money. ```python px.density_heatmap(x='day', y='size', data_frame=df_tips) ``` ## Awesome Plots The following examples are taken directly from [plotly](https://plotly.com/python/). ```python df_gapminder = px.data.gapminder() px.scatter_geo(df_gapminder, locations="iso_alpha", color="continent", #! hover_name="country", size="pop", animation_frame="year", projection="natural earth") ``` ```python import plotly.express as px df = px.data.election() geojson = px.data.election_geojson() fig = px.choropleth_mapbox(df, geojson=geojson, color="Bergeron", locations="district", featureidkey="properties.district", center={"lat": 45.5517, "lon": -73.7073}, mapbox_style="carto-positron", zoom=9) fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0}) ``` ```python import plotly.express as px df = px.data.election() geojson = px.data.election_geojson() fig = px.choropleth_mapbox(df, geojson=geojson, color="winner", locations="district", featureidkey="properties.district", center={"lat": 45.5517, "lon": -73.7073}, mapbox_style="carto-positron", zoom=9) fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0}) ```