Data Visualization with Matplotlib and Seaborn

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Import Libraries

import pandas as pd
import owncloud
from pathlib import Path

Download Data

Path('data').mkdir(exist_ok=True, parents=True)

owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/dDoQxwqdYXAJpw5', folder_password="ibots"
).get_file('/', 'data/gapminder.csv')
True

Data visualizations are a key component of every scientific publication. In this session, we are going to learn how to visualize data using the libraries Matplotlib and Seaborn. We will explore both libraries and how they interact by analyzing the Gapminder data set, which contains data on population size, life expectancy and fertility from 63 countries over a time span of 50 years.

Section 1: Plotting Time-Series Data with Matplotlib

One very common kind of data is the time series: sequential points sampled at regular intervals. To plot time-series data, we can use the function plt.plot(x, y), where the variable x contains the time points and y contains the values sampled at those time points. To explore this, we are going to visualize changes in fertility rates over time in different countries. By default, Matplotlib will show plots immediately after executing a cell. Thus, if we wish to do several things, like drawing a plot and labeling it, we’ll have to include the respective commands in the same cell.

Code Description
from matplotlib import pyplot as plt Import pyplot module from the matplotlib library under the alias plt
plt.plot(x, y) Plot the points at the given x and y coordinates and connect them with a line
plt.plot(x, y, label="label1") Add the label "label1" to the plotted data
plt.plot(x, y, marker="x") Mark the data points with a cross
plt.plot(x, y, color="black", linewidth=2) Set the color to "black" and the linewidth to 2
plt.legend() Add a legend that displays the labels in the plot
plt.xlabel("xval") Label the x-axis with "xval"
plt.ylabel("yval") Label the y-axis with "yval"
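
As a minimal sketch of how these commands combine in a single cell, the example below plots two short time series. The values and the names years, series_a and series_b are made up for illustration and are not taken from the Gapminder data.

```python
from matplotlib import pyplot as plt

# Made-up time points and values, for illustration only
years = [2000, 2005, 2010, 2015]
series_a = [1.0, 1.5, 1.3, 1.8]
series_b = [2.0, 1.7, 1.9, 1.6]

# Each call to plt.plot() adds one line to the current plot
plt.plot(years, series_a, marker="x", label="series A")
plt.plot(years, series_b, color="black", linewidth=2, label="series B")

# Labels and legend go in the same cell as the plotting calls
plt.xlabel("Year")
plt.ylabel("Value")
plt.legend()

# The axes that the commands above drew on
ax = plt.gca()
print(len(ax.get_lines()), ax.get_xlabel(), ax.get_ylabel())
```

Because both plt.plot() calls run before the plot is shown, the two lines end up in the same plot and share the axis labels and the legend.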

Exercise: Run the cell below to load the "gapminder.csv" file, assign it to a data frame df and filter it to extract the data for "Germany" and the "United States". Then, print the .head() of both data frames.

Solution
df = pd.read_csv("data/gapminder.csv")
df_ger = df[df["country"]=="Germany"]
df_usa = df[df["country"]=="United States"]
df_ger.head()
year country pop life_expect fertility continent
275 1955 Germany 70195612 69.1 2.30 Europe
276 1960 Germany 72480869 70.3 2.49 Europe
277 1965 Germany 75638851 70.8 2.32 Europe
278 1970 Germany 77783164 71.0 1.64 Europe
279 1975 Germany 78682325 72.5 1.52 Europe
df_usa.head()
year country pop life_expect fertility continent
671 1955 United States 165931000 69.49 3.706 North America
672 1960 United States 180671000 70.21 3.314 North America
673 1965 United States 194303000 70.76 2.545 North America
674 1970 United States 205052000 71.34 2.016 North America
675 1975 United States 215973000 73.38 1.788 North America

Exercise: Import the pyplot module from the matplotlib library under the alias plt.

Solution
import matplotlib.pyplot as plt

Example: Plot the population size ("pop") across time ("year") for Germany (df_ger) and label the x- and y-axis with "Year" and "Population Size".

plt.plot(df_ger['year'], df_ger['pop'])
plt.xlabel("Year")
plt.ylabel("Population Size")

Exercise: Plot the fertility rates ("fertility") across time ("year") for Germany (df_ger)

Solution
plt.plot(df_ger['year'], df_ger['fertility'])

Exercise: Plot the fertility rates ("fertility") across time ("year") for Germany (df_ger) and set the linewidth to 2.5 and the color to green.

Solution
plt.plot(df_ger['year'], df_ger['fertility'], color = 'g', linewidth = 2.5)

Exercise: Plot the fertility rates ("fertility") across time ("year") for the United States (df_usa) and add a marker.

Solution
plt.plot(df_usa['year'], df_usa['fertility'], linestyle = '--', marker = 'o')

Exercise: Plot the same data but add the labels "Year" and "Fertility Rate" to the x- and y-axis.

Solution
plt.plot(df_usa['year'], df_usa['fertility'], marker = 'x')
plt.xlabel('Year')
plt.ylabel('Fertility Rate')

Exercise: Plot the fertility rates ("fertility") across time ("year") for Germany (df_ger) and the United States (df_usa) and label the x- and y-axis with "Year" and "Fertility Rate". Add a label to each line and add a plt.legend().

Solution
plt.plot(df_ger['year'], df_ger['fertility'], color = 'g', linewidth = 2.5, label = 'Germany')
plt.plot(df_usa['year'], df_usa['fertility'], marker = 'x', label='USA')
plt.xlabel('Year')
plt.ylabel('Fertility Rate')
plt.legend()

Section 2: Creating Multiple Subplots

Often, we want to visualize different kinds of data side by side. For example, in our example data set, it would be interesting to plot the changes in fertility rate together with changes in population size. This can be done with the plt.subplot() function. plt.subplot takes three integer numbers as arguments. The first two numbers determine rows and columns in the grid of subplots. The third number determines the position of the current subplot in that grid, counting from left to right and from top to bottom. To create multiple subplots, we call plt.subplot() once, call the Matplotlib commands we want to execute (e.g. plt.plot() or plt.xlabel()) and then call plt.subplot() again to create the next one. Once we are done, we can save our result to an image using the plt.savefig() function.

Code Description
plt.subplot(2, 2, 1) Draw the first (i.e. upper left) subplot in a 2-by-2 grid
plt.subplot(2, 2, 4) Draw the fourth (i.e. lower right) subplot in a 2-by-2 grid
plt.tight_layout() Adjust the layout so that the subplots don’t overlap
plt.savefig("myfig.png", dpi=300) Store the current figure as "myfig.png" with a resolution of 300 dpi (dots per inch)
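
The counting order of the third argument can be checked with a short sketch. It creates the subplot with index 4 in a 2-by-3 grid, reads back its zero-based row and column, and saves the figure. The file name example_grid.png and the temporary directory are assumptions chosen so that no existing file is overwritten.

```python
import tempfile
from pathlib import Path
from matplotlib import pyplot as plt

fig = plt.figure()

# Index 4 in a 2-by-3 grid: counting left to right, then top to bottom,
# this is the first subplot of the second row
ax = plt.subplot(2, 3, 4)
spec = ax.get_subplotspec()
row, col = spec.rowspan.start, spec.colspan.start  # zero-based row and column
print(row, col)

plt.tight_layout()

# Save into a temporary directory under a hypothetical file name
out_path = Path(tempfile.mkdtemp()) / "example_grid.png"
plt.savefig(out_path, dpi=300)
print(out_path.exists())
```

The printed row and column are zero-based, so the second row appears as row 1.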

Exercises

Example: Create a 2-by-3 grid of (empty) subplots

plt.subplot(2,3,1)
plt.subplot(2,3,2)
plt.subplot(2,3,3)
plt.subplot(2,3,4)
plt.subplot(2,3,5)
plt.subplot(2,3,6)

Example: Create a 2-by-1 grid of (empty) subplots and label their x-axes as "x1" and "x2".

plt.subplot(2,1,1)
plt.xlabel("x1")
plt.subplot(2,1,2)
plt.xlabel("x2")

Exercise: Create a 1-by-2 grid of (empty) subplots and label their y-axes as "y1" and "y2".

Solution
plt.subplot(1,2,1)
plt.ylabel('y1')
plt.subplot(1,2,2)
plt.ylabel('y2')
plt.tight_layout()

Exercise: Create a 1-by-2 grid of subplots. On the first one, plot the "fertility" rate and, on the second one, the population size ("pop") over time ("year") for Germany (df_ger). Label the y-axes "Fertility Rate" and "Population Size" and the x-axes "Year".

Solution
plt.subplot(1,2,1)
plt.plot(df_ger['year'], df_ger['fertility'])
plt.xlabel('Year')
plt.ylabel('Fertility Rate')

plt.subplot(1,2,2)
plt.plot(df_ger['year'], df_ger['pop'])
plt.xlabel('Year')
plt.ylabel('Population Size')

Exercise: Re-create the plot from the previous exercise but call plt.tight_layout() at the end.

Solution
plt.subplot(1,2,1)
plt.plot(df_ger['year'], df_ger['fertility'])
plt.xlabel('Year')
plt.ylabel('Fertility Rate')
plt.subplot(1,2,2)
plt.plot(df_ger['year'], df_ger['pop'])
plt.xlabel('Year')
plt.ylabel('Population Size')
plt.tight_layout()

Exercise: Create a 3-by-1 grid of subplots. On the first one, plot the "fertility" rate, on the second one the population size ("pop"), and on the third one the life expectancy ("life_expect") over time ("year") for the United States (df_usa). Label the y-axes with "Fertility Rate", "Population Size" and "Life Expectancy". Label the x-axis with "Year" only for the last subplot.

Solution
plt.subplot(3,1,1)
plt.plot(df_usa['year'], df_usa['fertility'])
plt.ylabel('Fertility Rate')
plt.subplot(3,1,2)
plt.plot(df_usa['year'], df_usa['pop'])
plt.ylabel('Population Size')
plt.subplot(3,1,3)
plt.plot(df_usa['year'], df_usa['life_expect'])
plt.ylabel('Life Expectancy')
plt.xlabel('Year')
plt.tight_layout()

Exercise: Re-create the figure from the previous exercise and save it to a file called gapminder_us.png with 300 dpi. Then, open the image with your file browser to verify that the image was saved correctly.

Solution
plt.subplot(3,1,1)
plt.plot(df_usa['year'], df_usa['fertility'])
plt.ylabel('Fertility Rate')
plt.subplot(3,1,2)
plt.plot(df_usa['year'], df_usa['pop'])
plt.ylabel('Population Size')
plt.subplot(3,1,3)
plt.plot(df_usa['year'], df_usa['life_expect'])
plt.ylabel('Life Expectancy')
plt.xlabel('Year')
plt.tight_layout()

plt.savefig('gapminder_us.png', dpi=300)

Section 3: Plotting and Quantifying the Relationship between Variables

Visualizations are great for understanding the relationship between variables. However, we may also want to quantify that relationship statistically. A simple way to do this is to fit a linear model to the data. The linear model describes the relationship between an independent variable x and a dependent variable y as a line, y = a + b * x. That line has two parameters: the intercept a and the slope b. a is the value of y when x==0 and b is the change in y that corresponds to a unit change in x. In this section, we will use the linregress() function from the scipy.stats module to model the relationship between life expectancy and fertility in the Gapminder data set. Finally, we will plot the estimated linear model together with the data.

Code Description
plt.scatter(x, y) Create a scatter plot with points at the coordinates x and y
from scipy.stats import linregress Import the linregress function from the scipy.stats module
results = linregress(x, y) Compute the linear regression between the variables x and y and assign the returned value to a variable results
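
To make the roles of slope and intercept concrete, here is a minimal sketch that fits a line to made-up data lying exactly on y = 1 + 2 * x, so the fitted intercept and slope should come out close to 1 and 2. The arrays x and y are illustrative assumptions and not part of the Gapminder data.

```python
import numpy as np
from scipy.stats import linregress

# Made-up data that lie exactly on the line y = 1 + 2 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

results = linregress(x, y)

# The result can be indexed (slope first, intercept second) ...
slope, intercept = results[0], results[1]
# ... or read through named attributes
print(results.slope, results.intercept)

# Points predicted by the fitted line
y_pred = intercept + slope * x
```

Reading the values through results.slope and results.intercept gives the same numbers as indexing and avoids having to remember their position in the result.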

Exercises

Example: Create a scatter plot to visualize the relationship between life expectancy ("life_expect") and fertility rates ("fertility") for the whole Gapminder data set (df) and label the axes with "Life Expectancy in Years" and "Fertility Rate".

plt.scatter(df['life_expect'], df['fertility'])
plt.xlabel('Life Expectancy in Years')
plt.ylabel('Fertility Rate')

Run the cell below to get the data for Asia and North America, respectively, and put them in separate data frames df_asia and df_noram.

df_asia = df[df['continent'] == 'Asia']
df_noram = df[df['continent'] == 'North America']

Exercise: Create a scatter plot to visualize the relationship between life expectancy ("life_expect") and fertility rates ("fertility") for Asia (df_asia) and label the axes with "Life Expectancy in Years" and "Fertility Rate".

Solution
plt.scatter(df_asia['life_expect'], df_asia['fertility'])
plt.xlabel('Life Expectancy in Years')
plt.ylabel('Fertility Rate')

Exercise: Create a scatter plot to visualize the relationship between life expectancy ("life_expect") and fertility rates ("fertility") for both Asia (df_asia) and North America (df_noram) together in the same plot. Label the axes with "Life Expectancy in Years" and "Fertility Rate".

Solution
plt.scatter(df_asia['life_expect'], df_asia['fertility'])
plt.scatter(df_noram['life_expect'], df_noram['fertility'])
plt.xlabel('Life Expectancy in Years')
plt.ylabel('Fertility Rate')

Exercise: Import the linregress function from the scipy.stats module.

Solution
from scipy.stats import linregress

Example: Compute the linear regression between life expectancy ("life_expect") and population size ("pop") and store the returned value in a variable called results. Then, extract the p-value which is stored in the 4th element of results, assign it to a variable p and print that variable.

results = linregress(df["life_expect"], df["pop"])
p = results[3]
p
0.10897571814366507

Exercise: Compute the linear regression between life expectancy ("life_expect") and fertility rate ("fertility") and store the returned value in a variable called results. Then, extract the slope and intercept, which are stored in the 1st and 2nd element of results, assign them to variables slope and intercept, and print those variables.

Solution
results = linregress(df["life_expect"], df["fertility"])
slope, intercept = results[0], results[1]
slope, intercept
(-0.1591172687736899, 14.130790658504012)

Exercise: Execute the cell below to create the line() function, which takes in x values, the slope, and the intercept, and returns the corresponding y-values. Then call line() and pass the life expectancy values in df["life_expect"], the slope and the intercept to obtain the fertility rates predicted by the model. Store those in a new variable called fertility_pred.

Solution
def line(x, slope, intercept):
    """
    Return points on a line for the given x coordinates.
    Arguments:
        x: x-coordinates of the data points.
        slope: slope of the line.
        intercept: intercept of the line.
    Returns:
        y: y-coordinates of the data points.
    """
    y = intercept + slope * x
    return y
fertility_pred = line(df['life_expect'], slope, intercept)

Exercise: Plot the predicted fertility (fertility_pred) against the life expectancy (df['life_expect']) in a standard line plot (plt.plot()). Label the x- and y-axis "Life Expectancy in Years" and "Fertility Rate of the Population", respectively.

Solution
plt.plot(df['life_expect'], fertility_pred)
plt.xlabel('Life Expectancy in Years')
plt.ylabel('Fertility Rate of the Population')

Exercise: Create a scatter plot to visualize the relationship between life expectancy ("life_expect") and fertility rates ("fertility"). Then, plt.plot() the predicted fertility rates against the life expectancy in the same plot. Finally, label the axes with "Life Expectancy in Years" and "Fertility Rate".

Hint: Change the color of the predicted fertility line if it’s hard to see.

Solution
plt.scatter(df['life_expect'], df['fertility'])
plt.plot(df['life_expect'], fertility_pred, 'r')
plt.xlabel('Life Expectancy in Years')
plt.ylabel('Fertility Rate')

Section 4: Combining Matplotlib with Seaborn

Matplotlib takes an imperative approach to visualization where we specify exactly what we want to draw. This provides fine-grained control but also requires us to write more code. In contrast, Seaborn takes a more declarative approach where we specify what data we want to visualize. This allows us to create detailed visualizations without having to worry about the low-level implementation. Finally, we can combine the advantages of both approaches and use Seaborn to create and Matplotlib to customize our visualizations. In this section, we will use Seaborn to create detailed visualizations of the Gapminder data. Then, we’ll draw these visualizations on subplots created with Matplotlib and customize them.

Code Description
sns.kdeplot(data=df, x="var1", hue="var2") Plot the kernel density estimate (kde) for variable "var1" from df and add a hue to encode "var2"
sns.scatterplot(data=df, x="var1", y="var2", hue="var3") Plot "var1" against "var2" in a scatterplot and add a hue to encode "var3"
ax1 = plt.subplot(1,2,1) Create the first subplot in a 1-by-2 grid and assign the returned object to a variable ax1
sns.scatterplot(data=df, x="var1", y="var2", ax=ax1) Plot "var1" against "var2" in a scatterplot on subplot ax1
ax1.annotate("X", xy=(0.5,0.5), xycoords="axes fraction", fontsize = 18) Plot the letter "X" on subplot ax1 at xy coordinates (0.5,0.5) defined in fractions of an axis (i.e. in the middle of the plot). Fontsize is an optional argument.
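
The division of labor described above can be sketched as follows: Seaborn draws onto a Matplotlib subplot, and Matplotlib methods then customize it. The data frame toy and its column names are made up for illustration.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Made-up data frame, for illustration only
toy = pd.DataFrame({
    "var1": [1.0, 2.0, 3.0, 4.0],
    "var2": [4.0, 3.0, 2.5, 1.0],
    "group": ["a", "a", "b", "b"],
})

fig = plt.figure()
ax1 = plt.subplot(1, 2, 1)

# Seaborn draws onto the Matplotlib subplot passed via ax and returns that same object
returned = sns.scatterplot(data=toy, x="var1", y="var2", hue="group", ax=ax1)

# Matplotlib methods then customize the subplot
ax1.set(xlabel="Variable 1", ylabel="Variable 2")
ax1.annotate("A", xy=(0.03, 0.94), xycoords="axes fraction", fontsize=18)
plt.tight_layout()
print(returned is ax1, ax1.get_xlabel())
```

Since sns.scatterplot() returns the axes it drew on, any Matplotlib axes method can be applied to the result afterwards.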

Exercise: Import the seaborn library under the alias sns

Solution
import seaborn as sns

Exercise: Create a sns.scatterplot() to visualize the relationship between life expectancy ("life_expect") and "fertility" rates for the Gapminder data and add hue to encode the "continent".

Solution
sns.scatterplot(data=df, x='life_expect', y='fertility', hue='continent')

Exercise: Create a sns.kdeplot() (kernel density estimate) to visualize the global distribution for "fertility" rates and add hue to encode the "year".

Solution
sns.kdeplot(data=df, x='fertility', hue='year')

Example: Create the first subplot in a 1-by-2 grid and assign the returned axes object to a variable called ax1. Then, create the scatter plot of life expectancy against fertility from above and draw it to the subplot by using the ax argument of sns.scatterplot().

ax1 = plt.subplot(1,2,1)
sns.scatterplot(x=df["life_expect"], y=df["fertility"], hue=df["continent"], ax=ax1)

Exercise: Create two subplots in a 1-by-2 grid and assign the returned axes to two variables ax1 and ax2. Then, create the scatter plot and the kde plot from the exercises above and draw them to the subplots ax1 and ax2 by using the ax argument of sns.scatterplot() and sns.kdeplot(). Use plt.tight_layout() if necessary.

Solution
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)

sns.scatterplot(x=df["life_expect"], y=df["fertility"], hue=df["continent"], ax=ax1)
sns.kdeplot(data=df, x='fertility', hue='year', ax=ax2)

plt.tight_layout()

Exercise: Re-create the figure from the previous exercise and use the .set() method to set the xlabel and ylabel of the scatter plot to "Life Expectancy in Years" and "Fertility Rate" and set the xlabel of the kde plot to "Fertility Rate".

Solution
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)

sns.scatterplot(x=df["life_expect"], y=df["fertility"], hue=df["continent"], ax=ax1)
sns.kdeplot(data=df, x='fertility', hue='year', ax=ax2)

ax1.set(xlabel='Life Expectancy in Years', ylabel='Fertility Rate')
ax2.set(xlabel='Fertility Rate', ylabel='Density')

plt.tight_layout()

Exercise: Re-create the plot from the previous exercise and use the .annotate() method to draw the letters "A" and "B" in the top left corner of the subplots.

Solution
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)

sns.scatterplot(x=df["life_expect"], y=df["fertility"], hue=df["continent"], ax=ax1)
sns.kdeplot(data=df, x='fertility', hue='year', ax=ax2)

ax1.set(xlabel='Life Expectancy in Years', ylabel='Fertility Rate')
ax2.set(xlabel='Fertility Rate', ylabel='Density')

ax1.annotate('A', xy=[0.03,0.94], xycoords="axes fraction", fontsize = 18)
ax2.annotate('B', xy=[0.03,0.94], xycoords="axes fraction", fontsize = 18)

plt.tight_layout()