Intro to Visualization: Gapminder

Let's check out the gapminder data. We'll log-transform GDP per capita and population, since we care more about relative than absolute differences in these quantities.

In [148]:
import pandas as pd
import seaborn as sns
import numpy as np

gapminder = pd.read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv")
gapminder = gapminder.assign(
  log_gdp=lambda df: np.log(df["gdpPercap"]),
  log_pop=lambda df: np.log(df["pop"]),
  decade=lambda df: np.floor(df["year"] / 10) * 10
)
gapminder
Out[148]:
country year pop continent lifeExp gdpPercap log_gdp log_pop decade
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314 6.658583 15.946754 1950.0
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030 6.710344 16.039154 1950.0
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710 6.748878 16.144454 1960.0
3 Afghanistan 1967 11537966.0 Asia 34.020 836.197138 6.728864 16.261154 1960.0
4 Afghanistan 1972 13079460.0 Asia 36.088 739.981106 6.606625 16.386554 1970.0
... ... ... ... ... ... ... ... ... ...
1699 Zimbabwe 1987 9216418.0 Africa 62.351 706.157306 6.559838 16.036497 1980.0
1700 Zimbabwe 1992 10704340.0 Africa 60.377 693.420786 6.541637 16.186160 1990.0
1701 Zimbabwe 1997 11404948.0 Africa 46.809 792.449960 6.675129 16.249558 1990.0
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623 6.510316 16.294279 2000.0
1703 Zimbabwe 2007 12311143.0 Africa 43.487 469.709298 6.152114 16.326015 2000.0

1704 rows × 9 columns
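
To see why the log scale captures relative differences, here is a minimal sketch with made-up GDP per capita values (they are not taken from the dataset). Doubling a small value and doubling a large one give very different absolute gaps, but the same gap in logs.

```python
import numpy as np

# Hypothetical GDP-per-capita pairs; within each pair the second value is double the first.
low_pair = np.array([500.0, 1000.0])
high_pair = np.array([20000.0, 40000.0])

# Absolute gaps differ by a factor of 40.
abs_low = low_pair[1] - low_pair[0]
abs_high = high_pair[1] - high_pair[0]

# Gaps on the log scale are identical: both equal log(2).
log_low = np.log(low_pair[1]) - np.log(low_pair[0])
log_high = np.log(high_pair[1]) - np.log(high_pair[0])

print(abs_low, abs_high)
print(log_low, log_high)
```

On the log scale, equal distances correspond to equal ratios, which is usually the comparison we want for quantities like income or population.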

Let's focus on just one year (2002), to make these initial plots a bit easier to understand.

In [82]:
gapminder_sub = gapminder.loc[gapminder["year"] == 2002]
gapminder_sub
Out[82]:
country year pop continent lifeExp gdpPercap log_gdp log_pop
10 Afghanistan 2002 25268405.0 Asia 42.129 726.734055 6.588561 17.045065
22 Albania 2002 3508512.0 Europe 75.651 4604.211737 8.434727 15.070703
34 Algeria 2002 31287142.0 Africa 70.994 5288.040382 8.573203 17.258718
46 Angola 2002 10866106.0 Africa 41.003 2773.287312 7.927789 16.201159
58 Argentina 2002 38331121.0 Americas 74.340 8797.640716 9.082239 17.461773
... ... ... ... ... ... ... ... ...
1654 Vietnam 2002 80908147.0 Asia 73.017 1764.456677 7.475598 18.208825
1666 West Bank and Gaza 2002 3389578.0 Asia 72.370 4515.487575 8.415268 15.036216
1678 Yemen Rep. 2002 18701257.0 Asia 60.308 2234.820827 7.711916 16.744101
1690 Zambia 2002 10595811.0 Africa 39.193 1071.613938 6.976921 16.175969
1702 Zimbabwe 2002 11926563.0 Africa 39.989 672.038623 6.510316 16.294279

142 rows × 8 columns

One question we can ask is, how is GDP related to life expectancy?

In [45]:
sns.scatterplot(x="log_gdp", y="lifeExp", data=gapminder_sub)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x126f603c8>

How is population related? One answer is to encode population with the size of the points.

In [47]:
sns.scatterplot(x="log_gdp", y="lifeExp", size="log_pop", data=gapminder_sub)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1271d61d0>

Each point here is a country. On what continents are these countries?

In [48]:
sns.scatterplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    data=gapminder_sub
)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x12721d9b0>

It can sometimes be easier to understand variation across groups by explicitly separating them into different panels.

In [54]:
sns.relplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    col="continent",
    col_wrap=3,
    data=gapminder_sub
)
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x12871b668>

Let's make a plot of a categorical (country) against a continuous (log population) variable. Sorting the rows by population first gives the countries a meaningful order along the axis.

In [89]:
gapminder_sub = gapminder_sub.sort_values("log_pop") # <-- sorts the countries
plot = sns.relplot(
  x="country",
  y="log_pop",
  col="continent",
  hue="continent",
  facet_kws={"sharex": False},
  data=gapminder_sub
)

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
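
The sort matters because, as far as we can tell, string-valued categories are placed on the axis in the order they first appear in the data. The sketch below uses a tiny made-up table (the country labels and values are hypothetical) to show the order that sorting produces, plus one alternative that records the order explicitly in an ordered categorical column.

```python
import pandas as pd

# Toy data; labels and values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "log_pop": [16.2, 14.9, 18.1, 15.5],
})

# Sorting the rows fixes the order in which countries are first encountered.
toy_sorted = toy.sort_values("log_pop")
order = toy_sorted["country"].tolist()

# Alternative: store the order in the column itself, so it survives later re-sorting.
toy_sorted["country"] = pd.Categorical(
    toy_sorted["country"], categories=order, ordered=True
)

print(order)
```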

Just for fun, how did populations change over time?

In [93]:
gapminder = gapminder.sort_values(["year", "log_pop"]) # <-- sorts the countries
plot = sns.relplot(
    x="country",
    y="log_pop",
    col="continent",
    hue="year",
    facet_kws={"sharex": False},
    data=gapminder
)

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)

So far, we've been looking at the original datapoints, mostly filtered down to one year. Now that we're looking at all the timepoints, we see that it might be useful to reduce raw points to distributional summaries. Let's look at a histogram of the log GDP.

In [194]:
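# Note: distplot is deprecated since seaborn 0.11; histplot (with rugplot for the rug) is the suggested replacement.
sns.distplot(gapminder["log_gdp"], rug=True, kde=False, bins=40)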
Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x1526b6ac8>

Histograms count how many samples fall into each bin along one axis. A two-dimensional analogue is the "hexbin" plot, which counts samples in hexagonal bins over the plane.

In [195]:
sns.jointplot(x="log_gdp", y="lifeExp", data=gapminder, kind="hex", joint_kws={"gridsize": 40})
Out[195]:
<seaborn.axisgrid.JointGrid at 0x1525a4320>
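
To make the binning idea concrete, here is a minimal sketch of the counts behind such plots, using synthetic values as stand-ins for log_gdp and lifeExp (the numbers are simulated, not the gapminder data). Note that np.histogram2d uses rectangular bins, so it only approximates what a hexbin plot computes with hexagonal cells.

```python
import numpy as np

# Synthetic stand-ins for log_gdp (x) and lifeExp (y); purely illustrative.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(8.0, 1.2, size=n)
y = 40 + 4 * x + rng.normal(0, 5, size=n)

# One dimension: 40 bins along x, one count per bin.
counts_1d, edges_1d = np.histogram(x, bins=40)

# Two dimensions: a 40 x 40 grid of rectangular bins, one count per cell.
counts_2d, x_edges, y_edges = np.histogram2d(x, y, bins=40)

print(counts_1d.shape, counts_2d.shape)
```

In both cases every sample lands in exactly one bin, so the counts sum to the number of samples.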

How do things change over time, in the different regions? A first attempt might be to make an array of histograms, with decades arranged along the rows and continents along the columns.

In [172]:
plot = sns.FacetGrid(
    gapminder, col="continent", hue="continent", row="decade",
    height=2, gridspec_kws={"hspace": 0.01, "wspace": 0.0}
)
plot = plot.map(sns.distplot, "log_gdp", bins=20, kde=False)
plot.set_titles("")
plot.set_axis_labels("")
Out[172]:
<seaborn.axisgrid.FacetGrid at 0x14d3fa7f0>
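
It helps to know what each row of this grid pools together. The sketch below rebuilds the decade variable for the survey years that appear in the table (every five years from 1952 to 2007) and counts how many years land in each decade. The integer-division variant is an alternative we introduce here; it gives the same decades without converting them to floats.

```python
import numpy as np
import pandas as pd

# Survey years as they appear in the data: 1952, 1957, ..., 2007.
years = pd.Series(range(1952, 2008, 5))

# Same computation as in the notebook: floor to the start of the decade (float result).
decade_float = np.floor(years / 10) * 10

# Alternative using integer division, which keeps an integer dtype.
decade_int = (years // 10) * 10

# Number of survey years that fall into each decade.
per_decade = decade_int.value_counts().sort_index()
print(per_decade)
```

Each decade from 1950 to 2000 contains two survey years, so every histogram in a given row combines two snapshots of each country.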

This has some interesting information, but is extremely unwieldy (the mostly empty column for Oceania is especially irritating). Let's see how a violin plot can communicate the same information much more succinctly.

In [177]:
plot = sns.FacetGrid(gapminder, col="continent", hue="continent")
plot = plot.map(sns.violinplot, "decade", "log_gdp")

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
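
A violin plot draws a smoothed density for each group, so a rough numerical counterpart is a table of per-group quantiles. Here is a minimal sketch on synthetic values (the decades and numbers are invented for illustration, not drawn from gapminder).

```python
import numpy as np
import pandas as pd

# Synthetic log-GDP values for three decades, with a slowly rising mean.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "decade": np.repeat([1950, 1960, 1970], 50),
    "log_gdp": np.concatenate([
        rng.normal(7.0, 1.0, 50),
        rng.normal(7.5, 1.0, 50),
        rng.normal(8.0, 1.0, 50),
    ]),
})

# Quartiles per decade: one row per decade, one column per quantile level.
summary = (
    toy.groupby("decade")["log_gdp"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```

The table loses the shape information a violin shows, such as multiple modes, but it is a quick way to check the shifts the plot suggests.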