Intro to Visualization: Gapminder

Let's explore the gapminder data. We'll log-transform GDP per capita and population, since we care more about relative than absolute differences in these quantities.

import pandas as pd
import seaborn as sns
import numpy as np

gapminder = pd.read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv")
gapminder = gapminder.assign(
    log_gdp=lambda df: np.log(df["gdpPercap"]),
    log_pop=lambda df: np.log(df["pop"]),
    decade=lambda df: np.floor(df["year"] / 10) * 10
)
gapminder
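
To see why the log scale captures relative differences, here is a minimal sketch on a toy data frame. The values are hypothetical and are not taken from the gapminder file; only the column names and the two transformations mirror the cell above. Doubling GDP should add the same amount, log(2), to log_gdp whether the starting level is low or high, and the decade column should round each year down to the start of its decade.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the gapminder file.
toy = pd.DataFrame({
    "year": [1952, 1957, 1962, 2007],
    "gdpPercap": [500.0, 1000.0, 10000.0, 20000.0],
})
toy = toy.assign(
    log_gdp=lambda df: np.log(df["gdpPercap"]),
    decade=lambda df: np.floor(df["year"] / 10) * 10,
)

# Doubling GDP adds log(2) on the log scale, whatever the starting level.
step_low = toy["log_gdp"].iloc[1] - toy["log_gdp"].iloc[0]   # 500 -> 1000
step_high = toy["log_gdp"].iloc[3] - toy["log_gdp"].iloc[2]  # 10000 -> 20000
print(step_low, step_high, np.log(2))

# Each year is rounded down to the start of its decade.
print(toy["decade"].tolist())
```

On the raw scale the two steps differ by a factor of twenty (500 versus 10000), which is why the untransformed axis would be dominated by the richest countries.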

Let's focus on just one year (2002), to make these initial plots a bit easier to understand.

gapminder_sub = gapminder.loc[gapminder["year"] == 2002]
gapminder_sub

One question we can ask is, how is GDP related to life expectancy?

sns.scatterplot(x="log_gdp", y="lifeExp", data=gapminder_sub)


How is population related? One answer is to encode population with the size of the points.

sns.scatterplot(x="log_gdp", y="lifeExp", size="log_pop", data=gapminder_sub)


Each point here is a country. Which continent does each country belong to? We can encode continent with color.

sns.scatterplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    data=gapminder_sub
)


It can sometimes be easier to understand variation across groups by explicitly separating them into different panels.

sns.relplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    col="continent",
    col_wrap=3,
    data=gapminder_sub
)


Let's make a plot of a categorical variable (country) against a continuous one (population), ordering the countries by population.

gapminder_sub = gapminder_sub.sort_values("log_pop") # <-- sorts the countries
plot = sns.relplot(
    x="country",
    y="log_pop",
    col="continent",
    hue="continent",
    facet_kws={"sharex": False},
    data=gapminder_sub
)

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)

Just for fun, how did populations change over time?

gapminder = gapminder.sort_values(["year", "log_pop"]) # <-- sorts the countries
plot = sns.relplot(
    x="country",
    y="log_pop",
    col="continent",
    hue="year",
    facet_kws={"sharex": False},
    data=gapminder
)

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
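
The sort before each plot matters because the x-axis order comes from the data order. The sketch below uses a hypothetical three-country toy frame and assumes that the axis places country names in order of first appearance, which is how matplotlib treats string categories. Under that assumption, sorting by year and then log_pop orders the countries by their population in the earliest year, even if the ranking changes later.

```python
import pandas as pd

# Hypothetical toy data; the country names are placeholders.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "log_pop": [16.0, 14.0, 15.0, 15.5, 17.0, 16.5],
})

# Same sort as in the cell above: by year first, then by log population.
toy = toy.sort_values(["year", "log_pop"])

# Order in which each country name first appears in the sorted frame.
axis_order = toy["country"].drop_duplicates().tolist()
print(axis_order)  # ['B', 'C', 'A']
```

Country B ends up with the largest 2007 value in this toy example, yet it stays first on the axis because the 1952 rows come first.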

So far, we've been looking at the original datapoints, mostly filtered down to one year. Now that we're looking at all the timepoints, we see that it might be useful to reduce raw points to distributional summaries. Let's look at a histogram of the log GDP.

sns.distplot(gapminder["log_gdp"], rug=True, kde=False, bins=40)


Histograms count how many samples fall into each bin along one axis. The two-dimensional analogue, which counts samples falling into hexagonal cells of the plane, is called a "hexbin" plot.

sns.jointplot(x="log_gdp", y="lifeExp", data=gapminder, kind="hex", joint_kws={"gridsize": 40})
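
The counting behind both plots can be checked numerically. This is a rough sketch on synthetic data (the variables x and y are hypothetical stand-ins for log GDP and life expectancy, not the gapminder values). Note that np.histogram2d counts samples in rectangular cells, whereas a hexbin plot uses hexagonal ones; the counting principle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for log GDP (x) and life expectancy (y).
x = rng.normal(loc=8.0, scale=1.2, size=500)
y = 40 + 4 * (x - 5) + rng.normal(scale=5.0, size=500)

# One dimension: count the samples that fall in each interval of x.
counts_1d, edges = np.histogram(x, bins=40)

# Two dimensions: count the samples in each rectangular cell of (x, y).
counts_2d, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Every sample lands in exactly one bin, so both totals equal the sample size.
print(counts_1d.sum(), counts_2d.sum())
```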


How do things change over time in the different regions? A first attempt might be to make an array of histograms, with decades arranged along the rows and continents along the columns.

plot = sns.FacetGrid(gapminder, col="continent", hue="continent", row="decade", height=2, gridspec_kws={"hspace": 0.01, "wspace":0.0})
plot = plot.map(sns.distplot, "log_gdp", bins=20, kde=False)
plot.set_titles("")
plot.set_axis_labels("")


This has some interesting information, but is extremely unwieldy (the mostly empty column for Oceania is especially irritating). Let's see how a violin plot can communicate the same information much more succinctly.

plot = sns.FacetGrid(gapminder, col="continent", hue="continent")
plot = plot.map(sns.violinplot, "decade", "log_gdp")

for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
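
A violin plot is a picture of one distribution per group, and the same reduction can be done numerically. The sketch below is only an illustration on hypothetical toy data that reuses the notebook's column names. It computes the quartiles of log_gdp for every continent and decade, roughly what the small box inside each violin marks under seaborn's default settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "continent": np.repeat(["Africa", "Europe"], 60),
    "decade": np.tile(np.repeat([1950.0, 1960.0, 1970.0], 20), 2),
    "log_gdp": rng.normal(loc=8.0, scale=1.0, size=120),
})

# One row per (continent, decade) group, one column per quartile.
summary = (
    toy.groupby(["continent", "decade"])["log_gdp"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```

The resulting table has one row per violin, which makes it easy to compare medians and spreads across decades without reading them off the plot.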

The gapminder data frame after adding the log and decade columns (first and last five rows):

	country	year	pop	continent	lifeExp	gdpPercap	log_gdp	log_pop	decade
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314	6.658583	15.946754	1950.0
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030	6.710344	16.039154	1950.0
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710	6.748878	16.144454	1960.0
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138	6.728864	16.261154	1960.0
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106	6.606625	16.386554	1970.0
...	...	...	...	...	...	...	...	...	...
1699	Zimbabwe	1987	9216418.0	Africa	62.351	706.157306	6.559838	16.036497	1980.0
1700	Zimbabwe	1992	10704340.0	Africa	60.377	693.420786	6.541637	16.186160	1990.0
1701	Zimbabwe	1997	11404948.0	Africa	46.809	792.449960	6.675129	16.249558	1990.0
1702	Zimbabwe	2002	11926563.0	Africa	39.989	672.038623	6.510316	16.294279	2000.0
1703	Zimbabwe	2007	12311143.0	Africa	43.487	469.709298	6.152114	16.326015	2000.0

The 2002 subset, gapminder_sub (first and last five rows):

	country	year	pop	continent	lifeExp	gdpPercap	log_gdp	log_pop
10	Afghanistan	2002	25268405.0	Asia	42.129	726.734055	6.588561	17.045065
22	Albania	2002	3508512.0	Europe	75.651	4604.211737	8.434727	15.070703
34	Algeria	2002	31287142.0	Africa	70.994	5288.040382	8.573203	17.258718
46	Angola	2002	10866106.0	Africa	41.003	2773.287312	7.927789	16.201159
58	Argentina	2002	38331121.0	Americas	74.340	8797.640716	9.082239	17.461773
...	...	...	...	...	...	...	...	...
1654	Vietnam	2002	80908147.0	Asia	73.017	1764.456677	7.475598	18.208825
1666	West Bank and Gaza	2002	3389578.0	Asia	72.370	4515.487575	8.415268	15.036216
1678	Yemen Rep.	2002	18701257.0	Asia	60.308	2234.820827	7.711916	16.744101
1690	Zambia	2002	10595811.0	Africa	39.193	1071.613938	6.976921	16.175969
1702	Zimbabwe	2002	11926563.0	Africa	39.989	672.038623	6.510316	16.294279