Let's check out the gapminder data. We'll log-transform GDP per capita and population, since we care more about relative than absolute differences in these quantities.
import pandas as pd
import seaborn as sns
import numpy as np
gapminder = pd.read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv")
gapminder = gapminder.assign(
    log_gdp=lambda df: np.log(df["gdpPercap"]),
    log_pop=lambda df: np.log(df["pop"]),
    decade=lambda df: np.floor(df["year"] / 10) * 10  # e.g. 1987 -> 1980.0
)
gapminder
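To see why the log scale captures relative differences, here is a minimal sketch using only the standard library. The GDP figures are made-up illustrative numbers, not values from the gapminder data, and `log_gap` is a hypothetical helper name.

```python
import math

def log_gap(a, b):
    """Difference between two values on the log scale: log(b) - log(a) = log(b / a)."""
    return math.log(b) - math.log(a)

# Hypothetical GDP-per-capita pairs: each pair is a doubling,
# but the absolute gaps differ by a factor of 40.
gap_small = log_gap(500, 1000)      # absolute gap: 500
gap_large = log_gap(20000, 40000)   # absolute gap: 20000

print(gap_small, gap_large)  # both equal log(2)
```

On the log scale a doubling always covers the same distance, whatever the starting value, which is why poorer and richer countries become comparable on one axis.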
Let's focus on just one year (2002), to make these initial plots a bit easier to understand.
gapminder_sub = gapminder.loc[gapminder["year"] == 2002]
gapminder_sub
One question we can ask is, how is GDP related to life expectancy?
sns.scatterplot(x="log_gdp", y="lifeExp", data=gapminder_sub)
How is population related? One answer is to encode population with the size of the points.
sns.scatterplot(x="log_gdp", y="lifeExp", size="log_pop", data=gapminder_sub)
Each point here is a country. On what continents are these countries?
sns.scatterplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    data=gapminder_sub
)
It can sometimes be easier to understand variation across groups by explicitly separating them into different panels.
sns.relplot(
    x="log_gdp",
    y="lifeExp",
    size="log_pop",
    hue="continent",
    col="continent",
    col_wrap=3,
    data=gapminder_sub
)
Let's plot a categorical variable (country) against a continuous one (log population). Sorting the countries by population first gives the categories a meaningful order.
gapminder_sub = gapminder_sub.sort_values("log_pop") # <-- sorts the countries
plot = sns.relplot(
    x="country",
    y="log_pop",
    col="continent",
    hue="continent",
    facet_kws={"sharex": False},
    data=gapminder_sub
)
# rotate the country names so they don't overlap
for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
Just for fun, how did populations change over time?
gapminder = gapminder.sort_values(["year", "log_pop"]) # <-- sorts the countries
plot = sns.relplot(
    x="country",
    y="log_pop",
    col="continent",
    hue="year",
    facet_kws={"sharex": False},
    data=gapminder
)
# rotate the country names so they don't overlap
for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
So far, we've been looking at the original datapoints, mostly filtered down to one year. Now that we're looking at all the timepoints, we see that it might be useful to reduce raw points to distributional summaries. Let's look at a histogram of the log GDP.
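# sns.distplot is deprecated since seaborn 0.11; histplot + rugplot draw the same figure
ax = sns.histplot(gapminder["log_gdp"], bins=40)
sns.rugplot(gapminder["log_gdp"], ax=ax)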
Histograms count how many samples fall into a bin along one axis. The corresponding plot in two dimensions is called a "hexbin" plot.
sns.jointplot(x="log_gdp", y="lifeExp", data=gapminder, kind="hex", joint_kws={"gridsize": 40})
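The counting step behind a one-dimensional histogram can be sketched in a few lines of plain Python. This is an illustration of equal-width binning under simple assumptions, not how seaborn implements it; `bin_counts` and the toy values are hypothetical.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: put everything in the first bin
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value would land one past the end, so clamp it into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

toy_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up data
toy_counts = bin_counts(toy_values, n_bins=5)
print(toy_counts)
```

A hexbin plot applies the same idea in two dimensions: the plane is tiled with hexagons and each hexagon is shaded by the number of points that fall inside it.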
How do things change over time in the different regions? A first attempt might be to make an array of histograms, with time (decade) along the rows and continents along the columns.
plot = sns.FacetGrid(
    gapminder,
    col="continent",
    hue="continent",
    row="decade",
    height=2,
    gridspec_kws={"hspace": 0.01, "wspace": 0.0}
)
# sns.distplot is deprecated since seaborn 0.11; histplot draws the same histogram
plot = plot.map(sns.histplot, "log_gdp", bins=20)
plot.set_titles("")
plot.set_axis_labels("")
This has some interesting information, but it is extremely unwieldy (the mostly empty column for Oceania is especially irritating). Let's see how a violin plot can communicate the same information much more succinctly.
plot = sns.FacetGrid(gapminder, col="continent", hue="continent")
plot = plot.map(sns.violinplot, "decade", "log_gdp")
for ax in plot.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(90)