Here's an example of an untidy dataset. The column headers store values of an
implicit variable, income, rather than variable names. This violates the
"variables are in columns" principle of tidy data.
import pandas as pd
pew = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/pew-raw.csv")
pew
We can fix this using the melt function in pandas. This function is
important. You will use it over and over for tidying.
tidy_pew = pd.melt(pew, id_vars=["religion"], var_name="income")
tidy_pew
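To see what melt does without downloading anything, here is a minimal sketch on a made-up table. The religion labels, income brackets, counts, and the column name "count" are illustrative assumptions, not values from the Pew file.

```python
import pandas as pd

# Hypothetical toy data (not the Pew dataset): income brackets sit in the column headers.
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Keep "religion" as an identifier; every other column header becomes a value
# of the new "income" column, and the cell contents go into "count".
long = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="count")
print(long)
```

Each row of the result is one (religion, income) observation, so the table grows from 2 rows to 4 while the number of columns shrinks to 3.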
If you ever want to go back to the earlier format, you can use pivot. This will
only rarely be the case though (e.g., you decide to run some specialized
algorithm that expects different income levels in columns).
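# Pass values= explicitly so the result has flat column names rather than a
# ("value", income) MultiIndex.
pivot_pew = (pd.pivot(tidy_pew, index="religion", columns="income", values="value")
             .reset_index())
pivot_pew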
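The round trip from wide to long and back can be sketched on a small made-up table. The data and the names "high", "low", and "count" are assumptions chosen for illustration. Note that pivot sorts the new columns alphabetically, so the original column order is not guaranteed to survive.

```python
import pandas as pd

# Hypothetical toy data: two income levels stored as column headers.
wide = pd.DataFrame({"religion": ["A", "B"], "high": [1, 2], "low": [3, 4]})

# Wide -> long.
long = wide.melt(id_vars=["religion"], var_name="income", value_name="count")

# Long -> wide again. reset_index turns "religion" back into a regular column,
# and rename_axis drops the leftover "income" label on the column index.
back = (
    long.pivot(index="religion", columns="income", values="count")
    .reset_index()
    .rename_axis(columns=None)
)
print(back)
```

pivot raises an error if any (religion, income) pair appears more than once, which is a useful check that the long table really has one row per observation.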