Tidy Data Intro

Here's an example of a not-tidy dataset. The columns are storing values of an implicit variable, income. This violates the "variables are in columns" principle of tidy data.

In [29]:
import pandas as pd
pew = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/pew-raw.csv")
pew
Out[29]:
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 76 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Dont know/refused 15 14 15 11 10 35
5 Evangelical Prot 575 869 1064 982 881 1486
6 Hindu 1 9 7 9 11 34
7 Historically Black Prot 228 244 236 238 197 223
8 Jehovahs Witness 20 27 24 24 21 30
9 Jewish 19 19 25 25 30 95

We can fix this using the melt function in pandas. This function is important. You will use it over and over for tidying.

In [15]:
tidy_pew = pd.melt(pew, id_vars=["religion"], var_name="income")
tidy_pew
Out[15]:
religion income value
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Dont know/refused <$10k 15
5 Evangelical Prot <$10k 575
6 Hindu <$10k 1
7 Historically Black Prot <$10k 228
8 Jehovahs Witness <$10k 20
9 Jewish <$10k 19
10 Agnostic $10-20k 34
11 Atheist $10-20k 27
12 Buddhist $10-20k 21
13 Catholic $10-20k 617
14 Dont know/refused $10-20k 14
15 Evangelical Prot $10-20k 869
16 Hindu $10-20k 9
17 Historically Black Prot $10-20k 244
18 Jehovahs Witness $10-20k 27
19 Jewish $10-20k 19
20 Agnostic $20-30k 60
21 Atheist $20-30k 37
22 Buddhist $20-30k 30
23 Catholic $20-30k 732
24 Dont know/refused $20-30k 15
25 Evangelical Prot $20-30k 1064
26 Hindu $20-30k 7
27 Historically Black Prot $20-30k 236
28 Jehovahs Witness $20-30k 24
29 Jewish $20-30k 25
30 Agnostic $30-40k 81
31 Atheist $30-40k 52
32 Buddhist $30-40k 34
33 Catholic $30-40k 670
34 Dont know/refused $30-40k 11
35 Evangelical Prot $30-40k 982
36 Hindu $30-40k 9
37 Historically Black Prot $30-40k 238
38 Jehovahs Witness $30-40k 24
39 Jewish $30-40k 25
40 Agnostic $40-50k 76
41 Atheist $40-50k 35
42 Buddhist $40-50k 33
43 Catholic $40-50k 638
44 Dont know/refused $40-50k 10
45 Evangelical Prot $40-50k 881
46 Hindu $40-50k 11
47 Historically Black Prot $40-50k 197
48 Jehovahs Witness $40-50k 21
49 Jewish $40-50k 30
50 Agnostic $50-75k 137
51 Atheist $50-75k 70
52 Buddhist $50-75k 58
53 Catholic $50-75k 1116
54 Dont know/refused $50-75k 35
55 Evangelical Prot $50-75k 1486
56 Hindu $50-75k 34
57 Historically Black Prot $50-75k 223
58 Jehovahs Witness $50-75k 30
59 Jewish $50-75k 95

If ever wanted to go back to the earlier format, you can use pivot. This will only rarely be the case though (e.g., you decide to run some specialized algorithm that expects different income levels in columns).

In [26]:
pivot_pew = (pd.pivot(tidy_pew, index="religion", columns="income")
               .reset_index())
pivot_pew
Out[26]:
religion value
income $10-20k $40-50k <$10k $20-30k $30-40k $50-75k
0 Agnostic 34 76 27 60 81 137
1 Atheist 27 35 12 37 52 70
2 Buddhist 21 33 27 30 34 58
3 Catholic 617 638 418 732 670 1116
4 Dont know/refused 14 10 15 15 11 35
5 Evangelical Prot 869 881 575 1064 982 1486
6 Hindu 9 11 1 7 9 34
7 Historically Black Prot 244 197 228 236 238 223
8 Jehovahs Witness 27 21 20 24 24 30
9 Jewish 19 30 19 25 25 95