Tidy Data Intro¶

Here's an example of a not-tidy dataset. The columns are storing values of an implicit variable, income. This violates the "variables are in columns" principle of tidy data.

In [29]:

import pandas as pd
pew = pd.read_csv("https://raw.githubusercontent.com/nickhould/tidy-data-python/master/data/pew-raw.csv")
pew

Out[29]:

	religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k
0	Agnostic	27	34	60	81	76	137
1	Atheist	12	27	37	52	35	70
2	Buddhist	27	21	30	34	33	58
3	Catholic	418	617	732	670	638	1116
4	Dont know/refused	15	14	15	11	10	35
5	Evangelical Prot	575	869	1064	982	881	1486
6	Hindu	1	9	7	9	11	34
7	Historically Black Prot	228	244	236	238	197	223
8	Jehovahs Witness	20	27	24	24	21	30
9	Jewish	19	19	25	25	30	95

We can fix this using the melt function in pandas. This function is important. You will use it over and over for tidying.

In [15]:

tidy_pew = pd.melt(pew, id_vars=["religion"], var_name="income")
tidy_pew

Out[15]:

	religion	income	value
0	Agnostic	<$10k	27
1	Atheist	<$10k	12
2	Buddhist	<$10k	27
3	Catholic	<$10k	418
4	Dont know/refused	<$10k	15
5	Evangelical Prot	<$10k	575
6	Hindu	<$10k	1
7	Historically Black Prot	<$10k	228
8	Jehovahs Witness	<$10k	20
9	Jewish	<$10k	19
10	Agnostic	$10-20k	34
11	Atheist	$10-20k	27
12	Buddhist	$10-20k	21
13	Catholic	$10-20k	617
14	Dont know/refused	$10-20k	14
15	Evangelical Prot	$10-20k	869
16	Hindu	$10-20k	9
17	Historically Black Prot	$10-20k	244
18	Jehovahs Witness	$10-20k	27
19	Jewish	$10-20k	19
20	Agnostic	$20-30k	60
21	Atheist	$20-30k	37
22	Buddhist	$20-30k	30
23	Catholic	$20-30k	732
24	Dont know/refused	$20-30k	15
25	Evangelical Prot	$20-30k	1064
26	Hindu	$20-30k	7
27	Historically Black Prot	$20-30k	236
28	Jehovahs Witness	$20-30k	24
29	Jewish	$20-30k	25
30	Agnostic	$30-40k	81
31	Atheist	$30-40k	52
32	Buddhist	$30-40k	34
33	Catholic	$30-40k	670
34	Dont know/refused	$30-40k	11
35	Evangelical Prot	$30-40k	982
36	Hindu	$30-40k	9
37	Historically Black Prot	$30-40k	238
38	Jehovahs Witness	$30-40k	24
39	Jewish	$30-40k	25
40	Agnostic	$40-50k	76
41	Atheist	$40-50k	35
42	Buddhist	$40-50k	33
43	Catholic	$40-50k	638
44	Dont know/refused	$40-50k	10
45	Evangelical Prot	$40-50k	881
46	Hindu	$40-50k	11
47	Historically Black Prot	$40-50k	197
48	Jehovahs Witness	$40-50k	21
49	Jewish	$40-50k	30
50	Agnostic	$50-75k	137
51	Atheist	$50-75k	70
52	Buddhist	$50-75k	58
53	Catholic	$50-75k	1116
54	Dont know/refused	$50-75k	35
55	Evangelical Prot	$50-75k	1486
56	Hindu	$50-75k	34
57	Historically Black Prot	$50-75k	223
58	Jehovahs Witness	$50-75k	30
59	Jewish	$50-75k	95

If ever wanted to go back to the earlier format, you can use pivot. This will only rarely be the case though (e.g., you decide to run some specialized algorithm that expects different income levels in columns).

In [26]:

pivot_pew = (pd.pivot(tidy_pew, index="religion", columns="income")
               .reset_index())
pivot_pew

Out[26]:

	religion	value
income		$10-20k	$40-50k	<$10k	$20-30k	$30-40k	$50-75k
0	Agnostic	34	76	27	60	81	137
1	Atheist	27	35	12	37	52	70
2	Buddhist	21	33	27	30	34	58
3	Catholic	617	638	418	732	670	1116
4	Dont know/refused	14	10	15	15	11	35
5	Evangelical Prot	869	881	575	1064	982	1486
6	Hindu	9	11	1	7	9	34
7	Historically Black Prot	244	197	228	236	238	223
8	Jehovahs Witness	27	21	20	24	24	30
9	Jewish	19	30	19	25	25	95