Tidying Case Study: Billboard Data¶

The billboard data includes the rankings for songs over time.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

billboard = pd.read_csv("https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv", encoding="ISO-8859-1") # <-- I literally copied the "encoding" part from stackoverflow

Right now, weeks are stored as different variables, and the individual songs are observations. Let's instead consider the song's appearance each week as a separate observation. This takes the week_number from being an implicit to an explicit variable.

billboard = pd.melt(
    billboard,
    id_vars=billboard.columns[:7],
    var_name="week",
    value_name="rank"
  )
billboard

Let's fix the types. The week should be an integer, and the dates should be dates.

billboard.dtypes

year                 int64
artist.inverted     object
track               object
time                object
genre               object
date.entered        object
date.peaked         object
week                object
rank               float64
dtype: object

def parse_week(x):
    x = x.str.extract(r"([0-9]+)")
    return pd.to_numeric(x[0])

def minutes(time):
    parsed_time = pd.to_datetime(billboard["time"], format="%M:%S")
    ref_time = pd.to_datetime("00:00", format="%M:%S")
    return (parsed_time - ref_time).dt.total_seconds() / 60

billboard = (billboard.assign(
      week=lambda df: parse_week(df["week"]),
      entered=lambda df: pd.to_datetime(df["date.entered"]),
      peaked=lambda df: pd.to_datetime(df["date.peaked"]),
      minutes=lambda df: minutes(billboard["time"])
    ).dropna()
    .assign(
      rank=lambda df: df["rank"].astype(int),
      inverse_rank=lambda df: 100 - df["rank"]
    ))

billboard.dtypes

year                        int64
artist.inverted            object
track                      object
time                       object
genre                      object
date.entered               object
date.peaked                object
week                        int64
rank                        int64
entered            datetime64[ns]
peaked             datetime64[ns]
minutes                   float64
inverse_rank                int64
dtype: object

billboard = billboard.sort_values(["year", "artist.inverted", "track", "week"])
billboard

This will be much easier to make timepoint-level plots with. For example, here is how the rank of songs varies after they enter the billboard list, colored in by

plot = sns.relplot(
    x="week",
    y="inverse_rank",
    units="track",
    col="genre",
    kind="line",
    hue="minutes",
    hue_norm=plt.Normalize(1, 5),
    alpha=0.6,
    estimator=None,
    col_wrap=3,
    data=billboard
)

plot

<seaborn.axisgrid.FacetGrid at 0x144f4f9e8>

The reason this was not too complicated to make was that we could refer directly to a week variable in the x-axis. This variable didn't exist before we tided the data!

A final consideration. Notice how the data are sort of redundant: We repeat the same information about the songs, each week they appear on the billboard.

billboard

This is another instance of the "sample ambiguity" problem. One way to resolve this is to split the data into two tables (which can be joined as needed),

A songs table, giving attributes about each song
A ranks table, providing the rankings of different songs across weeks.

First, let's make the songs table. The idea is to select the variables about songs that are constant across all weeks.

song_cols = billboard.columns[:7]
songs = (billboard[song_cols].drop_duplicates()
          .sort_values(["genre", "artist.inverted", "track"])
          .assign(song_id=lambda df: df.reset_index().index))
songs.set_index("song_id").reset_index() # moves song_id to the front

Now, we can create a table that tracks the ranks for songs over weeks.

join_cols = ["year", "artist.inverted", "track"]
ranks = pd.merge(billboard, songs, on=join_cols)
ranks[["song_id", "week", "rank"]]

	year	artist.inverted	track	time	genre	date.entered	date.peaked	week	rank
0	2000	Destiny's Child	Independent Women Part I	3:38	Rock	2000-09-23	2000-11-18	x1st.week	78.0
1	2000	Santana	Maria, Maria	4:18	Rock	2000-02-12	2000-04-08	x1st.week	15.0
2	2000	Savage Garden	I Knew I Loved You	4:07	Rock	1999-10-23	2000-01-29	x1st.week	71.0
3	2000	Madonna	Music	3:45	Rock	2000-08-12	2000-09-16	x1st.week	41.0
4	2000	Aguilera, Christina	Come On Over Baby (All I Want Is You)	3:38	Rock	2000-08-05	2000-10-14	x1st.week	57.0
...	...	...	...	...	...	...	...	...	...
24087	2000	Ghostface Killah	Cherchez LaGhost	3:04	R&B	2000-08-05	2000-08-05	x76th.week	NaN
24088	2000	Smith, Will	Freakin' It	3:58	Rap	2000-02-12	2000-02-12	x76th.week	NaN
24089	2000	Zombie Nation	Kernkraft 400	3:30	Rock	2000-09-02	2000-09-02	x76th.week	NaN
24090	2000	Eastsidaz, The	Got Beef	3:58	Rap	2000-07-01	2000-07-01	x76th.week	NaN
24091	2000	Fragma	Toca's Miracle	3:22	R&B	2000-10-28	2000-10-28	x76th.week	NaN

	year	artist.inverted	track	time	genre	date.entered	date.peaked	week	rank	entered	peaked	minutes	inverse_rank
246	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	1	87	2000-02-26	2000-03-11	4.366667	13
563	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	2	82	2000-02-26	2000-03-11	4.366667	18
880	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	3	72	2000-02-26	2000-03-11	4.366667	28
1197	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	4	77	2000-02-26	2000-03-11	4.366667	23
1514	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	5	87	2000-02-26	2000-03-11	4.366667	13
...	...	...	...	...	...	...	...	...	...	...	...	...	...
10793	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	35	33	2000-04-29	2000-07-22	4.200000	67
11110	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	36	37	2000-04-29	2000-07-22	4.200000	63
11427	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	37	38	2000-04-29	2000-07-22	4.200000	62
11744	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	38	38	2000-04-29	2000-07-22	4.200000	62
12061	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	39	48	2000-04-29	2000-07-22	4.200000	52

	year	artist.inverted	track	time	genre	date.entered	date.peaked	week	rank	entered	peaked	inverse_rank
246	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	1	87	2000-02-26	2000-03-11	13
563	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	2	82	2000-02-26	2000-03-11	18
880	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	3	72	2000-02-26	2000-03-11	28
1197	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	4	77	2000-02-26	2000-03-11	23
1514	2000	2 Pac	Baby Don't Cry (Keep Ya Head Up II)	4:22	Rap	2000-02-26	2000-03-11	5	87	2000-02-26	2000-03-11	13
...	...	...	...	...	...	...	...	...	...	...	...	...
10793	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	35	33	2000-04-29	2000-07-22	67
11110	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	36	37	2000-04-29	2000-07-22	63
11427	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	37	38	2000-04-29	2000-07-22	62
11744	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	38	38	2000-04-29	2000-07-22	62
12061	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22	39	48	2000-04-29	2000-07-22	52

	song_id	year	artist.inverted	track	time	genre	date.entered	date.peaked
0	0	2000	Adkins, Trace	More	3:05	Country	2000-04-29	2000-06-17
1	1	2000	Allan, Gary	Smoke Rings In The Dark	4:18	Country	2000-01-22	2000-02-05
2	2	2000	Black, Clint	Been There	5:28	Country	2000-02-19	2000-04-15
3	3	2000	Brock, Chad	A Country Boy Can Survive	3:54	Country	2000-01-01	2000-01-08
4	4	2000	Brock, Chad	Yes!	3:22	Country	2000-04-08	2000-07-01
...	...	...	...	...	...	...	...	...
312	312	2000	Wallflowers, The	Sleepwalker	3:29	Rock	2000-10-28	2000-10-28
313	313	2000	Westlife	Swear It Again	4:07	Rock	2000-04-01	2000-07-01
314	314	2000	Williams, Robbie	Angels	3:56	Rock	1999-11-20	2000-01-22
315	315	2000	Zombie Nation	Kernkraft 400	3:30	Rock	2000-09-02	2000-09-02
316	316	2000	matchbox twenty	Bent	4:12	Rock	2000-04-29	2000-07-22

	song_id	week	rank
0	121	1	87
1	121	2	82
2	121	3	72
3	121	4	77
4	121	5	87
...	...	...	...
5302	316	35	33
5303	316	36	37
5304	316	37	38
5305	316	38	38
5306	316	39	48