Tidying Case Study: Billboard Data

The billboard data includes the rankings for songs over time.

In [264]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

billboard = pd.read_csv("https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv", encoding="ISO-8859-1") # <-- I literally copied the "encoding" part from stackoverflow

Right now, weeks are stored as different variables, and the individual songs are observations. Let's instead consider the song's appearance each week as a separate observation. This takes the week_number from being an implicit to an explicit variable.

In [265]:
billboard = pd.melt(
    billboard,
    id_vars=billboard.columns[:7],
    var_name="week",
    value_name="rank"
  )
billboard
Out[265]:
year artist.inverted track time genre date.entered date.peaked week rank
0 2000 Destiny's Child Independent Women Part I 3:38 Rock 2000-09-23 2000-11-18 x1st.week 78.0
1 2000 Santana Maria, Maria 4:18 Rock 2000-02-12 2000-04-08 x1st.week 15.0
2 2000 Savage Garden I Knew I Loved You 4:07 Rock 1999-10-23 2000-01-29 x1st.week 71.0
3 2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16 x1st.week 41.0
4 2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38 Rock 2000-08-05 2000-10-14 x1st.week 57.0
... ... ... ... ... ... ... ... ... ...
24087 2000 Ghostface Killah Cherchez LaGhost 3:04 R&B 2000-08-05 2000-08-05 x76th.week NaN
24088 2000 Smith, Will Freakin' It 3:58 Rap 2000-02-12 2000-02-12 x76th.week NaN
24089 2000 Zombie Nation Kernkraft 400 3:30 Rock 2000-09-02 2000-09-02 x76th.week NaN
24090 2000 Eastsidaz, The Got Beef 3:58 Rap 2000-07-01 2000-07-01 x76th.week NaN
24091 2000 Fragma Toca's Miracle 3:22 R&B 2000-10-28 2000-10-28 x76th.week NaN

24092 rows × 9 columns

Let's fix the types. The week should be an integer, and the dates should be dates.

In [221]:
billboard.dtypes
Out[221]:
year                 int64
artist.inverted     object
track               object
time                object
genre               object
date.entered        object
date.peaked         object
week                object
rank               float64
dtype: object
In [222]:
def parse_week(x):
    x = x.str.extract(r"([0-9]+)")
    return pd.to_numeric(x[0])

def minutes(time):
    parsed_time = pd.to_datetime(billboard["time"], format="%M:%S")
    ref_time = pd.to_datetime("00:00", format="%M:%S")
    return (parsed_time - ref_time).dt.total_seconds() / 60

billboard = (billboard.assign(
      week=lambda df: parse_week(df["week"]),
      entered=lambda df: pd.to_datetime(df["date.entered"]),
      peaked=lambda df: pd.to_datetime(df["date.peaked"]),
      minutes=lambda df: minutes(billboard["time"])
    ).dropna()
    .assign(
      rank=lambda df: df["rank"].astype(int),
      inverse_rank=lambda df: 100 - df["rank"]
    ))

billboard.dtypes
Out[222]:
year                        int64
artist.inverted            object
track                      object
time                       object
genre                      object
date.entered               object
date.peaked                object
week                        int64
rank                        int64
entered            datetime64[ns]
peaked             datetime64[ns]
minutes                   float64
inverse_rank                int64
dtype: object
In [223]:
billboard = billboard.sort_values(["year", "artist.inverted", "track", "week"])
billboard
Out[223]:
year artist.inverted track time genre date.entered date.peaked week rank entered peaked minutes inverse_rank
246 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 1 87 2000-02-26 2000-03-11 4.366667 13
563 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 2 82 2000-02-26 2000-03-11 4.366667 18
880 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 3 72 2000-02-26 2000-03-11 4.366667 28
1197 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 4 77 2000-02-26 2000-03-11 4.366667 23
1514 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 5 87 2000-02-26 2000-03-11 4.366667 13
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10793 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 35 33 2000-04-29 2000-07-22 4.200000 67
11110 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 36 37 2000-04-29 2000-07-22 4.200000 63
11427 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 37 38 2000-04-29 2000-07-22 4.200000 62
11744 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 38 38 2000-04-29 2000-07-22 4.200000 62
12061 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 39 48 2000-04-29 2000-07-22 4.200000 52

5307 rows × 13 columns

This will be much easier to make timepoint-level plots with. For example, here is how the rank of songs varies after they enter the billboard list, colored in by

In [232]:
plot = sns.relplot(
    x="week",
    y="inverse_rank",
    units="track",
    col="genre",
    kind="line",
    hue="minutes",
    hue_norm=plt.Normalize(1, 5),
    alpha=0.6,
    estimator=None,
    col_wrap=3,
    data=billboard
)

plot
Out[232]:
<seaborn.axisgrid.FacetGrid at 0x144f4f9e8>

The reason this was not too complicated to make was that we could refer directly to a week variable in the x-axis. This variable didn't exist before we tided the data!

A final consideration. Notice how the data are sort of redundant: We repeat the same information about the songs, each week they appear on the billboard.

In [162]:
billboard
Out[162]:
year artist.inverted track time genre date.entered date.peaked week rank entered peaked inverse_rank
246 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 1 87 2000-02-26 2000-03-11 13
563 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 2 82 2000-02-26 2000-03-11 18
880 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 3 72 2000-02-26 2000-03-11 28
1197 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 4 77 2000-02-26 2000-03-11 23
1514 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2000-02-26 2000-03-11 5 87 2000-02-26 2000-03-11 13
... ... ... ... ... ... ... ... ... ... ... ... ...
10793 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 35 33 2000-04-29 2000-07-22 67
11110 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 36 37 2000-04-29 2000-07-22 63
11427 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 37 38 2000-04-29 2000-07-22 62
11744 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 38 38 2000-04-29 2000-07-22 62
12061 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22 39 48 2000-04-29 2000-07-22 52

5307 rows × 12 columns

This is another instance of the "sample ambiguity" problem. One way to resolve this is to split the data into two tables (which can be joined as needed),

  • A songs table, giving attributes about each song
  • A ranks table, providing the rankings of different songs across weeks.

First, let's make the songs table. The idea is to select the variables about songs that are constant across all weeks.

In [259]:
song_cols = billboard.columns[:7]
songs = (billboard[song_cols].drop_duplicates()
          .sort_values(["genre", "artist.inverted", "track"])
          .assign(song_id=lambda df: df.reset_index().index))
songs.set_index("song_id").reset_index() # moves song_id to the front
Out[259]:
song_id year artist.inverted track time genre date.entered date.peaked
0 0 2000 Adkins, Trace More 3:05 Country 2000-04-29 2000-06-17
1 1 2000 Allan, Gary Smoke Rings In The Dark 4:18 Country 2000-01-22 2000-02-05
2 2 2000 Black, Clint Been There 5:28 Country 2000-02-19 2000-04-15
3 3 2000 Brock, Chad A Country Boy Can Survive 3:54 Country 2000-01-01 2000-01-08
4 4 2000 Brock, Chad Yes! 3:22 Country 2000-04-08 2000-07-01
... ... ... ... ... ... ... ... ...
312 312 2000 Wallflowers, The Sleepwalker 3:29 Rock 2000-10-28 2000-10-28
313 313 2000 Westlife Swear It Again 4:07 Rock 2000-04-01 2000-07-01
314 314 2000 Williams, Robbie Angels 3:56 Rock 1999-11-20 2000-01-22
315 315 2000 Zombie Nation Kernkraft 400 3:30 Rock 2000-09-02 2000-09-02
316 316 2000 matchbox twenty Bent 4:12 Rock 2000-04-29 2000-07-22

317 rows × 8 columns

Now, we can create a table that tracks the ranks for songs over weeks.

In [262]:
join_cols = ["year", "artist.inverted", "track"]
ranks = pd.merge(billboard, songs, on=join_cols)
ranks[["song_id", "week", "rank"]]
Out[262]:
song_id week rank
0 121 1 87
1 121 2 82
2 121 3 72
3 121 4 77
4 121 5 87
... ... ... ...
5302 316 35 33
5303 316 36 37
5304 316 37 38
5305 316 38 38
5306 316 39 48

5307 rows × 3 columns