The billboard data includes the rankings for songs over time.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
billboard = pd.read_csv("https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv", encoding="ISO-8859-1") # <-- I literally copied the "encoding" part from stackoverflow
Right now, weeks are stored as different variables, and the individual songs are
observations. Let's instead consider the song's appearance each week as a
separate observation. This takes the week_number
from being an implicit to an
explicit variable.
billboard = pd.melt(
billboard,
id_vars=billboard.columns[:7],
var_name="week",
value_name="rank"
)
billboard
Let's fix the types. The week should be an integer, and the dates should be dates.
billboard.dtypes
def parse_week(x):
x = x.str.extract(r"([0-9]+)")
return pd.to_numeric(x[0])
def minutes(time):
parsed_time = pd.to_datetime(billboard["time"], format="%M:%S")
ref_time = pd.to_datetime("00:00", format="%M:%S")
return (parsed_time - ref_time).dt.total_seconds() / 60
billboard = (billboard.assign(
week=lambda df: parse_week(df["week"]),
entered=lambda df: pd.to_datetime(df["date.entered"]),
peaked=lambda df: pd.to_datetime(df["date.peaked"]),
minutes=lambda df: minutes(billboard["time"])
).dropna()
.assign(
rank=lambda df: df["rank"].astype(int),
inverse_rank=lambda df: 100 - df["rank"]
))
billboard.dtypes
billboard = billboard.sort_values(["year", "artist.inverted", "track", "week"])
billboard
This will be much easier to make timepoint-level plots with. For example, here is how the rank of songs varies after they enter the billboard list, colored in by
plot = sns.relplot(
x="week",
y="inverse_rank",
units="track",
col="genre",
kind="line",
hue="minutes",
hue_norm=plt.Normalize(1, 5),
alpha=0.6,
estimator=None,
col_wrap=3,
data=billboard
)
plot
The reason this was not too complicated to make was that we could refer directly
to a week
variable in the x-axis. This variable didn't exist before we tided
the data!
A final consideration. Notice how the data are sort of redundant: We repeat the same information about the songs, each week they appear on the billboard.
billboard
This is another instance of the "sample ambiguity" problem. One way to resolve this is to split the data into two tables (which can be joined as needed),
songs
table, giving attributes about each songranks
table, providing the rankings of different songs across weeks.First, let's make the songs table. The idea is to select the variables about songs that are constant across all weeks.
song_cols = billboard.columns[:7]
songs = (billboard[song_cols].drop_duplicates()
.sort_values(["genre", "artist.inverted", "track"])
.assign(song_id=lambda df: df.reset_index().index))
songs.set_index("song_id").reset_index() # moves song_id to the front
Now, we can create a table that tracks the ranks for songs over weeks.
join_cols = ["year", "artist.inverted", "track"]
ranks = pd.merge(billboard, songs, on=join_cols)
ranks[["song_id", "week", "rank"]]