Statistically Significant

Andrew Landgraf's Blog

Downloading and Analyzing CD1025’s Playlist


CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining radio stations that are independently owned, and they take great pride in it. For data nerds like me, they also put a real-time list of recently played songs on their website. The page shows the 50 most recently played songs, but you can also click on “Older Tracks” to go back in time. When you do this, the URL ends in “now-playing/?start=50”. If you go back again, it says “now-playing/?start=100”.

Using this structure, I decided to see if I could download all of their historical data and find out how far back it goes. In the code below, I use the XML package to go to the website and download the 50 songs on the page, then increment the number in the URL by 50 to get the previous 50 songs. The code keeps doing this until it reaches January 1, 2012.
library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)

startNum = 0
while (TRUE) {
  # grab the page of 50 songs starting at startNum
  theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=",
                   startNum)
  table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
  if (startNum == 0) {
    playlist = table[, -1]
  } else {
    playlist = rbind(playlist, table[, -1])
  }
  # the date is the last 10 characters of the Last.Played column
  dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
  print(dt)
  if (dt < mdy("1/1/12")) {
    break
  }
  startNum = startNum + 50
}

playlist = unique(playlist) # Remove Dupes

write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
This takes a while, and the result is fairly large; my file has over 150,000 songs. If you want just a little data, change the cutoff date to last week or so. The first thing I will do is parse the dates and times of the songs, order the playlist by them, and look at the first few songs. You can see that the data only goes back to March of 2012.
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)
##                     Artist                Song       Last.Played
## 151638 DEATH CAB FOR CUTIE   YOU ARE A TOURIST 12:34am03/01/2012
## 151637       SLEEPER AGENT          GET BURNED 12:38am03/01/2012
## 151636          WASHED OUT           AMOR FATI 12:41am03/01/2012
## 151635            COLDPLAY       CHARLIE BROWN 12:45am03/01/2012
## 151634           GROUPLOVE         TONGUE TIED 12:49am03/01/2012
## 151633               SUGAR YOUR FAVORITE THING 12:52am03/01/2012
##             Month        Day   Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S
Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.
sqldf("Select Artist, Count(Artist) as PlayCount
From playlist
Group By Artist
Order by PlayCount DESC
Limit 10")
##                   Artist PlayCount
##  1     SILVERSUN PICKUPS      2340
##  2        THE BLACK KEYS      2203
##  3                  MUSE      1988
##  4             THE SHINS      1885
##  5   OF MONSTERS AND MEN      1753
##  6           PASSION PIT      1552
##  7             GROUPLOVE      1544
##  8 RED HOT CHILI PEPPERS      1514
##  9                METRIC      1495
## 10          ATLAS GENIUS      1494

sqldf("Select Artist, Song, Count(Song) as PlayCount
From playlist
Group By Artist, Song
Order by PlayCount DESC
Limit 10")
##                 Artist                    Song PlayCount
##  1         PASSION PIT             TAKE A WALK       828
##  2   SILVERSUN PICKUPS                PIT, THE       825
##  3        ATLAS GENIUS                 TROJANS       819
##  4       WALK THE MOON                ANNA SUN       742
##  5      THE BLACK KEYS LITTLE BLACK SUBMARINES       736
##  6         DIVINE FITS  WOULD THAT NOT BE NICE       731
##  7       THE LUMINEERS                  HO HEY       722
##  8      CAPITAL CITIES          SAFE AND SOUND       712
##  9 OF MONSTERS AND MEN          MOUNTAIN SOUND       711
## 10               ALT J            BREEZEBLOCKS       691
I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently and The Shins and Grouplove have lost some steam.
artist.month = sqldf("Select Month, Artist, Count(Song) as Num
                      From playlist
                      Group By Month, Artist
                      Order by Month, Artist")
artist = sqldf("Select Artist, Count(Artist) as Num
                From playlist
                Group By Artist
                Order by Num DESC")
p = ggplot(subset(artist.month, Artist %in% head(artist$Artist, 8)), aes(Month, Num))
p + geom_bar(stat = "identity", aes(fill = Artist), position = "fill", colour = "grey") +
  labs(y = "Percentage of Plays")

Looking at the play counts of the top artists, I see some odd numbers in June and July of 2012: the number of plays went way down.
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")


Looking into this further, I plotted the date and time that the song was played by the cumulative number of songs played since the beginning of the list. The plot should be a line with a constant slope, meaning that the plays per day are relatively constant. You can see in June and July of 2012 there are flat spots where there is no playlist history.
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")

There are also smaller flat spots in September and December, but I will assume those are small enough not to affect any further analyses. From here on, I will use only the data from August 2012 to the present.
playlist = subset(playlist, Day >= mdy("8/1/12"))
Next up, I am going to use this data to analyze the plays of artists from Summerfest and try to infer whether their play counts changed once they were added to the bill.

What Is the Probability of a 16 Seed Beating a 1 Seed?


Note: I started this post way back when the NCAA men’s basketball tournament was going on, but didn’t finish it until now.

Since the NCAA Men’s Basketball Tournament moved to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an event must therefore be 0. But we know better than that.

In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases. I was able to download tournament data going back to the 1930s and up through 2012 from hoopstournament.net/Database.html. The tournament expanded to 64 teams in 1985, so that is where the data for this post begins. I only used matchups in which one of the seeds was higher than the other, because this was the easiest way to remove duplicates. (The database lists each game twice: once with the winner as the first team and once with the loser as the first team. The vast majority of games, 98.9%, had one team seeded higher than the other, since equal seeds can only meet at the Final Four or later.)

library(ggplot2); theme_set(theme_bw())
brackets = read.csv("NCAAHistory.csv")

# use only data from 1985 on in which the first team has the higher seed
brackets = subset(brackets, Seed < Opponent.Seed & Year >= 1985 & Round != "Opening Round")

brackets$SeedDiff = abs(brackets$Opponent.Seed - brackets$Seed)
brackets$HigherSeedWon = ifelse(brackets$Opponent.Seed > brackets$Seed, brackets$Wins, brackets$Losses)
brackets$HigherSeedScoreDiff = ifelse(brackets$Opponent.Seed > brackets$Seed, 1, -1) * (brackets$Score - brackets$Opponent.Score)

Use Frequencies

The first way is the simplest: look at the historical record when a 16 seed plays a 1 seed (a seed difference of 15). As you can see from the plot below, when the seed difference is 15, the higher-seeded team has won every time. The same is true when the seed difference is 12, although there have only been 4 games in that scenario. Another oddity is that when the seed difference is 10, the higher seed has won only 50% of the time. Again, this is largely because there have only been 6 games with that seed difference.
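Before reading too much into those percentages, it helps to see the sample size behind each one; a one-line check on the data frame built above:

# number of tournament games at each seed difference
table(brackets$SeedDiff)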

seed.diffs = sort(unique(brackets$SeedDiff))
win.pct = sapply(seed.diffs, function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff == x]))
ggplot(data = data.frame(seed.diffs, win.pct), aes(seed.diffs, win.pct)) + geom_point() +
  geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line() + labs(x = "Seed Difference", y = "Proportion of Games Won by Higher Seed")


Use Score Difference

In many applications, it has been shown that using the margin of victory is much more reliable than using just wins and losses. For example, computer rankings of college football teams are more accurate when they use score differences, but doing so is outlawed for fear that teams would run up the score on weaker opponents. So the computer rankings are not as strong as they could be.

We have no such conflict of interest, so we should try to make use of any information available. A simple way to do that is to look at the mean and standard deviation of the margin of victory at each seed difference. Below is a plot of the mean score difference with a ribbon spanning +/- 2 standard deviations.

seed.diffs = sort(unique(brackets$SeedDiff))
means = sapply(seed.diffs, function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff == x]))
sds = sapply(seed.diffs, function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff == x]))
ggplot(data = data.frame(seed.diffs, means, sds), aes(seed.diffs, means)) +
  geom_ribbon(aes(ymin = means - 2 * sds, ymax = means + 2 * sds), alpha = .5) + geom_point() + geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Seed Difference", y = "Margin of Victory by Higher Seed")


You can see that the ribbon includes zero for every seed difference except 15. If we assume that the score differences are roughly normal, we can calculate the probability that the score difference will be greater than 0. The results are largely the same as before, but now there are no 100% estimates. Also, the 50% win percentage for a seed difference of 10 now looks a little more reasonable, albeit still out of line with the rest.
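For the 16-versus-1 game specifically, that probability is a single pnorm call on the summaries computed above:

# P(higher seed wins) at a seed difference of 15, assuming normal score margins
1 - pnorm(0, means[seed.diffs == 15], sds[seed.diffs == 15])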

ggplot(data = data.frame(seed.diffs, means, sds), aes(seed.diffs, 1 - pnorm(0, means, sds))) +
  geom_point() + geom_line() + geom_hline(yintercept = 0.5, linetype = 2) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning Based on Margin of Victory")

Model Win Percentage as a Function of Seed Difference

It is always good to incorporate as much knowledge as possible into an analysis. In this case, we have information on games other than the 16-versus-1 matchups that can help us estimate that game. For example, it is reasonable to assume that the larger the difference in seeds, the more likely the higher seed is to win. We can build a logistic regression model that looks at the outcomes of all of the games and predicts the probability of winning from the difference in seeds. When the two teams have the same seed, I force the probability of the higher seed winning to be 0.5 by setting the intercept to 0.

In the plot below, you can see that the logistic model predicts a probability of winning that increases throughout, reaching about 90% for 16 versus 1. I also included a non-linear generalized additive model (GAM) for comparison. The GAM believes that being a big favorite (16 vs. 1 or 15 vs. 2) gives a little extra boost in win probability. An advantage of modeling is that we can make predictions for matchups that have never occurred (like a seed difference of 14).

ggplot(data = brackets, aes(SeedDiff, HigherSeedWon)) +
  stat_smooth(method = "gam", family = "binomial", se = F, formula = y ~ 0 + x, aes(colour = "Logistic"), size = 1) +
  stat_smooth(method = "gam", family = "binomial", se = F, formula = y ~ s(x), aes(colour = "GAM"), size = 1) +
  geom_jitter(alpha = .15, position = position_jitter(height = .025, width = .25)) +
  labs(x = "Seed Difference", y = "Game Won by Higher Seed", colour = "Model")

Model Score Difference as a Function of Seed Difference

We can also do the same thing with the margin of victory. Here, I constrain the linear model to have an intercept of 0, meaning that two teams with the same seed should be evenly matched. Again, I included the GAM fit for comparison. The interpretation is similar to before, in that there seems to be an extra increase in the margin of victory for heavily favored teams.

ggplot(data = brackets, aes(SeedDiff, HigherSeedScoreDiff)) +
  stat_smooth(method = "lm", se = F, formula = y ~ 0 + x, aes(colour = "Linear"), size = 1) +
  stat_smooth(method = "gam", se = F, formula = y ~ s(x), aes(colour = "GAM"), size = 1) +
  geom_jitter(alpha = .25, position = position_jitter(height = 0, width = .25)) +
  labs(x = "Seed Difference", y = "Margin of Victory by Higher Seed", colour = "Model")

From these models of margin of victory we can infer the probability of the higher seed winning (again, assuming normality).

library(gam)
lm.seed = lm(HigherSeedScoreDiff ~ 0 + SeedDiff, data = brackets)
gam.seed = gam(HigherSeedScoreDiff ~ s(SeedDiff), data = brackets)

pred.lm.seed = predict(lm.seed, data.frame(SeedDiff = 0:15), se.fit = TRUE)
pred.gam.seed = predict(gam.seed, data.frame(SeedDiff = 0:15), se.fit = TRUE)
# residual standard errors, used in the normal approximation below
se.lm = sqrt(mean(lm.seed$residuals^2))
se.gam = sqrt(mean(gam.seed$residuals^2))

df1 = data.frame(SeedDiff = 0:15,
                 ProbLM = 1 - pnorm(0, pred.lm.seed$fit, sqrt(se.lm^2 + pred.lm.seed$se.fit^2)),
                 ProbGAM = 1 - pnorm(0, pred.gam.seed$fit, sqrt(se.gam^2 + pred.gam.seed$se.fit^2)))
ggplot(df1) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line(aes(SeedDiff, ProbLM, colour = "Linear"), size = 1) +
  geom_line(aes(SeedDiff, ProbGAM, colour = "GAM"), size = 1) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning", colour = "Model")


Summary

Putting all of the estimates together, you can easily spot the differences between the models. The two estimates that used only the data between specific seeds look pretty similar; of those, the one based on score differential seems a little more reasonable. The two GAMs have a similar trend, as do the linear and logistic models. If someone asks you the probability of a 16 seed beating a 1 seed, you now have at least 6 different answers.

This post highlights the many different ways someone can analyze the same data. Simply Statistics talked a bit about this in a recent podcast. In this case, the differences are not huge, but they are noticeable. So the next time you read about an analysis that someone did, keep in mind all the decisions they had to make and how sensitive the results might be to them.

logit.seed = glm(HigherSeedWon ~ 0 + SeedDiff, data = brackets, family = binomial(logit))
logit.seed.gam = gam(HigherSeedWon ~ s(SeedDiff), data = brackets, family = binomial(logit))

df2 = data.frame(SeedDiff = 0:15,
                 ProbLM = 1 - pnorm(0, pred.lm.seed$fit, sqrt(se.lm^2 + pred.lm.seed$se.fit^2)),
                 ProbGAM = 1 - pnorm(0, pred.gam.seed$fit, sqrt(se.gam^2 + pred.gam.seed$se.fit^2)),
                 ProbLogit = predict(logit.seed, data.frame(SeedDiff = 0:15), type = "response"),
                 ProbLogitGAM = predict(logit.seed.gam, data.frame(SeedDiff = 0:15), type = "response"))
df2 = merge(df2, data.frame(SeedDiff = seed.diffs, ProbFreq = win.pct), all.x = T)
df2 = merge(df2, data.frame(SeedDiff = seed.diffs, ProbScore = 1 - pnorm(0, means, sds)), all.x = T)
ggplot(df2, aes(SeedDiff)) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line(aes(y = ProbLM, colour = "Linear"), size = 1) +
  geom_line(aes(y = ProbGAM, colour = "GAM"), size = 1) +
  geom_line(aes(y = ProbLogit, colour = "Logistic"), size = 1) +
  geom_line(aes(y = ProbLogitGAM, colour = "Logistic GAM"), size = 1) +
  geom_line(aes(y = ProbFreq, colour = "Frequencies"), size = 1) +
  geom_line(aes(y = ProbScore, colour = "Score Diff"), size = 1) +
  geom_point(aes(y = ProbFreq, colour = "Frequencies"), size = 3) +
  geom_point(aes(y = ProbScore, colour = "Score Diff"), size = 3) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning", colour = "Model")

ggplot(df2) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_point(aes(x = SeedDiff, y = ProbFreq, colour = "Frequencies"), size = 1)
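To answer the title question directly, the 6 estimates for the 16-versus-1 game can be pulled straight out of the df2 data frame assembled above:

# one row, six answers: estimated P(higher seed wins) at a seed difference of 15
subset(df2, SeedDiff == 15)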


Note that the GAM functions did not have an easy way to restrict the win probability to be exactly 0.5 when the seed difference is 0. That is why you may notice the GAM models sit a bit above 0.5 at 0.
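A quick way to see the size of that artifact is to evaluate the unconstrained logistic GAM at a seed difference of zero, using the logit.seed.gam fit above:

# predicted win probability at SeedDiff = 0; the amount above 0.5 is the artifact
predict(logit.seed.gam, data.frame(SeedDiff = 0), type = "response")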

Easily Access Academic Journals Off Campus With a Firefox Bookmark


As a grad student, I do lots of searches for research related to my own. When I am off campus, a lot of the relevant results are not open access. In that case, I have to log onto my school’s library website and search for the journal or article there, which is quite a hassle. Luckily, I recently noticed that the URL is predictably modified after I log in through the library. I go to Ohio State University, and before and after logging in the URLs will be, for example, http://www.jstor.org/… and http://www.jstor.org.proxy.lib.ohio-state.edu/…. It doesn’t matter which journal or website I am looking at: if OSU has access, it will show up with .proxy.lib.ohio-state.edu after the .com/.org/etc.

So I was able to save myself a few steps by typing that in instead of going through the library’s website. I should note that typing this in brings you to an OSU login screen, so only those with valid access can use this shortcut, but I assume other schools have a similar setup.

This is still more busywork than I would like, so I searched for ways to do it automatically. Lifehacker has some address bar shortcuts, but nothing that quite works. I came across a Firefox add-on called URL Swap that seemed like it could do the trick. However, while setting it up, I happened to find this post on Stack Overflow. The second answer has exactly what I was looking for, changing .ExtraPart to .proxy.lib.ohio-state.edu.

I created a bookmark on my bookmarks toolbar with the following JavaScript:
javascript:(function(){ var curloc = window.location.toString(); var newloc = curloc.substring(0, curloc.indexOf("/", 8)) + ".proxy.lib.ohio-state.edu" + curloc.substring(curloc.indexOf("/", 8)); location.href=newloc; })()

and it works perfectly. This trick is so useful I wanted to share it to save others some time.

Copying Data From Excel to R and Back


A lot of times we are given a data set in Excel format and we want to run a quick analysis using R’s functionality to look at advanced statistics or make better visualizations. There are packages for importing/exporting data from/to Excel, but I have found them to be hard to work with or to only support old versions of Excel (*.xls, not *.xlsx). So for a one-time analysis, I usually save the file as a CSV and import it into R.

This can be a little burdensome if you are trying to do something quick, and it creates a file that needs to be cleaned up later. An easier option is to copy and paste the data directly into R. This can be done by using “clipboard” as the file and specifying that it is tab delimited, since that is how Excel’s clipboard stores the data.

For example, say you have a table in Excel that you want to copy into R. First, copy it in Excel.

Then go into R and use this function.

read.excel <- function(header = TRUE, ...) {
  read.table("clipboard", sep = "\t", header = header, ...)
}

dat = read.excel()

This function specifies that you are reading data from the clipboard, that it is tab delimited, and that it has a header.

Similarly, you can copy from R to Excel using the same logic. Here I also make row.names=FALSE the default, since I rarely have meaningful row names and they mess up the header alignment.

write.excel <- function(x, row.names = FALSE, col.names = TRUE, ...) {
  write.table(x, "clipboard", sep = "\t", row.names = row.names, col.names = col.names, ...)
}

write.excel(dat)

These functions can be added to your .Rprofile so that they are always ready for a quick analysis!
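If you are not sure where that file lives, one way to open (or create) it is from within R; ~/.Rprofile is the usual location, though your setup may vary:

# open the R startup file for editing, then paste the two functions in
file.edit("~/.Rprofile")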

Obviously, this technique does not encourage reproducible research. It is meant for quick, ad hoc analysis and plotting, not for an analysis that needs to be done on a regular basis.

Comments

Justin Tapp
I'm using this function in R for Windows. When I view the data numerically, it's fine. But when I go to plot the data I get nonsense. How to describe it…the window has 0 to 250 on the x-axis (regardless of my data series) and white boxes across the middle.
Marek Sz
I created similar functions and have a few tips. For reading from Excel, the following settings can be useful:
na.strings = "" # prevents the string "NA" from being read as a missing value
comment.char = "" # so you don't lose everything after a # sign
quote = "" # otherwise ' or " characters can mess with the data
check.names = FALSE # if you want the column names exactly as in Excel (spaces, special characters, etc.). You then need to use `column name` backticks to reference such columns in R.

For writing, na = "" replaces missing values with an empty string rather than the default "NA".

Second, you can increase the size of the clipboard by using, e.g., "clipboard-10240" instead of "clipboard" (the size is in KB, so that is around 10 MB; see the help for connections, section Clipboard), which allows you to copy and paste larger tables.
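(As a sketch of that tip, with big.table standing in for whatever large data frame you are copying:)

# write through a roughly 10 MB clipboard buffer; the suffix is in KB, Windows only
write.table(big.table, "clipboard-10240", sep = "\t", row.names = FALSE)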
Anonymous
rkward (http://rkward.sourceforge.net/) has a very nifty feature (Edit -> Paste Special…) that allows you to paste the copied data directly into your R source code, already formatted as a single string, vector, or matrix.
William Yarberry
Excellent article. Real people are always busy, and this is just the kind of article that helps us all. I have been using scan(), but when you start entering a few thousand rows that way, it gets a bit slow.
Anonymous
scan() allows you to just paste…

y <- scan()
Anonymous
Thank you for these!!
Anonymous
Thank you for the tip, this will help me a lot.
Also, it seems that, libreOffice also uses clipboard to store copied things. This function also works for libreOffice
Anonymous
I can't resist: you could just use
read.delim("clipboard")

(The "clipboard" parameter is Windows-only for the foreseeable future.)

From "?read.delim":
read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
           fill = TRUE, comment.char = "", ...)
Anonymous
Thank you very much Tony for your quick answer (on a Sunday afternoon!!).

Ernesto
Tony Hirst
@Ernesto It seems that on a Mac, you can use pbpaste (http://stackoverflow.com/questions/9035674/r-function-to-copy-to-clipboard-on-mac-osx):

read.clipboard.mac <- function(header = TRUE, ...) {
  read.table(pipe("pbpaste"), sep = "\t", header = header, ...)
}
Anonymous
Thank you very much. Very useful. I work in both Windows and Mac environments. The trick you show seems to work only in Windows. Any idea what to do on a Mac? Thanks in advance,

Ernesto