The model I will build is motivated by ordinal regression. If we were only concerned with the success rate in one of the categories, we could use standard logistic regression, and the probability that player $i$ successfully made a play would be modeled as $\sigma(\theta_i)$, where $\sigma(\cdot)$ is the logistic function. Using our prior knowledge that plays categorized as easy should have a higher success rate than plays categorized as difficult, I would like to generalize this.

Say there are only two categories: easy and hard. We could model the probability that player $i$ successfully made a hard play as $\sigma(\theta_i)$ and the probability that he made an easy play as $\sigma(\theta_i+\gamma)$. Here, we would assume that $\gamma$ is the same for all players. This assumption implies that if player $i$ is better than player $j$ at easy plays, he will also be better at hard plays. This is a reasonable assumption, but maybe not true in all cases.

Since we have five different categories of difficulty, we can generalize this with offsets $\gamma_k$, $k=1,\ldots,4$. Again, these $\gamma_k$'s would be the same for everyone. A picture of what this looks like for shortstops is below. In this model, every player effectively shifts the curve either left or right: a positive $\theta_i$ means the player is better than average and shifts the curve left, and vice versa for a negative $\theta_i$.
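As a toy illustration of this shifting-curve setup (the $\theta$ and $\gamma$ values below are made up for the example, not fitted values):

```r
# Toy illustration only: theta and gamma values here are hypothetical, not fitted.
sigma <- plogis                     # the logistic function
gamma <- c(0, 1, 2, 3, 4)           # offsets from hardest to easiest category
theta <- 0.5                        # a better-than-average player
p_player  <- sigma(theta + gamma)   # success probabilities for this player
p_average <- sigma(0 + gamma)       # the average player has theta = 0
# Easier categories always have higher success rates, and the better player
# beats the average player in every category.
stopifnot(all(diff(p_player) > 0), all(p_player > p_average))
```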

I modeled this as a multi-level mixed effects model, with the players being random effects and the $\gamma_k$'s being fixed. Technically, I should optimize subject to the condition that the $\gamma_k$'s are increasing, but the unconstrained optimization always yields increasing $\gamma_k$'s because there is a big difference between the success rates of the categories. I used combined data from the 2012 and 2013 seasons and included all players with at least one success and one failure. I modeled each position separately. Because player effects are modeled as random, there is a fair amount of regression to the mean built in. In this sense, I am trying to estimate the true ability of the player, rather than measuring what he did during the two years. This is an important distinction, which may differ from other defensive statistics.
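A rough sketch of this kind of fit, using simulated data and lme4's `glmer` (the column names and simulation settings here are my own invention, not the actual fielding dataset):

```r
# Sketch with simulated data; not the real fielding data.
library(lme4)
set.seed(1)
n <- 2000
plays <- data.frame(
  player     = factor(sample(1:50, n, replace = TRUE)),
  difficulty = factor(sample(1:5, n, replace = TRUE))
)
true.theta <- rnorm(50, 0, 0.5)      # true player effects
true.gamma <- c(-2, -1, 0, 1, 2)     # true category effects
p <- plogis(true.theta[plays$player] + true.gamma[plays$difficulty])
plays$success <- rbinom(n, 1, p)
# Players as random effects, difficulty categories as fixed effects
fit <- glmer(success ~ difficulty + (1 | player), data = plays, family = binomial)
head(ranef(fit)$player)              # shrunken (regressed-to-the-mean) player estimates
```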

Below is a summary of the results of the model for shortstops. For readability, I am only plotting the players with at least 800 innings. A bonus of modeling the data like this is that we get standard error estimates as a result. I plotted the estimated effect of each player along with +/- 2 standard errors. We can be fairly certain that the effects for the top few shortstops are greater than 0 since their confidence intervals do not include 0. The same is true for the bottom few. Images for the other positions can be found here.

The results seem to make sense for the most part. Simmons and Tulowitzki have reputations as being strong defenders and Derek Jeter has a reputation as a poor defender.

Further, I can validate this data by comparing it to other defensive metrics. One that is readily available on Fangraphs is UZR per 150 games. For each position, I took the correlation of my estimated effect with UZR per 150 games, weighted by the number of innings played. Pitchers and catchers do not have UZRs, so I cannot compare them. The correlations, which are in the plot below, range from about 0.2 to 0.65.

In order to make this fielding metric more useful, I would like to convert the parameters to something more interpretable. One option which makes a lot of sense is “plays made above/below average”. Given an estimated $\theta_i$ for a player, we can calculate the probability that he would make a play in each of the five categories. We can then compare those probabilities to the probability an average player would make a play in each of the categories, which would be fit with $\theta=0$. Finally, we can weight these differences in probabilities by the relative rate that plays of various difficulties occur.

For example, assuming there are only two categories again, suppose a player has a 0.10 and 0.05 higher probability than average of making hard and easy plays, respectively. Further assume that 75% of all plays are hard and 25% are easy. On a random play, the improvement in probability over an average player of making a play is $.10(.75)+.05(.25)=0.0875$. If a player has an opportunity for 300 plays in an average season, this player would be $300 \times 0.0875=26.25$ plays better than average over a typical season.
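In R, the same arithmetic looks like this:

```r
# Reproducing the two-category example above.
p_improve <- c(hard = 0.10, easy = 0.05)  # improvement over average, by category
mix       <- c(hard = 0.75, easy = 0.25)  # share of plays in each category
per_play  <- sum(p_improve * mix)         # expected improvement on a random play
per_play                                  # 0.0875
300 * per_play                            # 26.25 plays above average per season
```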

I will assume that the number of opportunities to make plays is directly related to the number of innings played. To convert innings to opportunities, I took the median number of opportunities per inning for each position. For example, shortstops had the highest opportunities per inning at 0.40 and catchers had the lowest at 0.08. The plot below shows the distribution of opportunities per inning for each position.
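The innings-to-opportunities conversion is then a simple multiplication (the innings total here is hypothetical; the rates are the medians quoted above):

```r
# Medians quoted above; the innings value is hypothetical.
opp_per_inning <- c(SS = 0.40, C = 0.08)
innings <- 1200
unname(innings * opp_per_inning["SS"])  # 480 estimated opportunities at shortstop
```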

We can extend this to the impact on saving runs from being scored as well by assuming each successful play saves $x$ runs. I will not do this for this analysis.

Finally, I put together a Shiny app to display the results. You can search by team, position, and innings played. A team of ‘- - -‘ means the player played for multiple teams over this period. You can also choose to display the results as a rate statistic (extra plays made per season) or a count statistic (extra plays made over the two seasons). To get a seasonal number, I assume position players played 150 games with 8.5 innings in each game. For pitchers, I assumed that they pitched 30 games, 6 innings each.

I don’t know if I will do anything more with this data, but if I could do it again, I may have modeled each year separately instead of combining the two years together. With that, it would have been interesting to model the effect of age by observing how a player’s ability to make a play changes from one year to the next. I also think it would be interesting to see how changing positions affects a player’s efficiency. For example, we could have a $9 \times 9$ matrix of fixed effects that represent the improvement or degradation in ability as a player switches from their main position to another one. Further assumptions would be needed to make sure the $\theta$’s are on the same scale for every position.

At the very least, this model and its results can be considered another data point in the analysis of a player’s fielding ability. One thing we need to be concerned about is the classification of each play into the difficulty categories. The human eye can be fooled into thinking a routine play is hard just because a fielder dove to make the play, when a superior fielder could have made it look easy.

I have put the R code together to do this analysis in a gist. If there is interest, I will put together a repo with all the data as well.

**Update**: I have a GitHub repository with the data, R code for the analysis, the results, and code for the Shiny app. Let me know what you think.

First off, I am assuming that you have scraped the appropriate data using the code from the previous post.

```r
library(lubridate)
library(sqldf)

playlist = read.csv("CD101Playlist.csv", stringsAsFactors = FALSE)
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
```

Next, I will select just the data from 2013 and find the songs that were played most often.

```r
playlist = subset(playlist, Day >= mdy("1/1/13"))
playlist$ArtistSong = paste(playlist$Artist, playlist$Song, sep = "-")

top.songs = sqldf("Select ArtistSong, Count(ArtistSong) as Num
                   From playlist
                   Group By ArtistSong
                   Order by Num DESC
                   Limit 10")
```

The top 10 songs are the following:

```
   Artist-Song                            Number Plays
1  FITZ AND THE TANTRUMS-OUT OF MY LEAGUE          809
2  ALT J-BREEZEBLOCKS                              764
3  COLD WAR KIDS-MIRACLE MILE                      759
4  ATLAS GENIUS-IF SO                              750
5  FOALS-MY NUMBER                                 687
6  MS MR-HURRICANE                                 679
7  THE NEIGHBOURHOOD-SWEATER WEATHER               657
8  CAPITAL CITIES-SAFE AND SOUND                   646
9  VAMPIRE WEEKEND-DIANE YOUNG                     639
10 THE FEATURES-THIS DISORDER                      632
```

I will make a plot similar to the plots made in the last post to show when the top 5 songs were played throughout the year.

```r
library(reshape2)  # for dcast() and melt()
library(ggplot2)

plays.per.day = sqldf("Select Day, Count(Artist) as Num
                       From playlist
                       Group By Day
                       Order by Day")
playlist.top.songs = subset(playlist, ArtistSong %in% top.songs$ArtistSong[1:5])
song.per.day = sqldf("Select Day, ArtistSong, Count(ArtistSong) as Num
                      From [playlist.top.songs]
                      Group By Day, ArtistSong
                      Order by Day, ArtistSong")
dspd = dcast(song.per.day, Day ~ ArtistSong, sum, value.var = "Num")
song.per.day = merge(plays.per.day[, 1, drop = FALSE], dspd, all.x = TRUE)
song.per.day[is.na(song.per.day)] = 0
song.per.day = melt(song.per.day, 1, variable.name = "ArtistSong", value.name = "Num")
song.per.day$Alpha = ifelse(song.per.day$Num > 0, 1, 0)

ggplot(song.per.day, aes(Day, Num, colour = ArtistSong)) + geom_point(aes(alpha = Alpha)) +
  geom_smooth(method = "gam", family = poisson, formula = y ~ s(x), se = F, size = 1) +
  labs(x = "Date", y = "Plays Per Day", title = "Top Songs", colour = NULL) +
  scale_alpha_continuous(guide = FALSE, range = c(0, .5)) + theme_bw()
```

Alt-J was more popular in the beginning of the year, and the Foals have been more popular recently. I can similarly summarize by artist as well.

```r
top.artists = sqldf("Select Artist, Count(Artist) as Num
                     From playlist
                     Group By Artist
                     Order by Num DESC
                     Limit 10")
```

```
   Artist                   Num
1  MUSE                    1683
2  VAMPIRE WEEKEND         1504
3  SILVERSUN PICKUPS       1442
4  FOALS                   1439
5  PHOENIX                 1434
6  COLD WAR KIDS           1425
7  JAKE BUGG               1316
8  QUEENS OF THE STONE AGE 1296
9  ALT J                   1233
10 OF MONSTERS AND MEN     1150
```

```r
playlist.top.artists = subset(playlist, Artist %in% top.artists$Artist[1:5])
artists.per.day = sqldf("Select Day, Artist, Count(Artist) as Num
                         From [playlist.top.artists]
                         Group By Day, Artist
                         Order by Day, Artist")
dspd = dcast(artists.per.day, Day ~ Artist, sum, value.var = "Num")
artists.per.day = merge(plays.per.day[, 1, drop = FALSE], dspd, all.x = TRUE)
artists.per.day[is.na(artists.per.day)] = 0
artists.per.day = melt(artists.per.day, 1, variable.name = "Artist", value.name = "Num")
artists.per.day$Alpha = ifelse(artists.per.day$Num > 0, 1, 0)

ggplot(artists.per.day, aes(Day, Num, colour = Artist)) + geom_point(aes(alpha = Alpha)) +
  geom_smooth(method = "gam", family = poisson, formula = y ~ s(x), se = F, size = 1) +
  labs(x = "Date", y = "Plays Per Day", title = "Top Artists", colour = NULL) +
  scale_alpha_continuous(guide = FALSE, range = c(0, .5)) + theme_bw()
```

The patterns for the artists are not as clear as they are for the songs. Finally, I wrote an interactive Shiny app. Shiny apps are surprisingly easy to create, and if you are thinking about experimenting with one, I suggest you try it. I will leave the code for the app in a gist. In the app, you can enter any artist you want, and it will show you the most popular songs on CD102.5 for that artist. You can also select the number of songs that it plots with the slider.

For example, even though Muse did not have one of the most popular songs of the year, they were still the band that was played the most. By typing in "MUSE" in the Artist text input, you will get the following output.

They had two songs that were very popular this year and a few others that were decently popular as well.

Play around with it and let me know what you think.

Using this structure, I decided to see if I could download all of their historical data and find out how far back it goes. In the code below, I use the XML package to read the page, download the most recent 50 songs, and then increment the offset by 50 to fetch the previous 50 songs. The code keeps doing this until it reaches January 1, 2012.

```r
library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)

startNum = 0
while (TRUE) {
    theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=", startNum)
    table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
    if (startNum == 0) {
        playlist = table[, -1]
    } else {
        playlist = rbind(playlist, table[, -1])
    }
    dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
    print(dt)
    if (dt < mdy("1/1/12")) {
        break
    }
    startNum = startNum + 50
}

playlist = unique(playlist)  # Remove dupes
write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
```

This takes a while, and the resulting file is fairly large; mine has over 150,000 songs. If you want just a little data, change the date to last week or so. The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that the data only goes back to March of 2012.

```r
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)
```

```
##                      Artist                Song       Last.Played
## 151638 DEATH CAB FOR CUTIE   YOU ARE A TOURIST 12:34am03/01/2012
## 151637       SLEEPER AGENT          GET BURNED 12:38am03/01/2012
## 151636          WASHED OUT           AMOR FATI 12:41am03/01/2012
## 151635            COLDPLAY       CHARLIE BROWN 12:45am03/01/2012
## 151634           GROUPLOVE         TONGUE TIED 12:49am03/01/2012
## 151633               SUGAR YOUR FAVORITE THING 12:52am03/01/2012
##             Month        Day   Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S
```

Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.

```r
sqldf("Select Artist, Count(Artist) as PlayCount
       From playlist
       Group By Artist
       Order by PlayCount DESC
       Limit 10")
```

```
##                   Artist PlayCount
## 1      SILVERSUN PICKUPS      2340
## 2         THE BLACK KEYS      2203
## 3                   MUSE      1988
## 4              THE SHINS      1885
## 5    OF MONSTERS AND MEN      1753
## 6            PASSION PIT      1552
## 7              GROUPLOVE      1544
## 8  RED HOT CHILI PEPPERS      1514
## 9                 METRIC      1495
## 10          ATLAS GENIUS      1494
```

```r
sqldf("Select Artist, Song, Count(Song) as PlayCount
       From playlist
       Group By Artist, Song
       Order by PlayCount DESC
       Limit 10")
```

```
##                 Artist                    Song PlayCount
## 1          PASSION PIT             TAKE A WALK       828
## 2    SILVERSUN PICKUPS                PIT, THE       825
## 3         ATLAS GENIUS                 TROJANS       819
## 4        WALK THE MOON                ANNA SUN       742
## 5       THE BLACK KEYS LITTLE BLACK SUBMARINES       736
## 6          DIVINE FITS   WOULD THAT NOT BE NICE      731
## 7        THE LUMINEERS                  HO HEY       722
## 8       CAPITAL CITIES          SAFE AND SOUND       712
## 9  OF MONSTERS AND MEN          MOUNTAIN SOUND       711
## 10               ALT J            BREEZEBLOCKS       691
```

I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently, and The Shins and Grouplove have lost some steam.

```r
artist.month = sqldf("Select Month, Artist, Count(Song) as Num
                      From playlist
                      Group By Month, Artist
                      Order by Month, Artist")
artist = sqldf("Select Artist, Count(Artist) as Num
                From playlist
                Group By Artist
                Order by Num DESC")
```

```r
p = ggplot(subset(artist.month, Artist %in% head(artist$Artist, 8)), aes(Month, Num))
p + geom_bar(stat = "identity", aes(fill = Artist), position = "fill", colour = "grey") +
  labs(y = "Percentage of Plays")
```

For the play count of the top artists, I see some odd numbers in June and July of 2012. The number of plays went way down.

```r
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")
```

Looking into this further, I plotted the date and time that the song was played by the cumulative number of songs played since the beginning of the list. The plot should be a line with a constant slope, meaning that the plays per day are relatively constant. You can see in June and July of 2012 there are flat spots where there is no playlist history.

```r
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")
```

There are also smaller flat spots in September and December, but I am going to decide that those are small enough not to affect any further analyses. From this, I am going to use data only from August 2012 to present.

```r
playlist = subset(playlist, Day >= mdy("8/1/12"))
```

Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
Since the NCAA Men's Basketball Tournament moved to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an event must be 0, but we know better than that.

In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases. I was able to download tournament data from the 1930s through 2012 from hoopstournament.net/Database.html. The tournament expanded to 64 teams in 1985, which is the period I used for this post. I only used matchups in which one of the seeds was higher than the other because this was the easiest way to remove duplicates. (The database has each game listed twice, once with the winner as the first team and once with the loser as the first team. The vast majority of games, 98.9%, had one team as a higher seed, because an equal seed can only happen at the Final Four or later.)

```r
library(ggplot2); theme_set(theme_bw())

brackets = read.csv("NCAAHistory.csv")

# use only data from 1985 on in which the first team has the higher seed
brackets = subset(brackets, Seed < Opponent.Seed & Year >= 1985 & Round != "Opening Round")
brackets$SeedDiff = abs(brackets$Opponent.Seed - brackets$Seed)
brackets$HigherSeedWon = ifelse(brackets$Opponent.Seed > brackets$Seed, brackets$Wins, brackets$Losses)
brackets$HigherSeedScoreDiff = ifelse(brackets$Opponent.Seed > brackets$Seed, 1, -1) * (brackets$Score - brackets$Opponent.Score)

seed.diffs = sort(unique(brackets$SeedDiff))
win.pct = sapply(seed.diffs, function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff == x]))

ggplot(data = data.frame(seed.diffs, win.pct), aes(seed.diffs, win.pct)) + geom_point() +
  geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line() + labs(x = "Seed Difference", y = "Proportion of Games Won by Higher Seed")
```

We have no such conflict of interest, so we should try to make use of any information available. A simple way to do that is to look at the mean and standard deviation of the margin of victory when the 16 seed is playing the 1 seed. Below is a plot of the mean score difference with a ribbon for the +/- 2 standard deviations.

```r
seed.diffs = sort(unique(brackets$SeedDiff))
means = sapply(seed.diffs, function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff == x]))
sds = sapply(seed.diffs, function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff == x]))

ggplot(data = data.frame(seed.diffs, means, sds), aes(seed.diffs, means)) +
  geom_ribbon(aes(ymin = means - 2 * sds, ymax = means + 2 * sds), alpha = .5) +
  geom_point() + geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Seed Difference", y = "Margin of Victory by Higher Seed")
```

You can see that the ribbon includes zero for all seed differences except 15. If we assume that the score differences are roughly normal, we can calculate the probability that the score difference will be greater than 0. The results are largely the same as before, but we see now that there are no 100% estimates. Also, the 50% win percentage for a seed difference of 10 now looks a little more reasonable, albeit still out of line with the rest.

```r
ggplot(data = data.frame(seed.diffs, means, sds), aes(seed.diffs, 1 - pnorm(0, means, sds))) +
  geom_point() + geom_line() + geom_hline(yintercept = 0.5, linetype = 2) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning Based on Margin of Victory")
```

In the plot below, you can see that the logistic model predicts that the probability of winning increases throughout, reaching about 90% for the 16 versus 1 matchup. I also included a non-linear generalized additive model (GAM) for comparison. The GAM believes that being a big favorite (16 vs 1 or 15 vs 2) gives a little boost in win probability. An advantage of modeling is that we can make predictions for matchups that have never occurred (like a seed difference of 14).

```r
ggplot(data = brackets, aes(SeedDiff, HigherSeedWon)) +
  stat_smooth(method = "gam", family = "binomial", se = F, formula = y ~ 0 + x, aes(colour = "Logistic"), size = 1) +
  stat_smooth(method = "gam", family = "binomial", se = F, formula = y ~ s(x), aes(colour = "GAM"), size = 1) +
  geom_jitter(alpha = .15, position = position_jitter(height = .025, width = .25)) +
  labs(x = "Seed Difference", y = "Game Won by Higher Seed", colour = "Model")

ggplot(data = brackets, aes(SeedDiff, HigherSeedScoreDiff)) +
  stat_smooth(method = "lm", se = F, formula = y ~ 0 + x, aes(colour = "Linear"), size = 1) +
  stat_smooth(method = "gam", se = F, formula = y ~ s(x), aes(colour = "GAM"), size = 1) +
  geom_jitter(alpha = .25, position = position_jitter(height = 0, width = .25)) +
  labs(x = "Seed Difference", y = "Margin of Victory by Higher Seed", colour = "Model")
```

From these models of margin of victory we can infer the probability of the higher seed winning (again, assuming normality).

```r
library(gam)

lm.seed = lm(HigherSeedScoreDiff ~ 0 + SeedDiff, data = brackets)
gam.seed = gam(HigherSeedScoreDiff ~ s(SeedDiff), data = brackets)

pred.lm.seed = predict(lm.seed, data.frame(SeedDiff = 0:15), se.fit = TRUE)
pred.gam.seed = predict(gam.seed, data.frame(SeedDiff = 0:15), se.fit = TRUE)
se.lm = sqrt(mean(lm.seed$residuals^2))
se.gam = sqrt(mean(gam.seed$residuals^2))

df1 = data.frame(SeedDiff = 0:15,
                 ProbLM = 1 - pnorm(0, pred.lm.seed$fit, sqrt(se.lm^2 + pred.lm.seed$se.fit^2)),
                 ProbGAM = 1 - pnorm(0, pred.gam.seed$fit, sqrt(se.gam^2 + pred.gam.seed$se.fit^2)))

ggplot(df1) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line(aes(SeedDiff, ProbLM, colour = "Linear"), size = 1) +
  geom_line(aes(SeedDiff, ProbGAM, colour = "GAM"), size = 1) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning", colour = "Model")
```

This post highlights the many different ways someone can analyze the same data. Simply Statistics talked a bit about this on a recent podcast. In this case, the differences are not huge, but there are noticeable changes. So the next time you read about an analysis that someone did, keep in mind all the decisions they had to make and how sensitive the results might be to them.

```r
logit.seed = glm(HigherSeedWon ~ 0 + SeedDiff, data = brackets, family = binomial(logit))
logit.seed.gam = gam(HigherSeedWon ~ s(SeedDiff), data = brackets, family = binomial(logit))

df2 = data.frame(SeedDiff = 0:15,
                 ProbLM = 1 - pnorm(0, pred.lm.seed$fit, sqrt(se.lm^2 + pred.lm.seed$se.fit^2)),
                 ProbGAM = 1 - pnorm(0, pred.gam.seed$fit, sqrt(se.gam^2 + pred.gam.seed$se.fit^2)),
                 ProbLogit = predict(logit.seed, data.frame(SeedDiff = 0:15), type = "response"),
                 ProbLogitGAM = predict(logit.seed.gam, data.frame(SeedDiff = 0:15), type = "response"))
df2 = merge(df2, data.frame(SeedDiff = seed.diffs, ProbFreq = win.pct), all.x = T)
df2 = merge(df2, data.frame(SeedDiff = seed.diffs, ProbScore = 1 - pnorm(0, means, sds)), all.x = T)

ggplot(df2, aes(SeedDiff)) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_line(aes(y = ProbLM, colour = "Linear"), size = 1) +
  geom_line(aes(y = ProbGAM, colour = "GAM"), size = 1) +
  geom_line(aes(y = ProbLogit, colour = "Logistic"), size = 1) +
  geom_line(aes(y = ProbLogitGAM, colour = "Logistic GAM"), size = 1) +
  geom_line(aes(y = ProbFreq, colour = "Frequencies"), size = 1) +
  geom_line(aes(y = ProbScore, colour = "Score Diff"), size = 1) +
  geom_point(aes(y = ProbFreq, colour = "Frequencies"), size = 3) +
  geom_point(aes(y = ProbScore, colour = "Score Diff"), size = 3) +
  labs(x = "Seed Difference", y = "Probability of Higher Seed Winning", colour = "Model")
```

```r
ggplot(df2) + geom_hline(yintercept = 0.5, linetype = 2) +
  geom_point(aes(x = SeedDiff, y = ProbFreq, colour = "Frequencies"), size = 1)
```