Before selling, I remembered listening to a Planet Money episode about a couple of guys who tried to make money buying and selling used textbooks on Amazon. Their strategy was to buy books at the end of a semester, when students are itching to get rid of them, and sell them to other students at the beginning of the next semester. To back up their business, they had been scraping Amazon's website for years, tracking prices in order to find the optimal times to buy and sell.
I collected a few books I was willing to part with and set up a scraper in R. I am primarily interested in selling my books to Amazon, so I tracked Amazon's Trade-In prices for these books. This was fairly easy with Hadley Wickham's rvest package and the Chrome extension Selector Gadget. Selector Gadget tells me that the node I'm interested in is #tradeInButton_tradeInValue. The code to do the scraping is below.
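A minimal sketch of such a scraper with rvest, using the node found above; the example URL and the function name are placeholders, not the original code:

```r
library(rvest)

# Look up the current Trade-In price shown on an Amazon product page.
# The CSS selector is the one identified with Selector Gadget.
get_tradein_price <- function(url) {
  page <- read_html(url)
  html_text(html_node(page, "#tradeInButton_tradeInValue"), trim = TRUE)
}

# e.g. get_tradein_price("http://www.amazon.com/dp/XXXXXXXXXX")
```

Running a function like this on a schedule (once or twice a day) and appending the results to a CSV is enough to build up the price history used below.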

After manually collecting this data for less than a week, I am able to plot the trends for the eight books I am interested in selling. The plot and code are below.
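A sketch of what the plotting step might look like, assuming the scraped data has been collected into a data frame prices with columns Date, Title, and Price (all of these names are assumptions):

```r
library(ggplot2)

# One line per book, tracking its Trade-In price over time.
ggplot(prices, aes(Date, Price, colour = Title)) +
  geom_line() +
  labs(x = "Date", y = "Trade-In Price", colour = "Book")
```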

I am surprised how much the prices fluctuate. I expected them to be constant for the most part, with large changes once a week or less often. Apparently Amazon is doing quite a bit of tweaking to determine optimal price points. I would also guess that $2 is their minimum trade-in price. It looks like I missed out on my best chance to sell A First Course in Stochastic Processes, but that the price of A Primer on Linear Models will keep rising forever.
Time stacking is a way to combine all the photos into a single photo, instead of a movie. This is a common method for making star trails, and Matt Molloy has recently been experimenting with it in many different settings. There are many possible ways to achieve a time stack, but the most common is to combine the photos with a lighten layer blend. For every pixel position in the final image, the combined photo uses the value from whichever photo was brightest at that position. This gives the desired effect of showing the motion of the stars or clouds in a scene.
Another way to combine the photos is through time slicing (see, for example, this photo). In time slicing, the final combined image will contain a "slice" from each of the original photos. Time slices can go left-to-right, right-to-left, top-to-bottom, or bottom-to-top. For example, a time slice that goes from left to right will use vertical slices of the pixels. If you took 100 photos for your time lapse, each 1000 pixels wide, the leftmost 10 vertical pixel slices of the final image would contain the corresponding pixels from the first photo, the 11th through 20th vertical slices would contain the corresponding pixels from the second photo, and so on. Different directions will produce different effects.
There is free software available to do a lighten layer blend of two photos, but I could not find any that automatically does it for a large number of photos. Similarly for the time slice: it is easy enough to manually slice a few photos, but not hundreds. Therefore, I wrote a couple of scripts in R to do this automatically. A gist of the scripts to create a time stack and a time slice is here. Both require you to give the directory containing the JPEG photos, and the time slice script lets you enter the direction you would like it to go.
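The logic of the two scripts can be sketched as follows; this is not the gist itself, and the function names and the use of the jpeg package are my assumptions:

```r
library(jpeg)

# Time stack: lighten layer blend, i.e. the pixel-wise maximum over photos.
time_stack <- function(dir) {
  files <- list.files(dir, pattern = "\\.jpg$", full.names = TRUE)
  out <- readJPEG(files[1])
  for (f in files[-1]) {
    out <- pmax(out, readJPEG(f))  # keep the brightest value at each pixel
  }
  out  # save with writeJPEG(out, "stack.jpg")
}

# Left-to-right time slice: photo i contributes the i-th vertical strip.
# Assumes all photos have the same dimensions and there are more
# pixel columns than photos.
time_slice <- function(dir) {
  files <- list.files(dir, pattern = "\\.jpg$", full.names = TRUE)
  out <- readJPEG(files[1])
  breaks <- round(seq(0, dim(out)[2], length.out = length(files) + 1))
  for (i in seq_along(files)) {
    cols <- (breaks[i] + 1):breaks[i + 1]
    out[, cols, ] <- readJPEG(files[i])[, cols, ]
  }
  out
}
```

The other slice directions only change which margin of the array is partitioned and the order in which the photos are visited.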
To try this out on my own photos, I used the time lapse I had created of a sunset (movie created with FFmpeg), which consisted of 225 photos taken over 20 minutes. The source material isn’t that great (the photos were out of focus), but you can still see the effects.
The following picture is a time stack.
The following four pictures are the time slices with different directions.
The model I will build is motivated by ordinal regression. If we were only concerned with the success rate in one of the categories, we could use standard logistic regression, and the probability that player $i$ successfully made a play would be assumed to be $\sigma(\theta_i)$, where $\sigma(\cdot)$ is the logistic function. Using our prior knowledge that plays categorized as easy should have a higher success rate than plays categorized as difficult, I would like to generalize this.
Say there are only two categories: easy and hard. We could model the probability that player $i$ successfully made a hard play as $\sigma(\theta_i)$ and the probability that he made an easy play as $\sigma(\theta_i+\gamma)$, with $\gamma > 0$. Here, we would assume that $\gamma$ is the same for all players. This assumption implies that if player $i$ is better than player $j$ at easy plays, he will also be better at hard plays. This is a reasonable assumption, but maybe not true in all cases.
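As a quick numerical illustration of the two-category setup (the numbers here are made up), the shared $\gamma$ guarantees that the easy-play probability is higher for every player:

```r
theta <- 0.5   # hypothetical player effect
gamma <- 1.2   # hypothetical easy-play shift, shared by all players

p_hard <- plogis(theta)          # P(success | hard play)
p_easy <- plogis(theta + gamma)  # P(success | easy play)
p_easy > p_hard                  # TRUE for any theta, since gamma > 0
```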
Since we have five different categories of difficulty, we can generalize this by having $\gamma_k, k=1,\ldots,4$. Again, these $\gamma_k$'s would be the same for everyone. A picture of what this looks like for shortstops is below. In this model, every player effectively shifts the curve either left or right. A positive $\theta_i$ means the player is better than average and causes the curve to shift left, and vice versa for a negative $\theta_i$.
I modeled this as a multilevel mixed effects model, with the players being random effects and the $\gamma_k$'s being fixed. Technically, I should optimize subject to the constraint that the $\gamma_k$'s are increasing, but the unconstrained optimization always yields increasing $\gamma_k$'s because there is a big difference between the success rates in the categories. I used combined data from the 2012 and 2013 seasons and included all players with at least one success and one failure. I modeled each position separately. Since player effects are modeled as random, there is a fair amount of regression to the mean built in. In this sense, I am trying to estimate the true ability of the player rather than measuring what he did during the two years. This is an important distinction, which may differ from other defensive statistics.
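For reference, a model of this form could be fit with the lme4 package along these lines; the data frame and column names here are assumptions, not the code used for the analysis:

```r
library(lme4)

# plays: one row per batted ball, with columns (all names assumed):
#   success    - 0/1, whether the play was made
#   difficulty - factor with the 5 difficulty levels; its fixed effects
#                play the role of the gamma_k's
#   player     - fielder id, fit as a random intercept (theta_i)
fit <- glmer(success ~ difficulty + (1 | player),
             data = plays, family = binomial)
```

The random-intercept variance controls how strongly individual players are shrunk toward the league average, which is where the regression to the mean mentioned above comes from.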
Below is a summary of the results of the model for shortstops. For readability, I am only plotting players with at least 800 innings. A bonus of modeling the data like this is that we get standard error estimates as a result. I plotted the estimated effect of each player along with +/- 2 standard errors. We can be fairly certain that the effects for the top few shortstops are greater than 0 since their confidence intervals do not include 0. The same is true for the bottom few. Images for the other positions can be found here.
The results seem to make sense for the most part. Simmons and Tulowitzki have reputations as being strong defenders and Derek Jeter has a reputation as a poor defender.
Further, I can validate these results by comparing them to other defensive metrics. One that is readily available on Fangraphs is UZR per 150 games. For each position, I took the correlation of my estimated effect with UZR per 150 games, weighted by the number of innings played. Pitchers and catchers do not have UZRs, so I cannot compare them. The correlations, which are in the plot below, range from about 0.2 to 0.65.
In order to make this fielding metric more useful, I would like to convert the parameters to something more interpretable. One option which makes a lot of sense is “plays made above/below average”. Given an estimated $\theta_i$ for a player, we can calculate the probability that he would make a play in each of the five categories. We can then compare those probabilities to the probability an average player would make a play in each of the categories, which would be fit with $\theta=0$. Finally, we can weight these differences in probabilities by the relative rate that plays of various difficulties occur.
For example, assuming there are only two categories again, suppose a player has a 0.10 and 0.05 higher probability than average of making hard and easy plays, respectively. Further assume that 75% of all plays are hard and 25% are easy. On a random play, the improvement in probability over an average player of making a play is $.10(.75)+.05(.25)=0.0875$. If a player has an opportunity for 300 plays in an average season, this player would be $300 \times 0.0875=26.25$ plays better than average over a typical season.
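The arithmetic above generalizes to a weighted sum over categories; a minimal sketch with the same made-up numbers:

```r
prob_diff <- c(hard = 0.10, easy = 0.05)  # improvement over average, by category
prop      <- c(hard = 0.75, easy = 0.25)  # relative frequency of each category
opps      <- 300                          # opportunities in an average season

opps * sum(prob_diff * prop)  # 300 * 0.0875 = 26.25 plays above average
```

With five categories, prob_diff and prop simply become vectors of length five, with prob_diff computed from the fitted $\theta_i$ and $\gamma_k$'s.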
I will assume that the number of opportunities to make plays is directly related to the number of innings played. To convert innings to opportunities, I took the median number of opportunities per inning for each position. For example, shortstops had the highest opportunities per inning at 0.40 and catchers had the lowest at 0.08. The plot below shows the distribution of opportunities per inning for each position.
We can extend this to the impact on saving runs from being scored as well by assuming each successful play saves $x$ runs. I will not do this for this analysis.
[Update: The Shiny app is no longer live, but you can view the resulting estimates in the Github repository.]
Finally, I put together a Shiny app to display the results. You can search by team, position, and innings played. A team of ‘  ‘ means the player played for multiple teams over this period. You can also choose to display the results as a rate statistic (extra plays made per season) or a count statistic (extra plays made over the two seasons). To get a seasonal number, I assume position players played 150 games with 8.5 innings in each game. For pitchers, I assumed that they pitched 30 games, 6 innings each.
I don't know if I will do anything more with this data, but if I could do it again, I might have modeled each year separately instead of combining the two years together. With that, it would have been interesting to model the effect of age by observing how a player's ability to make a play changes from one year to the next. I also think it would be interesting to see how changing positions affects a player's efficiency. For example, we could have a $9 \times 9$ matrix of fixed effects that represent the improvement or degradation in ability as a player switches from his main position to another one. Further assumptions would be needed to make sure the $\theta$'s are on the same scale for every position.
At the very least, this model and its results can be considered another data point in the analysis of a player's fielding ability. One thing we need to be concerned about is the classification of each play into the difficulty categories. The human eye can be fooled into thinking a routine play is hard just because a fielder dove to make the play, when a superior fielder could have made it look easier.
I have put the R code together to do this analysis in a gist. If there is interest, I will put together a repo with all the data as well.
Update: I have a Github repository with the data, R code for the analysis, the results, and code for the Shiny app. Let me know what you think.
One great thing about the class, however, is that they are truly experts and have collaborated with many of the influential researchers in their field. Because of this, when covering certain topics, they have included interviews with statisticians who made important developments in the field. When introducing the class to R, they interviewed John Chambers, who was able to give a personal account of the history of S and R because he was one of the developers. Further, when covering resampling methods, they spoke with Brad Efron, who talked about the history of the bootstrap and how he struggled to get it published.
Today, they released a video interview with Jerome Friedman. Friedman revealed many interesting facts about the history of tree-based methods, including the fact that there weren't really any journal articles written about CART when they wrote their book. There was one quote that I particularly enjoyed.
And of course, I'm very gratified that something that I was intellectually interested in for all this time has now become very popular and very important. I mean, data has risen to the top. My only regret is two of my mentors who also pushed it, probably harder and more effectively than I did – namely, John Tukey and Leo Breiman – are not around to actually see how data has triumphed over, say, theorem proving.
library(lubridate)
library(sqldf)
playlist=read.csv("CD101Playlist.csv",stringsAsFactors=FALSE)
dates=mdy(substring(playlist[,3],nchar(playlist[,3])-9,nchar(playlist[,3])))
times=hm(substring(playlist[,3],1,nchar(playlist[,3])-10))
playlist$Month=ymd(paste(year(dates),month(dates),"1",sep="-"))
playlist$Day=dates
playlist$Time=times
playlist=playlist[order(playlist$Day,playlist$Time),]
playlist=subset(playlist,Day>=mdy("1/1/13"))
playlist$ArtistSong=paste(playlist$Artist,playlist$Song,sep="-")
top.songs=sqldf("Select ArtistSong, Count(ArtistSong) as Num
From playlist
Group By ArtistSong
Order by Num DESC
Limit 10")
ArtistSong Num
1 FITZ AND THE TANTRUMS-OUT OF MY LEAGUE 809
2 ALT J-BREEZEBLOCKS 764
3 COLD WAR KIDS-MIRACLE MILE 759
4 ATLAS GENIUS-IF SO 750
5 FOALS-MY NUMBER 687
6 MS MR-HURRICANE 679
7 THE NEIGHBOURHOOD-SWEATER WEATHER 657
8 CAPITAL CITIES-SAFE AND SOUND 646
9 VAMPIRE WEEKEND-DIANE YOUNG 639
10 THE FEATURES-THIS DISORDER 632
plays.per.day=sqldf("Select Day, Count(Artist) as Num
From playlist
Group By Day
Order by Day")
playlist.top.songs=subset(playlist,ArtistSong %in% top.songs$ArtistSong[1:5])
song.per.day=sqldf(paste0("Select Day, ArtistSong, Count(ArtistSong) as Num
From [playlist.top.songs]
Group By Day, ArtistSong
Order by Day, ArtistSong"))
dspd=dcast(song.per.day,Day~ArtistSong,sum,value.var="Num")
song.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
song.per.day[is.na(song.per.day)]=0
song.per.day=melt(song.per.day,1,variable.name="ArtistSong",value.name="Num")
song.per.day$Alpha=ifelse(song.per.day$Num>0,1,0)
library(ggplot2)
ggplot(song.per.day,aes(Day,Num,colour=ArtistSong))+geom_point(aes(alpha=Alpha))+
geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
labs(x="Date",y="Plays Per Day",title="Top Songs",colour=NULL)+
scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
Alt-J was more popular in the beginning of the year and the Foals have been more popular recently.

top.artists=sqldf("Select Artist, Count(Artist) as Num
From playlist
Group By Artist
Order by Num DESC
Limit 10")
Artist Num
1 MUSE 1683
2 VAMPIRE WEEKEND 1504
3 SILVERSUN PICKUPS 1442
4 FOALS 1439
5 PHOENIX 1434
6 COLD WAR KIDS 1425
7 JAKE BUGG 1316
8 QUEENS OF THE STONE AGE 1296
9 ALT J 1233
10 OF MONSTERS AND MEN 1150
playlist.top.artists=subset(playlist,Artist %in% top.artists$Artist[1:5])
artists.per.day=sqldf(paste0("Select Day, Artist, Count(Artist) as Num
From [playlist.top.artists]
Group By Day, Artist
Order by Day, Artist"))
dspd=dcast(artists.per.day,Day~Artist,sum,value.var="Num")
artists.per.day=merge(plays.per.day[,1,drop=FALSE],dspd,all.x=TRUE)
artists.per.day[is.na(artists.per.day)]=0
artists.per.day=melt(artists.per.day,1,variable.name="Artist",value.name="Num")
artists.per.day$Alpha=ifelse(artists.per.day$Num>0,1,0)
ggplot(artists.per.day,aes(Day,Num,colour=Artist))+geom_point(aes(alpha=Alpha))+
geom_smooth(method="gam",family=poisson,formula=y~s(x),se=F,size=1)+
labs(x="Date",y="Plays Per Day",title="Top Artists",colour=NULL)+
scale_alpha_continuous(guide=FALSE,range=c(0,.5))+theme_bw()
The pattern for the artists is not as clear as it is for the songs.

library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)
startNum = 0
while (TRUE) {
theurl <- paste0("http://cd1025.com/about/playlists/nowplaying/?start=",
startNum)
table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
if (startNum == 0) {
playlist = table[, -1]
} else {
playlist = rbind(playlist, table[, -1])
}
dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
print(dt)
if (dt < mdy("1/1/12")) {
break
}
startNum = startNum + 50
}
playlist = unique(playlist) # Remove Dupes
write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
This takes a while, and the resulting file is fairly large. My file has over 150,000 songs. If you want just a little data, change the date to last week or so. The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that the data only goes back to March of 2012.

dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[,
3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)
## Artist Song Last.Played
## 151638 DEATH CAB FOR CUTIE YOU ARE A TOURIST 12:34am03/01/2012
## 151637 SLEEPER AGENT GET BURNED 12:38am03/01/2012
## 151636 WASHED OUT AMOR FATI 12:41am03/01/2012
## 151635 COLDPLAY CHARLIE BROWN 12:45am03/01/2012
## 151634 GROUPLOVE TONGUE TIED 12:49am03/01/2012
## 151633 SUGAR YOUR FAVORITE THING 12:52am03/01/2012
## Month Day Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S
Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.

sqldf("Select Artist, Count(Artist) as PlayCount
From playlist
Group By Artist
Order by PlayCount DESC
Limit 10")
## Artist PlayCount
## 1 SILVERSUN PICKUPS 2340
## 2 THE BLACK KEYS 2203
## 3 MUSE 1988
## 4 THE SHINS 1885
## 5 OF MONSTERS AND MEN 1753
## 6 PASSION PIT 1552
## 7 GROUPLOVE 1544
## 8 RED HOT CHILI PEPPERS 1514
## 9 METRIC 1495
## 10 ATLAS GENIUS 1494
sqldf("Select Artist, Song, Count(Song) as PlayCount
From playlist
Group By Artist, Song
Order by PlayCount DESC
Limit 10")
## Artist Song PlayCount
## 1 PASSION PIT TAKE A WALK 828
## 2 SILVERSUN PICKUPS PIT, THE 825
## 3 ATLAS GENIUS TROJANS 819
## 4 WALK THE MOON ANNA SUN 742
## 5 THE BLACK KEYS LITTLE BLACK SUBMARINES 736
## 6 DIVINE FITS WOULD THAT NOT BE NICE 731
## 7 THE LUMINEERS HO HEY 722
## 8 CAPITAL CITIES SAFE AND SOUND 712
## 9 OF MONSTERS AND MEN MOUNTAIN SOUND 711
## 10 ALT J BREEZEBLOCKS 691
I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently, and The Shins and Grouplove have lost some steam.

artist.month=sqldf("Select Month, Artist, Count(Song) as Num
From playlist
Group By Month, Artist
Order by Month, Artist")
artist=sqldf("Select Artist, Count(Artist) as Num
From playlist
Group By Artist
Order by Num DESC")
p=ggplot(subset(artist.month,Artist %in% head(artist$Artist,8)),aes(Month,Num))
p+geom_bar(stat="identity",aes(fill=Artist),position='fill',colour="grey")+
labs(y="Percentage of Plays")
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")
playlist = subset(playlist, Day >= mdy("8/1/12"))
Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
library(ggplot2); theme_set(theme_bw())
brackets=read.csv("NCAAHistory.csv")
# use only data from 1985 on in which the first team has the higher seed
brackets=subset(brackets,Seed<Opponent.Seed & Year>=1985 & Round!="Opening Round")
brackets$SeedDiff=abs(brackets$Opponent.Seed-brackets$Seed)
brackets$HigherSeedWon=ifelse(brackets$Opponent.Seed>brackets$Seed,brackets$Wins,brackets$Losses)
brackets$HigherSeedScoreDiff=ifelse(brackets$Opponent.Seed>brackets$Seed,1,-1)*(brackets$Score-brackets$Opponent.Score)
seed.diffs=sort(unique(brackets$SeedDiff))
win.pct=sapply(seed.diffs,function(x) mean(brackets$HigherSeedWon[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,win.pct),aes(seed.diffs,win.pct))+geom_point()+
geom_hline(yintercept=0.5,linetype=2)+
geom_line()+labs(x="Seed Difference",y="Proportion of Games Won by Higher Seed")
seed.diffs=sort(unique(brackets$SeedDiff))
means=sapply(seed.diffs,function(x) mean(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
sds=sapply(seed.diffs,function(x) sd(brackets$HigherSeedScoreDiff[brackets$SeedDiff==x]))
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,means))+
geom_ribbon(aes(ymin=means-2*sds,ymax=means+2*sds),alpha=.5)+geom_point()+geom_line()+
geom_hline(yintercept=0,linetype=2)+
labs(x="Seed Difference",y="Margin of Victory by Higher Seed")
ggplot(data=data.frame(seed.diffs,means,sds),aes(seed.diffs,1-pnorm(0,means,sds)))+
geom_point()+geom_line()+geom_hline(yintercept=0.5,linetype=2)+
labs(x="Seed Difference",y="Probability of Higher Seed Winning Based on Margin of Victory")
ggplot(data=brackets,aes(SeedDiff,HigherSeedWon))+
stat_smooth(method="gam",family="binomial",se=F,formula=y~0+x,aes(colour="Logistic"),size=1)+
stat_smooth(method="gam",family="binomial",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
geom_jitter(alpha=.15,position = position_jitter(height = .025,width=.25))+
labs(x="Seed Difference",y="Game Won by Higher Seed",colour="Model")
ggplot(data=brackets,aes(SeedDiff,HigherSeedScoreDiff))+
stat_smooth(method="lm",se=F,formula=y~0+x,aes(colour="Linear"),size=1)+
stat_smooth(method="gam",se=F,formula=y~s(x),aes(colour="GAM"),size=1)+
geom_jitter(alpha=.25,position = position_jitter(height = 0,width=.25))+
labs(x="Seed Difference",y="Margin of Victory by Higher Seed",colour="Model")
library(gam)
lm.seed=lm(HigherSeedScoreDiff~0+SeedDiff,data=brackets)
gam.seed=gam(HigherSeedScoreDiff~s(SeedDiff),data=brackets)
pred.lm.seed=predict(lm.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
pred.gam.seed=predict(gam.seed,data.frame(SeedDiff=0:15),se.fit=TRUE)
se.lm=sqrt(mean(lm.seed$residuals^2))
se.gam=sqrt(mean(gam.seed$residuals^2))
df1=data.frame(SeedDiff=0:15,ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)))
ggplot(df1)+geom_hline(yintercept=0.5,linetype=2)+
geom_line(aes(SeedDiff,ProbLM,colour="Linear"),size=1)+
geom_line(aes(SeedDiff,ProbGAM,colour="GAM"),size=1)+
labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
logit.seed=glm(HigherSeedWon~0+SeedDiff,data=brackets,family=binomial(logit))
logit.seed.gam=gam(HigherSeedWon~s(SeedDiff),data=brackets,family=binomial(logit))
df2=data.frame(SeedDiff=0:15,
ProbLM=1-pnorm(0,pred.lm.seed$fit,sqrt(se.lm^2+pred.lm.seed$se.fit^2)),
ProbGAM=1-pnorm(0,pred.gam.seed$fit,sqrt(se.gam^2+pred.gam.seed$se.fit^2)),
ProbLogit=predict(logit.seed,data.frame(SeedDiff=0:15),type="response"),
ProbLogitGAM=predict(logit.seed.gam,data.frame(SeedDiff=0:15),type="response"))
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbFreq=win.pct),all.x=T)
df2=merge(df2,data.frame(SeedDiff=seed.diffs,ProbScore=1-pnorm(0,means,sds)),all.x=T)
ggplot(df2,aes(SeedDiff))+geom_hline(yintercept=0.5,linetype=2)+
geom_line(aes(y=ProbLM,colour="Linear"),size=1)+
geom_line(aes(y=ProbGAM,colour="GAM"),size=1)+
geom_line(aes(y=ProbLogit,colour="Logistic"),size=1)+
geom_line(aes(y=ProbLogitGAM,colour="Logistic GAM"),size=1)+
geom_line(aes(y=ProbFreq,colour="Frequencies"),size=1)+
geom_line(aes(y=ProbScore,colour="Score Diff"),size=1)+
geom_point(aes(y=ProbFreq,colour="Frequencies"),size=3)+
geom_point(aes(y=ProbScore,colour="Score Diff"),size=3)+
labs(x="Seed Difference",y="Probability of Higher Seed Winning",colour="Model")
ggplot(df2)+geom_hline(yintercept=0.5,linetype=2)+
geom_point(aes(x=SeedDiff,y=ProbFreq,colour="Frequencies"),size=1)
javascript:(function(){ var curloc = window.location.toString(); var newloc = curloc.substring(0, curloc.indexOf("/", 8)) + ".proxy.lib.ohio-state.edu" + curloc.substring(curloc.indexOf("/", 8)); location.href=newloc; })()
read.excel <- function(header=TRUE,...) {
read.table("clipboard",sep="\t",header=header,...)
}
dat=read.excel()
write.excel <- function(x,row.names=FALSE,col.names=TRUE,...) {
write.table(x,"clipboard",sep="\t",row.names=row.names,col.names=col.names,...)
}
write.excel(dat)
reference=readLines("warandpeace.txt")
reference=toupper(reference)
trans.mat=matrix(0,27,27)
rownames(trans.mat)=colnames(trans.mat)=c(toupper(letters),"")
lastletter=""
for (ln in 1:length(reference)) {
if (ln %% 1000 ==0) {cat("Line",ln,"\n")}
for (pos in 1:nchar(reference[ln])) {
curletter=substring(reference[ln],pos,pos)
if (curletter %in% toupper(letters)) {
trans.mat[rownames(trans.mat)==lastletter,
colnames(trans.mat)==curletter]=
trans.mat[rownames(trans.mat)==lastletter,
colnames(trans.mat)==curletter]+1
lastletter=curletter
} else {
if (lastletter!="") {
trans.mat[rownames(trans.mat)==lastletter,27]=
trans.mat[rownames(trans.mat)==lastletter,27]+1
lastletter=""
}
}
}
curletter=""
if (lastletter!="") {
trans.mat[rownames(trans.mat)==lastletter,27]=
trans.mat[rownames(trans.mat)==lastletter,27]+1
}
lastletter=""
}
trans.prob.mat=sweep(trans.mat+1,1,rowSums(trans.mat+1),FUN="/")
decode <- function(mapping,coded) {
coded=toupper(coded)
decoded=coded
for (i in 1:nchar(coded)) {
if (substring(coded,i,i) %in% toupper(letters)) {
substring(decoded,i,i)=toupper(letters[mapping==substring(coded,i,i)])
}
}
decoded
}
log.prob <- function(mapping,decoded) {
logprob=0
lastletter=""
for (i in 1:nchar(decoded)) {
curletter=substring(decoded,i,i)
if (curletter %in% toupper(letters)) {
logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,
colnames(trans.mat)==curletter])
lastletter=curletter
} else {
if (lastletter!="") {
logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
lastletter=""
}
}
}
if (lastletter!="") {
logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
lastletter=""
}
logprob
}
correctTxt="ENTER HAMLET HAM TO BE OR NOT TO BE THAT IS THE QUESTION WHETHER TIS NOBLER IN THE MIND TO SUFFER THE SLINGS AND ARROWS OF OUTRAGEOUS FORTUNE OR TO TAKE ARMS AGAINST A SEA OF TROUBLES AND BY OPPOSING END"
coded=decode(sample(toupper(letters)),correctTxt) # randomly scramble the text
mapping=sample(toupper(letters)) # initialize a random mapping
i=1
iters=2000
cur.decode=decode(mapping,coded)
cur.loglike=log.prob(mapping,cur.decode)
max.loglike=cur.loglike
max.decode=cur.decode
while (i<=iters) {
proposal=sample(1:26,2) # select 2 letters to switch
prop.mapping=mapping
prop.mapping[proposal[1]]=mapping[proposal[2]]
prop.mapping[proposal[2]]=mapping[proposal[1]]
prop.decode=decode(prop.mapping,coded)
prop.loglike=log.prob(prop.mapping,prop.decode)
if (runif(1)<exp(prop.loglike-cur.loglike)) {
mapping=prop.mapping
cur.decode=prop.decode
cur.loglike=prop.loglike
if (cur.loglike>max.loglike) {
max.loglike=cur.loglike
max.decode=cur.decode
}
cat(i,cur.decode,"\n")
i=i+1
}
}
# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>
visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
1/(1+exp(-rbm_w %*% visible_state))
}
hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
1/(1+exp(-t(rbm_w) %*% hidden_state))
}
configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
out=0
for (i in 1:dim(visible_state)[2]) {
out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
}
out/dim(visible_state)[2]
}
configuration_goodness_gradient <- function(visible_state, hidden_state) {
hidden_state %*% t(visible_state)/dim(visible_state)[2]
}
sample_bernoulli <- function(mat) {
dims=dim(mat)
matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
}
cd1 <- function(rbm_w, visible_data) {
visible_data = sample_bernoulli(visible_data)
H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
vh0=configuration_goodness_gradient(visible_data, H0)
V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
H1=visible_state_to_hidden_probabilities(rbm_w, V1)
vh1=configuration_goodness_gradient(V1, H1)
vh0-vh1
}
rbm <- function(num_hidden, training_data, learning_rate, n_iterations, mini_batch_size=100, momentum=0.9, quiet=FALSE) {
# This trains a model that's defined by a single matrix of weights.
# <num_hidden> is the number of hidden units
# cd1 is a function that takes parameters <model> and <data> and returns the gradient (or approximate gradient in the case of CD1) of the function that we're maximizing. Note the contrast with the loss function that we saw in PA3, which we were minimizing. The returned gradient is an array of the same shape as the provided <model> parameter.
# This uses minibatches no weight decay and no early stopping.
# This returns the matrix of weights of the trained model.
n=dim(training_data)[2]
p=dim(training_data)[1]
if (n %% mini_batch_size != 0) {
stop("the number of test cases must be divisible by the mini_batch_size")
}
model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
momentum_speed = matrix(0,num_hidden,p)
start_of_next_mini_batch = 1;
for (iteration_number in 1:n_iterations) {
if (!quiet) {cat("Iter",iteration_number,"\n")}
mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
gradient = cd1(model, mini_batch)
momentum_speed = momentum * momentum_speed + gradient
model = model + momentum_speed * learning_rate
}
return(model)
}
weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
mini_batch_size=100, momentum=0.9)
library(ggplot2)
library(reshape2)
mw=melt(weights); mw$Var3=floor((mw$Var2-1)/16)+1; mw$Var2=(mw$Var2-1)%%16 + 1; mw$Var3=17-mw$Var3;
ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
scale_fill_continuous(low='white',high='black')+coord_equal()+
labs(x=NULL,y=NULL,title="Visualization of Weights")+
theme(legend.position="none")
votes = read.csv("HOF votes.csv", row.names = 1, header = TRUE)
pca = princomp(votes)
screeplot(pca, type = "l", main = "Scree Plot")
dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")
The second component isn't as easy to decipher. The players at the negative end seem to be players who are preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters, and he isn't that far separated from them.

dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
Comp.2, colour = as.factor(Barry.Bonds), shape = as.factor(Roger.Clemens)),
size = 4) + coord_equal() + labs(colour = "Bonds", shape = "Clemens")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
Comp.2, colour = as.factor(paste(Roger.Clemens, "/", Jeff.Bagwell))), size = 4) +
geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
labs(colour = "Bonds / Bagwell")
ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) +
geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
labs(colour = "Bonds / Schilling")
PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
# PC being a prcomp object
library(grid)
data <- data.frame(obsnames = row.names(PC$x), PC$x)
plot <- ggplot(data, aes_string(x = x, y = y)) + geom_text(alpha = 0.4,
size = 3, aes(label = obsnames))
plot <- plot + geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2)
datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
mult <- min((max(data[, y]) - min(data[, y])/(max(datapc[, y]) - min(datapc[,
y]))), (max(data[, x]) - min(data[, x])/(max(datapc[, x]) - min(datapc[,
x]))))
datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult *
(get(y)))
plot <- plot + coord_equal() + geom_text(data = datapc, aes(x = v1, y = v2,
label = varnames), size = 5, vjust = 1, color = "red")
plot <- plot + geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1,
yend = v2), arrow = arrow(length = unit(0.2, "cm")), alpha = 0.75, color = "red")
plot
}
fit <- prcomp(votes, scale = F)
PCbiplot(fit)
I could have also attempted to rotate the factors to make them more interpretable, but they appeared to have easy interpretation as is. Since we were looking at 2d plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate factors. I chose to use the principal component method because the data are definitely not normal (being 0’s and 1’s).
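For readers curious what that rotation would look like, here is a hedged sketch using base R's `varimax()` on simulated 0/1 vote data (the real `votes` matrix is not reproduced here, so the simulated `votes.sim` is a stand-in):

```r
# Varimax rotation of the first two principal component loadings,
# demonstrated on a simulated 0/1 vote matrix.
set.seed(1)
votes.sim <- matrix(rbinom(100 * 8, 1, 0.5), nrow = 100)
pca.sim <- princomp(votes.sim)
rot <- varimax(loadings(pca.sim)[, 1:2])  # rotate the first two components
rot$loadings  # rotated loadings
rot$rotmat    # 2x2 orthogonal rotation matrix
```

Because the rotation matrix is orthogonal, the rotated components span the same plane; as noted above, it only relabels the axes of the 2-d plots.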
]]>library(XML)
library(reshape)
library(ggplot2); theme_set(theme_bw())
dat <- pollstR(pages=20)
ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+geom_point(alpha=.5)+geom_smooth(aes(weight=sqrt(N)))+geom_hline(aes(yintercept=0.5),lty=2,size=1)+
labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)
swing.states=c("ohio","florida","virginia","colorado","nevada","north-carolina")
for (s in swing.states) {
print(s)
dat.state <- pollstR(chart=paste("2012-",s,"-president-romney-vs-obama",sep=""),pages="all")
dat.state=subset(dat.state,select=c("id","pollster","start.date","end.date","method","N","Obama","Romney"))
dat.state$State=s
if (s=="ohio") {
dat=dat.state
} else {
dat=rbind(dat,dat.state)
}
}
library(lubridate)
dat$end.date=ymd(as.character(dat$end.date))
ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+geom_point(alpha=.5)+geom_smooth(aes(weight=sqrt(N)))+geom_hline(aes(yintercept=0.5),lty=2,size=1)+
labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)+facet_wrap(~State)+xlim(c(mdy("8/1/2012"),mdy("11/6/2012")))
num.cols=sum(th)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
fo=as.formula(fo.str)
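To see what this loop actually builds, here is a self-contained toy run (the data frame `train.toy` and inclusion vector `th` are made up for illustration): factors enter the model linearly while numeric columns are wrapped in a smooth `s()`.

```r
# Toy illustration of the formula-building loop: th marks included columns.
train.toy <- data.frame(CHAS = factor(c(0, 1)), RM = c(6.5, 7.1), AGE = c(65, 40))
th <- c(1, 1, 0)  # include CHAS and RM, drop AGE
num.cols <- sum(th)
fo.str <- "MEDV ~"
cum.cols <- 0
for (i in 1:length(th)) {
  if (th[i] > 0) {
    if (is.factor(train.toy[, i])) {
      fo.str <- paste(fo.str, colnames(train.toy)[i], sep = " ")   # factor: linear term
    } else {
      fo.str <- paste(fo.str, " s(", colnames(train.toy)[i], ")", sep = "")  # numeric: smooth
    }
    cum.cols <- cum.cols + 1
    if (cum.cols < num.cols) fo.str <- paste(fo.str, "+")
  }
}
fo.str  # "MEDV ~ CHAS + s(RM)"
```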
library(mgcv)
library(tabuSearch)
# http://archive.ics.uci.edu/ml/datasets/Housing
housing=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing)=c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
housing$CHAS=as.factor(housing$CHAS)
housing$RAD=as.factor(housing$RAD) # Changed to factor bc only 9 unique values
summary(housing)
set.seed(20120823)
cv=sample(nrow(housing))
train=housing[cv[1:300],]
valid=housing[cv[301:400],]
test=housing[cv[401:nrow(housing)],]
ssto=sum((valid$MEDV-mean(valid$MEDV))^2)
evaluate <- function(th){
num.cols=sum(th)
if (num.cols == 0) return(0)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
# colnames(train)[c(th,0)==1]
fo=as.formula(fo.str)
gam1 <- gam(fo,data=train)
pred1 <- predict(gam1,valid,se=FALSE)
sse <- sum((pred1-valid$MEDV)^2,na.rm=TRUE)
return(1-sse/ssto)
}
res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5,
config = rbinom(ncol(train)-1,1,.5), nRestarts = 4,verbose=TRUE)
MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT).
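That winning formula comes from ranking the configurations the search visited. As a hedged sketch of how to pull it out, the list `res.demo` below mimics the two fields `tabuSearch` stores (`configKeep`, `eUtilityKeep`), since rerunning the search here would require the housing data:

```r
# Stand-in for the tabuSearch result: two visited variable configurations
# and their validation R-squared values.
res.demo <- list(
  configKeep = rbind(c(1, 0, 1), c(1, 1, 1)),
  eUtilityKeep = c(0.81, 0.87)
)
best <- which.max(res.demo$eUtilityKeep)  # iteration with the highest R-squared
res.demo$configKeep[best, ]               # its variable-inclusion vector
```

With the real `res`, the selected column names would be `colnames(train)[res$configKeep[best, ] == 1]`.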
library(reshape)
library(ggplot2); theme_set(theme_bw())
tabu.df=data.frame(res$configKeep)
colnames(tabu.df)=colnames(train)[1:(ncol(train)-1)]
tabu.df$Iteration=1:nrow(tabu.df)
tabu.df$RSquared=res$eUtilityKeep
tabu.df$Rank=rank(tabu.df$RSquared)
tabu.melt=melt(tabu.df,id=c("Iteration","RSquared","Rank"))
tabu.melt$RSquared=ifelse(tabu.melt$value==1,tabu.melt$RSquared,0)
tabu.melt$Rank=ifelse(tabu.melt$value==1,tabu.melt$Rank,0)
(pHeat01 < ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "steelblue",guide=FALSE))
(pHeatRank < ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = Rank)) +
scale_fill_gradient(low = "white", high = "steelblue"))
# simulate the data
x1=rnorm(1000)
x2=rnorm(1000,x1,1)
y=2*x1+rnorm(1000,0,.5)
df=data.frame(y,x1,x2,x3=rnorm(1000),x4=rnorm(1000),x5=rnorm(1000))
# run the randomForest implementation
library(randomForest)
rf1 <- randomForest(y~., data=df, mtry=2, ntree=50, importance=TRUE)
importance(rf1,type=1)
# run the party implementation
library(party)
cf1 <- cforest(y~.,data=df,control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
varimp(cf1,conditional=TRUE)
> round(0.5,0)
[1] 0
> round(1.5,0)
[1] 2
> round(2.5,0)
[1] 2
> round(3.5,0)
[1] 4
> round(4.5,0)
[1] 4
> round(5.5,0)
[1] 6
> round(6.5,0)
[1] 6
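The output above is R's documented behavior, not a bug: `round()` follows the IEC 60559 standard and rounds halves to the nearest even digit ("banker's rounding"). If conventional round-half-up is needed, a small hand-rolled helper (the name `round_half_up` is mine) does it for non-negative inputs:

```r
# Round halves upward for non-negative x; base round() goes to the even digit.
round_half_up <- function(x, digits = 0) {
  m <- 10^digits
  floor(x * m + 0.5) / m
}
round_half_up(2.5)  # 3, whereas round(2.5) gives 2
```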
na.strings = "" # so the literal string "NA" is not turned into a missing value
comment.char = "" # so everything after a # sign is not lost
quote = "" # stray ' or " characters could otherwise mess with the data
check.names = FALSE # keep column names as in Excel (spaces, special characters, etc.); reference such columns with `column name` in R
For writing, na = "" outputs missing values as empty strings instead of the default "NA".
Second, you can increase the clipboard size by using e.g. "clipboard-10240" instead of "clipboard" (the size is in Kb, so that is about 10 Mb; see the Clipboard section of the help for connections), which allows copying and pasting larger tables.
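Putting those options together in one call: in the sketch below a text string stands in for the clipboard so the example is self-contained (on Windows you would pass "clipboard" or "clipboard-10240" in place of the `text =` argument):

```r
# Tab-separated sample with a spaced header, a # in a name, and a literal "NA".
txt <- "a b\tvalue # note\nNA\t10"
dat <- read.delim(text = txt, na.strings = "", comment.char = "",
                  quote = "", check.names = FALSE)
names(dat)  # kept verbatim: "a b" and "value # note"
dat[1, 1]   # the string "NA", not a missing value
```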
y <- scan()
Also, it seems that, libreOffice also uses clipboard to store copied things. This function also works for libreOffice
read.delim("clipboard")
(The "clipboard" parameter is Windows-only for the foreseeable future.)
From "?read.delim"
read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
Ernesto
read.clipboard.mac <- function(header=TRUE, ...) {
read.table(pipe("pbpaste"), sep="\t", header=header, ...)
}
Ernesto