Statistically Significant

Andrew Landgraf's Blog

Text Decryption Using MCMC


The famous probabilist and statistician Persi Diaconis wrote an article not too long ago about the "Markov chain Monte Carlo (MCMC) Revolution." The paper describes how we are able to solve a diverse set of problems with MCMC. The first example he gives is a text decryption problem solved with a simple Metropolis-Hastings sampler.

I was always stumped by those cryptograms in the newspaper and thought it would be pretty cool if I could crack them with statistics. So I decided to try it out on my own. The example Diaconis gives is fleshed out in more detail by its original authors in their own article.

The decryption I will be attempting is called a substitution cipher, where each letter of the alphabet corresponds to another letter (possibly the same one). This is the rule that you must follow for the newspaper cryptogram game too. There are 26! ≈ 4×10^26 possible mappings of the letters of the alphabet to one another. This seems like a hopeless problem to solve, but I will show you that a relatively simple solution can be found as long as both your reference text and the text you are trying to decipher are large enough.
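To make the setup concrete, here is a minimal toy illustration of applying a random substitution cipher in R (my own example, not part of the decryption code later in the post):

set.seed(1)
cipher <- sample(LETTERS)   # cipher[k] is the letter that the k-th letter of the alphabet encodes to
encode <- function(txt) chartr(paste(LETTERS, collapse=""), paste(cipher, collapse=""), toupper(txt))
encode("TO BE OR NOT TO BE") # same text with every letter consistently swapped; spaces are untouched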

The strategy is to use a reference text to create transition probabilities from each letter to the next. These can be stored in a 26 by 26 matrix, where the entry in the ith row and jth column is the probability of the jth letter given that the ith letter preceded it. For example, given that the previous letter was a Q, a U almost always follows, so the entry in the Q-row and U-column would be close to 1. Assuming these one-step transition probabilities are all that matter, the likelihood of any mapping is the product of the transition probabilities observed in the decoded text.
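Written out (in my own notation), if P(y | x) is the estimated probability that letter y follows letter x, and s_1, s_2, ..., s_n is the decoded text under a candidate mapping m, then the quantity being compared across mappings is

L(m) = ∏_k P(s_k | s_(k-1)),

and in the code below I work with its logarithm to avoid numerical underflow.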

To create a transition matrix, I downloaded War and Peace from Project Gutenberg. I looped through each letter and kept track of the number of times each letter followed the previous one. I also kept track of a 27th character, which was anything that was not a letter – for example, periods, commas, and spaces. This lets me know which letters are more likely to start or end a word. Consecutive occurrences of these special characters are not counted.

After I have these counts, I normalize by dividing by the row totals. Before normalizing, I add 1 to each cell so that there are no probabilities of 0. This corresponds to prior information that each transition is equally likely. I could have tried to give more informative prior information (like Q to U being likely), but it would have been difficult and probably inaccurate for the whole matrix.
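In other words (my notation), the smoothed transition probability is

P(j | i) = (c_ij + 1) / ∑_k (c_ik + 1),

where c_ij is the number of times letter j followed letter i in the reference text.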

Below is the code for creating the transition probability matrix. Note that I loop through each line and within each line I loop through each character.

reference=readLines("warandpeace.txt")
reference=toupper(reference)

trans.mat=matrix(0,27,27)
rownames(trans.mat)=colnames(trans.mat)=c(toupper(letters),"")
lastletter=""
for (ln in 1:length(reference)) {
  if (ln %% 1000 == 0) {cat("Line",ln,"\n")}
  for (pos in 1:nchar(reference[ln])) {
    curletter=substring(reference[ln],pos,pos)
    if (curletter %in% toupper(letters)) {
      trans.mat[rownames(trans.mat)==lastletter,
                colnames(trans.mat)==curletter]=
        trans.mat[rownames(trans.mat)==lastletter,
                  colnames(trans.mat)==curletter]+1
      lastletter=curletter
    } else {
      if (lastletter!="") {
        trans.mat[rownames(trans.mat)==lastletter,27]=
          trans.mat[rownames(trans.mat)==lastletter,27]+1
        lastletter=""
      }
    }
  }
  curletter=""
  if (lastletter!="") {
    trans.mat[rownames(trans.mat)==lastletter,27]=
      trans.mat[rownames(trans.mat)==lastletter,27]+1
  }
  lastletter=""
}

trans.prob.mat=sweep(trans.mat+1,1,rowSums(trans.mat+1),FUN="/")


I like to visualize my data to make sure everything looks correct. Below is a plot of the transition matrix. The blank row and column correspond to the non-letter character. A lot of insights can be garnered from this plot. The Q-U relationship pops out. T is the most likely letter to start a word. E is a popular letter to follow many others. Y is likely to end a word.

library(ggplot2)
library(reshape2)  # for melt()

ggplot(melt(trans.prob.mat),aes(Var2,Var1))+geom_tile(aes(fill=value))+
  scale_fill_gradient(low="white",high="black",limits=c(0,1))+
  labs(x="Probability of Second Letter",y="Conditioning on First Letter",fill="Prob")+
  scale_y_discrete(limits = rev(levels(melt(trans.prob.mat)$Var1)))+
  coord_equal()


The desired solution is the mapping that gives the highest likelihood. This is a hard combinatorial optimization problem: exhaustively trying all 26! mappings and reporting the one with the highest likelihood is infeasible.

With the MCMC approach, you start with a random mapping of letters. Next, you propose a new mapping by randomly switching 2 of the letters in the mapping. So if A mapped to G and L mapped to Z and you switched those two, A would map to Z and L would map to G. With this new mapping, you compute the likelihood and divide it by the likelihood of the previous mapping. If this ratio is greater than 1, you move to the new mapping. If the ratio is less than 1, meaning the new mapping is less likely, you move to it with probability equal to the ratio. (Practically, this is done by drawing a random uniform number between 0 and 1 and moving to the proposed mapping if the uniform number is less than the ratio.) Repeat this process until you think you have found a solution.
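In symbols (my notation), a proposed mapping m' is accepted with probability

min(1, L(m') / L(m)),

which, on the log scale used in the code below, amounts to accepting whenever a Uniform(0,1) draw is less than exp(log L(m') - log L(m)).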

To do this, I first wrote a few functions: one to decode the coded text based on the mapping, and another to calculate the log-likelihood of the decoded text.

decode <- function(mapping,coded) {
  coded=toupper(coded)
  decoded=coded
  for (i in 1:nchar(coded)) {
    if (substring(coded,i,i) %in% toupper(letters)) {
      substring(decoded,i,i)=toupper(letters[mapping==substring(coded,i,i)])
    }
  }
  decoded
}

log.prob <- function(mapping,decoded) {
  logprob=0

  lastletter=""
  for (i in 1:nchar(decoded)) {
    curletter=substring(decoded,i,i)
    if (curletter %in% toupper(letters)) {
      logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,
                                         colnames(trans.mat)==curletter])
      lastletter=curletter
    } else {
      if (lastletter!="") {
        logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
        lastletter=""
      }
    }
  }

  if (lastletter!="") {
    logprob=logprob+log(trans.prob.mat[rownames(trans.mat)==lastletter,27])
    lastletter=""
  }
  logprob
}

To show how this works, I will use the same Shakespeare text that is used in the MCMC Revolution paper. I let the sampler run until it has accepted 2000 different mappings (not the same as 2000 iterations, because rejected proposals are not counted). Along the way I keep track of the most likely mapping visited, and that is the final estimate. Note that I am only considering the mapping of letters; the special characters are assumed to be known. For example, spaces and periods are assumed to be unchanged by the cipher.

correctTxt="ENTER HAMLET HAM TO BE OR NOT TO BE THAT IS THE QUESTION WHETHER TIS NOBLER IN THE MIND TO SUFFER THE SLINGS AND ARROWS OF OUTRAGEOUS FORTUNE OR TO TAKE ARMS AGAINST A SEA OF TROUBLES AND BY OPPOSING END"
coded=decode(sample(toupper(letters)),correctTxt) # randomly scramble the text

mapping=sample(toupper(letters)) # initialize a random mapping
i=1
iters=2000
cur.decode=decode(mapping,coded)
cur.loglike=log.prob(mapping,cur.decode)
max.loglike=cur.loglike
max.decode=cur.decode
while (i<=iters) {
  proposal=sample(1:26,2) # select 2 letters to switch
  prop.mapping=mapping
  prop.mapping[proposal[1]]=mapping[proposal[2]]
  prop.mapping[proposal[2]]=mapping[proposal[1]]

  prop.decode=decode(prop.mapping,coded)
  prop.loglike=log.prob(prop.mapping,prop.decode)

  if (runif(1)<exp(prop.loglike-cur.loglike)) {
    mapping=prop.mapping
    cur.decode=prop.decode
    cur.loglike=prop.loglike

    if (cur.loglike>max.loglike) {
      max.loglike=cur.loglike
      max.decode=cur.decode
    }

    cat(i,cur.decode,"\n")
    i=i+1
  }
}


The output of this example is below. You can see that it comes close but doesn't quite find the correct mapping. I attribute this to the fact that the text I was trying to decode had only 203 characters. In the paper mentioned above, they decoded the whole play. They further say that their method works just as well on a randomly sampled text snippet about 2000 characters long. Obviously my example had well under this. This is also a problem when trying to solve a newspaper cryptogram, because those are even shorter than the Hamlet example I used, so unfortunately this simple method cannot be used for that. (Note that I ran the MCMC chain several times, using different random starting points, until it came reasonably close to the solution. This is something that the authors also suggest doing.)



I also want to note that I originally used Alice in Wonderland as the reference text. It had more trouble decoding, since that book is much smaller than War and Peace and therefore the transition probabilities were not estimated as well.

The MCMC method resembles a greedy approach in that it always moves to a proposed mapping that increases the likelihood. Purely greedy methods are not optimal in general because they can get stuck at local modes, especially in high-dimensional problems. MCMC avoids this by also moving to less likely mappings with some probability, which helps it escape local modes and, given enough time, find the most likely mapping.

Comments

Vivek
Andrew,

This work is cool!

I wrote my own substitution cipher decoder using the algorithm described in Stephen Connor's dissertation as the basis, but I occasionally glanced at your work as well.

My code doesn't work as awesomely as yours, and it is also much slower (possibly because I use 76 versus the 27 characters, reachChar versus readLine), and I use different functions.

But I cited your work and copied your idea for the melt matrix, which is ingenious.

Thanks so much!

Vivek
Andrew Landgraf
Thanks, Fernando. I knew I could speed it up by treating it as a vector of letters (or integers), but didn't take the time to figure it out.
Fernando
Nice work, but the code to get the transition matrix is VERY slow. I manage to make it faster doing the following:
1- get the vector of words (using scan() for example).
2- For each word, use strsplit(word, NULL)[[1]] and iterate
over the characters to populate the matrix.
Andrew
I have had that issue working on different computers. I think older versions of reshape2 use X# and newer versions use Var#. Or maybe it's a difference between using the package reshape and reshape2?
Pablik
Great work! Sometimes I use Persi's example for teaching MCMC,
there are some old matlab code from the original consultancy work,
but I never tried to replicate the example.

I got an error at:
ggplot(melt(trans.prob.mat),aes(Var2,Var1))+…

the melt() function uses X1 and X2:
> tmp <- melt(trans.prob.mat)
> head(tmp)
X1 X2 value
1 A A 3.893588e-05
2 B A 9.252956e-02

Cheers,

Pablo

Restricted Boltzmann Machines in R


Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic, undirected graphical model. They are becoming more popular in machine learning due to recent success in training them with contrastive divergence. They have proven useful in collaborative filtering, being one of the most successful methods in the Netflix challenge (paper). Furthermore, they have been instrumental in training deep learning models, which appear to be the best current models for image and digit recognition.

I do not want to go into too much detail, but I would like to give a high-level overview of RBMs. Edwin Chen has an introduction that is much better. The usual version of an RBM is a probabilistic model for binary vectors. An image can be represented as a binary vector if each pixel that is dark enough is represented as a 1 and the non-dark pixels as 0's. In addition to the visible binary vector, it is assumed that there is a hidden binary vector, and each hidden unit is connected to each visible unit by a symmetric weight. The weights are represented by the matrix W, where the (i,j)th component is the weight between hidden unit i and visible unit j. It is important that there are no connections between hidden units or between visible units. The probability that visible unit j is 1 is the inverse logistic function of the sum of the weights connected to visible unit j, restricted to the hidden units that are 1. In math notation (where σ is the inverse logistic, or sigmoid, function):

Pr(v_j = 1 | h, W) = σ(∑_i h_i W_{i,j}).

The weights are symmetric, so

Pr(h_i = 1 | v, W) = σ(∑_j v_j W_{i,j}).
As you can see, the model is defined by its conditional probabilities. The task is to find the weight matrix W that best explains the visible data for a given number of hidden units.

I have been taking Geoff Hinton's Coursera course on neural networks, which I would recommend to anyone interested. One of the programming assignments was to fill in code to write an RBM in Matlab/Octave. I have since tried to find a version for R, but have not had any luck, so I decided to translate the code myself. Below is the code to calculate the weight matrix. It is fairly simple, and I only use contrastive divergence 1 (CD-1). The input data is p×n (visible units by cases) instead of the usual n×p.
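For reference (my notation, not a quote from the course materials), the CD-1 update computed below approximates the gradient of the average goodness as

ΔW ≈ ⟨h vᵀ⟩_data − ⟨h vᵀ⟩_reconstruction,

where the first term averages h vᵀ over the mini-batch with hidden states sampled from the data, and the second uses the one-step reconstruction; this is the difference vh0 − vh1 returned by cd1 below.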

# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is a matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>

visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
  1/(1+exp(-rbm_w %*% visible_state))
}

hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
  1/(1+exp(-t(rbm_w) %*% hidden_state))
}

configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
  out=0
  for (i in 1:dim(visible_state)[2]) {
    out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
  }
  out/dim(visible_state)[2]
}

configuration_goodness_gradient <- function(visible_state, hidden_state) {
  hidden_state %*% t(visible_state)/dim(visible_state)[2]
}

sample_bernoulli <- function(mat) {
  dims=dim(mat)
  matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
}

cd1 <- function(rbm_w, visible_data) {
  visible_data = sample_bernoulli(visible_data)
  H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
  vh0=configuration_goodness_gradient(visible_data, H0)
  V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
  H1=visible_state_to_hidden_probabilities(rbm_w, V1)
  vh1=configuration_goodness_gradient(V1, H1)
  vh0-vh1
}

rbm <- function(num_hidden, training_data, learning_rate, n_iterations, mini_batch_size=100, momentum=0.9, quiet=FALSE) {
  # This trains a model that's defined by a single matrix of weights.
  # <num_hidden> is the number of hidden units.
  # cd1 is a function that takes parameters <model> and <data> and returns the gradient
  # (or approximate gradient in the case of CD-1) of the function that we're maximizing.
  # Note the contrast with the loss function that we saw in PA3, which we were minimizing.
  # The returned gradient is an array of the same shape as the provided <model> parameter.
  # This uses mini-batches, no weight decay, and no early stopping.
  # This returns the matrix of weights of the trained model.
  n=dim(training_data)[2]
  p=dim(training_data)[1]
  if (n %% mini_batch_size != 0) {
    stop("the number of test cases must be divisible by the mini_batch_size")
  }
  model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
  momentum_speed = matrix(0,num_hidden,p)

  start_of_next_mini_batch = 1
  for (iteration_number in 1:n_iterations) {
    if (!quiet) {cat("Iter",iteration_number,"\n")}
    mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
    start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
    gradient = cd1(model, mini_batch)
    momentum_speed = momentum * momentum_speed + gradient
    model = model + momentum_speed * learning_rate
  }
  return(model)
}

I loaded the handwritten digit data that was given in the class. To train the RBM, use the syntax below.

weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
mini_batch_size=100, momentum=0.9)

After training the weights, I can visualize them. Below is a plot of the strength of the weights going to each pixel. Each facet displays the weights going to/coming from one of the hidden units. I trained 30 units so that it would be easy to show them all in one plot. You can see that most of the hidden units correspond strongly to one digit or another. I think this means it is finding the joint structure of all of the pixels and that pixels representing those numbers are often darkened together.

library(ggplot2)
library(reshape2)
mw=melt(weights)
mw$Var3=floor((mw$Var2-1)/16)+1
mw$Var2=(mw$Var2-1)%%16 + 1
mw$Var3=17-mw$Var3
ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
  scale_fill_continuous(low='white',high='black')+coord_equal()+
  labs(x=NULL,y=NULL,title="Visualization of Weights")+
  theme(legend.position="none")

Comments

Andrew Landgraf
There are no biases in this implementation. You could make v_1=1, but there still wouldn't be biases in the hidden units.
sidharth
where are the biases ? have you included them in the weights ?
Andrew Landgraf
The weights are randomly initialized, so any 2 runs will give different results. However, if your figure looks similar to mine, in that it looks like the weights represent numbers, I would say it is working correctly.
Xiaolin Gui
Thank you for sharing your code. I found that my output figure differed from yours when I ran the experiment. I have loaded the MNIST image data successfully, and the data's format is 784×60000. The pixels have been binarized (>= 127 → 1).
I hope you give me some advice.
email:guixiaolinde@gmail.com
Anonymous
I have managed to import the training data (from the Coursera website) into R:

library(R.matlab)
dat = readMat("data_set.mat")
……
weights=rbm(num_hidden=30, training_data= dat$data[,,1]$training[[1]], learning_rate=.09, n_iterations=5000,
mini_batch_size=100, momentum=0.9)
Kent Johnson
R code for assignment 4 was posted here:
https://class.coursera.org/neuralnets-2012-001/forum/thread?thread_id=87&post_id=5105
I used it as the basis for a work project.
Zachary Mayer
I believe there's also a kaggle competition using that data right now: http://www.kaggle.com/c/digit-recognizer
Andrew
I don't want to post it here because of possible copyright violations, but you can download it in Octave format from the Coursera site (in programming assignment 4) or you can find it at this webpage: http://yann.lecun.com/exdb/mnist/
Anonymous
Do you have a sample of training data?

Factor Analysis of Baseball’s Hall of Fame Voters


Recently, Nate Silver wrote a post analyzing how voters who voted for and against Barry Bonds for Baseball's Hall of Fame differed. Not surprisingly, those who voted for Bonds were more likely to vote for other suspected steroid users (like Roger Clemens). This got me thinking that this would make an interesting case study for factor analysis, to see if there are latent factors that drive Hall of Fame voting.

The Twitter user @leokitty has kept track of all the known ballots of the voters in a spreadsheet. The spreadsheet is a matrix that has one row for each voter and one column for each player being voted upon. (Players need 75% of the vote to make it to the hall of fame.) I removed all players that had no votes and all voters that had given a partial ballot.

(This matrix either has a 1 or a 0 in each entry, corresponding to whether a voter voted for the player or not. Note that this kind of matrix is similar to the data that is analyzed in information retrieval. I will be decomposing the (centered) matrix using singular value decomposition to run the factor analysis. This is the same technique used for latent semantic indexing in information retrieval.)
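As a quick sketch of that connection (my addition, using the votes matrix that is read in just below), the principal component scores and loadings could equivalently be obtained from the SVD of the centered vote matrix:

# Sketch only: the same decomposition that underlies latent semantic indexing,
# applied to the centered 0/1 vote matrix (votes is read in below).
X <- scale(as.matrix(votes), center = TRUE, scale = FALSE)
sv <- svd(X)
scores <- sv$u %*% diag(sv$d)   # matches princomp(votes)$scores up to column signs
loadings <- sv$v                # matches the princomp loadings up to column signs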

Starting with the analysis, there is a drop off of the variance after the first 2 factors, which means it might make sense to look only at the first 2 (which is good because I was only planning on doing so).

votes = read.csv("HOF votes.csv", row.names = 1, header = TRUE)
pca = princomp(votes)
screeplot(pca, type = "l", main = "Scree Plot")

Looking at the loadings, it appears that the first principal component corresponds strongly to steroid users, with Bonds and Clemens having large negative values and other suspected steroid users also being on the negative end. The players on the positive end have no steroid suspicions.

dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")

The second component isn't as easy to decipher. The players at the negative end seem to be players preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters, and yet he isn't that far separated from them.

There may also be some secondary steroid association in this component, separating players for whom there is proof of steroid use from those with no proof but who "look like" they took steroids. For example, there is no hard evidence that Bagwell or Piazza took steroids, but they were very muscular and hit a lot of home runs, so they are believed to have taken steroids. There is some sort of evidence that the top five players of this component did take steroids.

dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")

Projecting the votes onto two dimensions, we can look at how the voters for Bonds and Clemens split up. You can see there is a strong horizontal split between the voters for and against Bonds/Clemens. There are also 3 voters that voted for Bonds, but not Clemens.

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
Comp.2, colour = as.factor(Barry.Bonds), shape = as.factor(Roger.Clemens)),
size = 4) + coord_equal() + labs(colour = "Bonds", shape = "Clemens")

Similarly, I can look at how the voters split up on the issue of steroids by looking at both Bonds and Bagwell. The voters in the upper left do not care about steroid use, but believe that Bagwell wasn’t good enough to make it to the hall of fame. The voters in the lower right do care about steroid use, but believe that Bagwell was innocent of any wrongdoing.

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
    Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Jeff.Bagwell))), size = 4) +
    geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
    labs(colour = "Bonds / Bagwell")

We can also look at a similar plot with Schilling instead of Bagwell. The separation here appears to be stronger.

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
    Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) +
    geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2) + coord_equal() +
    labs(colour = "Bonds / Schilling")

Finally, we can look at a biplot (using code from here).

PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
  # PC being a prcomp object
  library(grid)
  data <- data.frame(obsnames = row.names(PC$x), PC$x)
  plot <- ggplot(data, aes_string(x = x, y = y)) +
    geom_text(alpha = 0.4, size = 3, aes(label = obsnames))
  plot <- plot + geom_hline(yintercept = 0, size = 0.2) + geom_vline(xintercept = 0, size = 0.2)
  datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
  # scale the variable arrows relative to the range of the scores
  mult <- min((max(data[, y]) - min(data[, y]))/(max(datapc[, y]) - min(datapc[, y])),
              (max(data[, x]) - min(data[, x]))/(max(datapc[, x]) - min(datapc[, x])))
  datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult * (get(y)))
  plot <- plot + coord_equal() +
    geom_text(data = datapc, aes(x = v1, y = v2, label = varnames), size = 5, vjust = 1, color = "red")
  plot <- plot + geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1, yend = v2),
                              arrow = arrow(length = unit(0.2, "cm")), alpha = 0.75, color = "red")
  plot
}

fit <- prcomp(votes, scale = F)
PCbiplot(fit)

I could have also attempted to rotate the factors to make them more interpretable, but they appeared to have an easy interpretation as is. Since we were looking at 2-D plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate the factors. I chose the principal component method because the data are definitely not normal (being 0's and 1's).
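If I had wanted to try a rotation, a varimax rotation of the first two components would be a short addition (a hypothetical sketch, not something used in the analysis above):

# Sketch only: varimax rotation of the first two principal component loadings.
rot <- varimax(pca$loadings[, 1:2])                 # rotate the first two loading vectors
rotated.loadings <- rot$loadings                    # rotated loadings
rotated.scores <- pca$scores[, 1:2] %*% rot$rotmat  # scores expressed in the rotated basis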

Quick Post About Getting and Plotting Polls in R


With the election nearly upon us, I wanted to share an easy way I just found to download polling data and graph a few polls with ggplot2. dlinzer on GitHub created a function to download poll data from the Huffington Post's Pollster API.

The default is to download national tracking polls from the presidential election. After sourcing the function, I load the required packages, download the data, and make the plot.

library(XML)
library(reshape)
library(ggplot2); theme_set(theme_bw())
 
dat <- pollstR(pages=20)
ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+geom_point(alpha=.5)+geom_smooth(aes(weight=sqrt(N)))+
  geom_hline(aes(yintercept=0.5),lty=2,size=1)+
  labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)
I have used transparency so that you can see when there are many polls on top of each other. You can see that Obama’s lead decreased substantially after the first debate but has crawled back up since then. Of course, I am treating all polls as equal (although I am weighting by sample size) when the truth is that some polls are better than others and some are biased.

To have some more fun, I will show what some of the data from swing states look like. The code below loops through the swing states and downloads the polls. Then it plots the polls for each state in different facets.

swing.states=c("ohio","florida","virginia","colorado","nevada","north-carolina")
for (s in swing.states) {
  print(s)
  dat.state <- pollstR(chart=paste("2012-",s,"-president-romney-vs-obama",sep=""),pages="all")
  dat.state=subset(dat.state,select=c("id","pollster","start.date","end.date","method","N","Obama","Romney"))
  dat.state$State=s

  if (s=="ohio") {
    dat=dat.state
  } else {
    dat=rbind(dat,dat.state)
  }
}
 
library(lubridate)
dat$end.date=ymd(as.character(dat$end.date))
ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+geom_point(alpha=.5)+geom_smooth(aes(weight=sqrt(N)))+
  geom_hline(aes(yintercept=0.5),lty=2,size=1)+
  labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)+
  facet_wrap(~State)+xlim(c(mdy("8/1/2012"),mdy("11/6/2012")))

Unfortunately the x-axis didn’t show up very well, but it starts at August 1. There have been quite a few polls in Ohio and Florida, haven’t there? The state polls did not have nearly the same shift that the national poll did in reaction to the first debate. The state with the largest bump is Colorado, where the debate was held.

By just looking at the tracking polls, I think you would make the same conclusions that Nate Silver has with his fancy model. Ohio, Virginia, Nevada, and Colorado favor Obama. North Carolina favors Romney and Florida just barely tips toward Romney as well.

Finally, here are just the smoothed running means, all on one plot. You can see that there was also a first-debate effect in Ohio.

ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+geom_smooth(aes(colour=State,weight=sqrt(N)),se=FALSE,size=2)+
  geom_hline(aes(yintercept=0.5),lty=2,size=1)+
  labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)+
  xlim(c(mdy("8/1/2012"),mdy("11/6/2012")))

Comments

Andrew Clark
Interesting work. Thanks for the tip on the API/github. I have had a look at the GOP campaign - surely one of the strangest in history
http://wp.me/p17axt-jK
Andrew
Thanks for the tip. To add more information to the national tracker plot, you can make the size of the points relative to the number polled. Also facet it by the type of poll. Like this:

ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+
geom_point(aes(size=sqrt(N)),alpha=.5)+
geom_smooth(aes(weight=sqrt(N)))+
geom_hline(aes(yintercept=0.5),lty=2,size=1)+
labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)+
facet_wrap(~method)+
theme(axis.text.x=element_text(angle=-90))
Tony Hirst
You can tidy up the x-axis labels by rotating them, eg using something like:
+theme(axis.text.x=element_text(angle=-90))

Finding the Best Subset of a GAM Using Tabu Search and Visualizing It in R


Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove a variable one at a time. The problem with this is that it is a greedy approach, and you can easily get stuck in a local mode. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).

Heuristics are required if the number of variables, p, gets large (> 40), because the number of possible subsets is 2^p, excluding interactions. One method that works well is called tabu search (TS). It is a more general optimization method, but in the case of finding the best subset it is similar to stepwise methods. At each step, it tries to improve the objective function the most by adding or removing one variable. One difference is that TS will not add or remove variables that have been recently added or removed. This keeps the optimization from getting stuck at a local maximum. There are more details to the algorithm that you can read up on if you would like.
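To make that concrete, here is a rough sketch of a single tabu move (my own simplification, not the internals of the tabuSearch package used below):

# Rough sketch of one tabu search step. th is a 0/1 inclusion vector, evaluate()
# scores a subset, and tabu holds the indices of recently flipped variables.
tabu_step <- function(th, evaluate, tabu) {
  candidates <- setdiff(seq_along(th), tabu)   # moves that are not currently tabu
  scores <- sapply(candidates, function(j) {
    th2 <- th
    th2[j] <- 1 - th2[j]                       # add or remove variable j
    evaluate(th2)
  })
  best <- candidates[which.max(scores)]        # best single flip, accepted even if it is worse
  th[best] <- 1 - th[best]
  list(th = th, tabu = c(tail(tabu, 4), best)) # keep a short list of recent moves as tabu
}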

There is a tabuSearch package in R that implements this algorithm. I wanted to find a best subset regression using generalized additive models (GAMs) from the package mgcv. An issue arose because, to use mgcv, you need to specify in the formula which terms are smooth and which are not. The general form of the formula looks like:
gam(y ~ s(x1) + s(x2) + x3, data=train)
where x1 and x2 are smooth terms and x3 is treated like it is in a linear model.

My data set has a combination of continuous variables, which I want to treat as smooth, and categorical variables, which I want to treat as factors. For each variable in the subset, I need to identify whether it is a factor or not, and then create a string of the variable name with or without s() around it.

For example, th is a vector of 0's and 1's which indicates whether each variable (column) is in the regression. I used the housing data set from the UCI ML repository. With 13 explanatory variables, there are 8192 possible main-effects models. I am predicting MEDV, so I start the formula string with "MEDV ~". Then I loop through each element of th. If it is 1, I want to add that variable to the formula. I check whether it is a factor; if so, I just add the name of the variable plus a "+". If it is continuous, I add the column name with "s()" around it. Finally, I convert the string to a formula using as.formula(). I can then plug this formula into the gam function.

num.cols=sum(th)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
  if (th[i]>0) {
    if (is.factor(train[,i])) {
      fo.str=paste(fo.str,colnames(train)[i],sep=" ")
    } else {
      fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
    }
    cum.cols=cum.cols+1
    if (cum.cols<num.cols) {
      fo.str=paste(fo.str,"+")
    }
  }
}
fo=as.formula(fo.str)

For evaluating the subsets, I split the data into training, validation, and test sets. I trained each subset on the training set and measured the R-squared of its predictions on the validation set. The full code can be found below.

library(mgcv)
library(tabuSearch)
# http://archive.ics.uci.edu/ml/datasets/Housing
housing=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing)=c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

housing$CHAS=as.factor(housing$CHAS)
housing$RAD=as.factor(housing$RAD) # Changed to factor bc only 9 unique values
summary(housing)

set.seed(20120823)
cv=sample(nrow(housing))
train=housing[cv[1:300],]
valid=housing[cv[301:400],]
test=housing[cv[401:nrow(housing)],]

ssto=sum((valid$MEDV-mean(valid$MEDV))^2)
evaluate <- function(th){
  num.cols=sum(th)
  if (num.cols == 0) return(0)
  fo.str="MEDV ~"
  cum.cols=0
  for (i in 1:length(th)) {
    if (th[i]>0) {
      if (is.factor(train[,i])) {
        fo.str=paste(fo.str,colnames(train)[i],sep=" ")
      } else {
        fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
      }
      cum.cols=cum.cols+1
      if (cum.cols<num.cols) {
        fo.str=paste(fo.str,"+")
      }
    }
  }
  # colnames(train)[c(th,0)==1]
  fo=as.formula(fo.str)
  gam1 <- gam(fo,data=train)
  pred1 <- predict(gam1,valid,se=FALSE)
  sse <- sum((pred1-valid$MEDV)^2,na.rm=TRUE)
  return(1-sse/ssto)
}

res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5,
                  config = rbinom(ncol(train)-1,1,.5), nRestarts = 4, verbose=TRUE)

It was able to find a subset with a 0.8678 R-squared on the validation set (and 0.8349 on the test set). The formula found was:

MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT).


Visualizing the results

The tabu search gave me the subset it thinks is best. But I would like to get a better handle on how it arrived at it, whether there were lots of models of similar quality but with different variables, and whether there were variables that were always in the top-performing models. I created a heat plot that shows whether or not a variable was in the regression at each iteration.

Below you can see which variables were included at each iteration. There are a few variables that seem to be more important because they are included in almost every iteration, like LSTAT, PTRATIO, and RM. But this doesn't tell me which iterations were the best.


In the chart below, I shaded each region by the ranking of the model in that iteration. A higher ranking means it did better. It is not easy, but we can glean a little more information from this chart. The models with RAD and DIS do significantly better than the models without them, even though they are not in every iteration. Further, the models with AGE seem a bit worse than those without it.

The R code to make these plots is below.

library(reshape)
library(ggplot2); theme_set(theme_bw())
 
tabu.df=data.frame(res$configKeep)
colnames(tabu.df)=colnames(train)[1:(ncol(train)-1)]
tabu.df$Iteration=1:nrow(tabu.df)
tabu.df$RSquared=res$eUtilityKeep
tabu.df$Rank=rank(tabu.df$RSquared)
tabu.melt=melt(tabu.df,id=c("Iteration","RSquared","Rank"))
tabu.melt$RSquared=ifelse(tabu.melt$value==1,tabu.melt$RSquared,0)
tabu.melt$Rank=ifelse(tabu.melt$value==1,tabu.melt$Rank,0)
(pHeat01 <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "steelblue",guide=FALSE))
(pHeatRank <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = Rank)) +
scale_fill_gradient(low = "white", high = "steelblue"))

Comments

anonymous
Good post!

Nice approach with the heat map to go beyond just accepting the optimal solution. I agree that variables seem to be important because they are present in every iteration, but this could perhaps also mean that the tabu search was trapped in a subregion - perhaps the search move operator which only adds or deletes a single variable is not powerful enough to escape this region. Some possible remedies for exploring a larger part of the search landscape (if indeed this is a problem) might be:

- Multiple tabu searches, each with a randomized set of initial variables in the model
- Other move operators. How about finding correlation clusters among the variables, then let a move add or delete the entire cluster? Both move operators can be used in the search, using some metaheuristics to guide them.
- Use metaheuristic techniques like "ruin and recreate" during the tabu search; occasionally, "ruin" the model by removing a random subset of the variables.

One more thought. With the tabu search already in place for regression, it wouldn't be too hard to explore some nonlinear models. You could create temp variables that are products of the original variables, then let the tabu search add or delete these as well. Do you think such an approach would have any merit in this case?
Andrew
@Anonymous I would also like to see how this does compared to the more standard methods. It will take a little effort to implement in this example, though, because these variable selection techniques are not implemented for GAMs (as far as I know). http://www.statmethods.net/stats/regression.html has a nice summary of different methods available for linear models. You are right that the whole point of using a TS is that it will be better than the other methods, so it would be good to have validation. Maybe I could do a comparison for linear models…

I described the meat of the TS algorithm in the post. If you want more info, I found this reference informative http://www.math.ntua.gr/~fouskakis/Publications/4.pdf
Andrew
@anspiess That is probably a good idea. But one reason to use Tabu Search is that it will speed up the time to find the best model. Running it many times will defeat this purpose. Running time for the parameters used was long already because GAMs take longer to train than LMs (~ 9 mins). In terms of better understanding TS, your idea would be useful. For what its worth, I ran it again with a different CV split and the chosen model was the same except it added s(ZN) and removed s(PTRATIO).
anspiess
Might it be feasible to sample training and test set a few hundred times to see how stable variable selection is?
Anonymous
I wonder how this method is better than using F test, t test or information criterion? did you compare tabusearch to other methods (for linear models)? maybe it its faster than standard methods? I would be grateful for answers:) if it its possible maybe you can describe this algorithm more closely.