Statistically Significant

Andrew Landgraf's Blog

Finding the Best Subset of a GAM Using Tabu Search and Visualizing It in R

| Comments

Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove a variable at a time. The problem with this is that it is a greedy approach and you can easily get stuck in a local mode. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).

Heuristics are required if the number of variables, p, gets large (>40) because the number of possible subsets is equal to 2^p, excluding interactions. One method that works well is called tabu search (TS). It is a more general optimization method, but in the case of finding the best subset, it is similar to stepwise methods. At any point, it will try to improve the objective function the most by adding or removing one variable. One difference is that TS will not add or remove variables that have been recently added or removed. This avoids the optimization from getting stuck at a local maximum. There are more details to the algorithm that you can read up on if you would like.

There is a tabuSearch package in R that implements this algorithm. I wanted to find a best subset regression using generalized additive models (GAMs) of the package mgcv. An issue arose, because you need to specify which terms are smooth and which are not in a formula to use mgcv. The general form of the formula looks like:
gam(y ~ s(x1) + s(x2) + x3, data=train)
where x1 and x2 are smooth terms and x3 is treated like it is in a linear model.

My data set has a combination of continuous variables, which I want to treat as smooth, and categorical variables, which I want to treat as factors. For each variable in the subset, I need to identify whether it is a factor or not, and then creating a string of the variable with or without s().

For example, th is a vector of 0’s and 1’s which indicate whether each variable (column) is in the regression. I used the housing data set from UCI ML repository. With 13 explanatory variables, there are 8192 possible main effects models. I am predicting MEDV, so I start the formula string with “MEDV ~”. Then I loop through each element of th. If it is 1 then I want to add it to the formula. I check if it is a factor and if so I just add the name of the variable plus a “+”. If it is continuous, I add the column name with ”s()” around it. Finally I convert the string to a formula using formula(). I can plug in this formula into the gam function.

num.cols=sum(th)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
fo=as.formula(fo.str)
Created by Pretty R at inside-R.org

For evaluating the subsets, I split the data into training, validation, and testing. I trained the subset on the training data set and measured the R-squared on the prediction of the validation set. The full code can be found below.

library(mgcv)
library(tabuSearch)
# http://archive.ics.uci.edu/ml/datasets/Housing
housing=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing)=c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
 
housing$CHAS=as.factor(housing$CHAS)
housing$RAD=as.factor(housing$RAD) # Changed to factor bc only 9 unique values
summary(housing)
 
set.seed(20120823)
cv=sample(nrow(housing))
train=housing[cv[1:300],]
valid=housing[cv[301:400],]
test=housing[cv[401:nrow(housing)],]
 
ssto=sum((valid$MEDV-mean(valid$MEDV))^2)
evaluate <- function(th){
num.cols=sum(th)
if (num.cols == 0) return(0)
fo.str="MEDV ~"
cum.cols=0
for (i in 1:length(th)) {
if (th[i]>0) {
if (is.factor(train[,i])) {
fo.str=paste(fo.str,colnames(train)[i],sep=" ")
} else {
fo.str=paste(fo.str," s(",colnames(train)[i],")",sep="")
}
cum.cols=cum.cols+1
if (cum.cols<num.cols) {
fo.str=paste(fo.str,"+")
}
}
}
# colnames(train)[c(th,0)==1]
fo=as.formula(fo.str)
gam1 <- gam(fo,data=train)
pred1 <- predict(gam1,valid,se=FALSE)
sse <- sum((pred1-valid$MEDV)^2,na.rm=TRUE)
return(1-sse/ssto)
}
 
res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5,
config = rbinom(ncol(train)-1,1,.5), nRestarts = 4,verbose=TRUE)

It was able to find a subset with a 0.8678 R-squared on the validation set (and 0.8349 on the test set). The formula found was:

MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT).


Visualizing the results

The tabu search gave me a subset which it thinks is best. But I would like to get a better handle on how it derived it, or if there were lots of models with similar quality, but different variables. Or if there were variables that were always in the top performing models. I created a heat plot that showed whether or not a variable was in the regression or not at each iteration.

Below you can see which variables were included in each iteration. There are a few variables that seem to be more important because they are included in almost every iteration, like LSAT, PTRATIO,and RM. But this doesn’t tell me which iterations were the best.


In the chart below, I shaded each region by the ranking of model in that iteration. The higher ranking mean it did better. It is not easy, but we can glean a little more information from this chart. The models with RAD and DIS do significantly better than the models without them, even though they are not in every iteration. Further, the models with AGE seem a bit worse than those without it.

The R code to make these plots is below.

library(reshape)
library(ggplot2); theme_set(theme_bw())
 
tabu.df=data.frame(res$configKeep)
colnames(tabu.df)=colnames(train)[1:(ncol(train)-1)]
tabu.df$Iteration=1:nrow(tabu.df)
tabu.df$RSquared=res$eUtilityKeep
tabu.df$Rank=rank(tabu.df$RSquared)
tabu.melt=melt(tabu.df,id=c("Iteration","RSquared","Rank"))
tabu.melt$RSquared=ifelse(tabu.melt$value==1,tabu.melt$RSquared,0)
tabu.melt$Rank=ifelse(tabu.melt$value==1,tabu.melt$Rank,0)
(pHeat01 <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = value)) +
scale_fill_gradient(low = "white", high = "steelblue",guide=FALSE))
(pHeatRank <- ggplot(tabu.melt, aes(Iteration,variable)) + geom_tile(aes(fill = Rank)) +
scale_fill_gradient(low = "white", high = "steelblue"))
Created by Pretty R at inside-R.org

Comments

anonymous
Good post!

Nice approach with the heat map to go beyond just accepting the optimal solution. I agree that variables seem to be important because they are present in every iteration, but this could perhaps also mean that the tabu search was trapped in a subregion - perhaps the search move operator which only adds or deletes a single variable is not powerful enough to escape this region. Some possible remedies for exploring a larger part of the search landscape (if indeed this is a problem) might be:

- Multiple tabu searches, each with a randomized set of initial variables in the model
- Other move operators. How about finding correlation clusters among the variables, then let a move add or delete the entire cluster? Both move operators can be used in the search, using some metaheuristics to guide them.
- Use metaheuristic techniques like "ruin and recreate" during the tabu search; occasionally, "ruin" the model by removing a random subset of the variables.

One more thought. With the tabu search already in place for regression, it wouldn´t be too hard to explore some nonlinear models. You could create temp variables that are products of the original variables, then let the tabu search add or delete these as well. Do you think such an approach would have any merit in this case?
Andrew
@Anonymous I would also like to see how this does compared to the more standard methods. It will take a little effort to implement in this example, though, because these variable selection techniques are not implemented for GAMs (as far as I know). http://www.statmethods.net/stats/regression.html has a nice summary of different methods available for linear models. You are right that the whole point of using a TS is that it will be better than the other methods, so it would be good to have validation. Maybe I could do a comparison for linear models…

I described the meat of the TS algorithm in the post. If you want more info, I found this reference informative http://www.math.ntua.gr/~fouskakis/Publications/4.pdf
Andrew
@anspiess That is probably a good idea. But one reason to use Tabu Search is that it will speed up the time to find the best model. Running it many times will defeat this purpose. Running time for the parameters used was long already because GAMs take longer to train than LMs (~ 9 mins). In terms of better understanding TS, your idea would be useful. For what its worth, I ran it again with a different CV split and the chosen model was the same except it added s(ZN) and removed s(PTRATIO).
anspiess
Might it be feasible to sample training and test set a few hundred times to see how stable variable selection is?
Anonymous
I wonder how this method is better than using F test, t test or information criterion? did you compare tabusearch to other methods (for linear models)? maybe it its faster than standard methods? I would be grateful for answers:) if it its possible maybe you can describe this algorithm more closely.

Comments