WeakLearner

Reputation: 938

Regression Subset Algorithm in R

I want to build a subset-selection algorithm in R for a beta regression model. The betareg package in R fits beta regressions, and what I am interested in is the model that maximises the log-likelihood.

Basically, this works by picking the best k-factor regression model for k = 1, 2, ..., p, where p is the number of variables you have.

For example, suppose I have x_1, x_2, x_3 as my variables and y as my response. I want something that does:

Step 1: Find the best 1-factor model

mod1 <- betareg(y~x_1, data = test)
mod1.sum <- summary(mod1)

mod2 <- betareg(y~x_2, data = test)
mod2.sum <- summary(mod2)

mod3 <- betareg(y~x_3, data = test)
mod3.sum <- summary(mod3)

Now that I have fit all the models, I want to compare the log-likelihood of each:

likelihoods <- c(mod1.sum$loglik, mod2.sum$loglik, mod3.sum$loglik)
which.max(likelihoods)

Step 2: Find the best factor to add to the best 1-factor model. Suppose x_1 was the best in the previous step; then in this step we compare the model with x_1 and x_2 to the model with x_1 and x_3, choosing the one with the larger log-likelihood.
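For example, assuming x_1 won step 1, this step would be something like:

mod12 <- betareg(y ~ x_1 + x_2, data = test)
mod13 <- betareg(y ~ x_1 + x_3, data = test)
# pick the 2-factor model with the larger log-likelihood
which.max(c(summary(mod12)$loglik, summary(mod13)$loglik))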

Step 3: Taking the best two variables as given, find the third variable that contributes the largest increase in log-likelihood.

Step 4: Return the best 1-factor model, best 2-factor model, ..., best p-factor model, the factors included in each, and their corresponding log-likelihoods.
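Roughly, what I have in mind is something like this formula-based loop (hypothetical variable names, ignoring efficiency):

# forward search: grow the model one variable at a time by log-likelihood
vars <- paste0("x_", 1:3)              # in my real problem this has ~40 names
chosen <- character(0)
best <- vector("list", length(vars))   # best k-factor model for each k
for (k in seq_along(vars)) {
  cand <- setdiff(vars, chosen)
  ll <- sapply(cand, function(v) {
    f <- reformulate(c(chosen, v), response = "y")
    logLik(betareg(f, data = test))
  })
  chosen <- c(chosen, cand[which.max(ll)])
  best[[k]] <- list(vars = chosen, loglik = max(ll))
}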

I am struggling to do this efficiently when p is large, say around 40.

Upvotes: 1

Views: 1055

Answers (3)

Achim Zeileis

Reputation: 17193

To the best of my knowledge there is no dedicated efficient implementation of best-subset selection for beta regression (in R or otherwise). However, there are some generic implementations that provide approximate solutions, e.g., based on genetic algorithms, such as the kofnGA package (on CRAN and published in JSS). See the example below. (To use forward search instead of best-subset selection, see my other answer.)

Alternatively, you could use a (generalized) linear model that approximates what betareg does and run the subset selection for that. For example, you could logit-transform the response (i.e., qlogis(y)) and then run best-subset selection for a linear regression via leaps (CRAN) or lmSubsets (R-Forge). Or you could use a GLM with family = quasibinomial and use glmulti (CRAN, JSS). You could then take the best-subset result from that approximate model and employ it in a beta regression. Of course, this will not give you the best beta regression result, but it might be a useful starting point for further analyses.
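For example, the logit-transform route with leaps could be sketched like this (dat is a hypothetical data frame with the response y strictly in (0, 1) and the candidate regressors as the remaining columns):

library("leaps")
## logit-transform the response, then run best-subset selection for a linear model
dat2 <- transform(dat, y = qlogis(y))
subs <- regsubsets(y ~ ., data = dat2, nvmax = ncol(dat2) - 1)
summary(subs)$which   ## regressors included in the best model of each size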

So, back to the direct genetic-algorithm approach for beta regression. To illustrate how this can be accomplished with kofnGA, we first load the packages and example data:

library("betareg")
library("kofnGA")
data("FoodExpenditure", package = "betareg")

Then, we construct a list with the response variable y and a regressor matrix x. Note that we omit the intercept here in order to force it into the model later (i.e., the intercept should not be subject to selection).

fe_data <- list(
  y = with(FoodExpenditure, food/income),
  x = model.matrix(~ income + persons, data = FoodExpenditure)[, -1]
)

In addition to the two regressors set up above, we now add 40 random noise variables to the regressor matrix:

fe_data$x <- cbind(fe_data$x,
  matrix(rnorm(40 * nrow(fe_data$x)), ncol = 40))
colnames(fe_data$x)[3:42] <- paste0("x", 1:40)

Now we can employ kofnGA to select the best model with 2 regressors out of the 42 potential regressors (plus the always-included intercept). As kofnGA minimizes an objective function, we use the negative log-likelihood provided by betareg. The workhorse function betareg.fit is used instead of betareg to avoid unnecessary formula parsing etc.

nll <- function(v, data) -betareg.fit(x = cbind(1, data$x[, v]),
  y = data$y)$loglik

Finally, we run the genetic algorithm just for 100 generations to save some computation time in this short example:

set.seed(1)
ga <- kofnGA(n = 42, k = 2, OF = nll, data = fe_data, ngen = 100)

The resulting output is

summary(ga)
## Genetic algorithm search, 100 generations
## Number of unique solutions in the final population: 1 
## 
## Objective function values:
##                      average   minimum
## Initial population -36.56597 -41.74583
## Final population   -45.33351 -45.33351
## 
## Best solution (found at generation 1):
## 1 2 

Thus, in this very simple artificial setup the genetic algorithm indeed picks the first 2 regressors (from the real data) and not any of the 40 irrelevant random regressors we added. So we could now go ahead and refit a proper beta regression model with the regressors

colnames(fe_data$x)[ga$bestsol]
## [1] "income"  "persons"

etc. Note that the beta regression employed here simply uses a fixed precision parameter (with log link). If you want variable dispersion, then you would need to modify nll correspondingly.
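For illustration, the refit mentioned above could be done via the regular formula interface (just a sketch; prop is a new column name holding the food/income share):

fe <- FoodExpenditure
fe$prop <- fe$food / fe$income
fe_best <- betareg(prop ~ income + persons, data = fe)
summary(fe_best)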

Upvotes: 1

Achim Zeileis

Reputation: 17193

In addition to my other answer, which shows how to do best-subset selection for beta regression with kofnGA, I'm including an example of how to do forward selection by hand.

We again start by loading the package and data:

library("betareg")
data("FoodExpenditure", package = "betareg")

I'm also setting up a list with the response plus all regressors (including 40 random ones). Note that, unlike in the other answer, I'm keeping the intercept in x, which is more convenient here.

fe_data <- list(
  y = with(FoodExpenditure, food/income),
  x = model.matrix(~ income + persons, data = FoodExpenditure)
)
set.seed(123)
fe_data$x <- cbind(fe_data$x,
  matrix(rnorm(40 * nrow(fe_data$x)), ncol = 40))
colnames(fe_data$x)[4:43] <- paste0("x", 1:40)

Then we again set up a function for the negative log-likelihood (but we do not need to include the intercept manually because it is still in x).

nll <- function(v, data) -betareg.fit(x = data$x[, v, drop = FALSE],
  y = data$y)$loglik

Then we store the indices of all possible regressors in vall and initialize our search with the intercept only (v <- 1) and the corresponding negative log-likelihood (n).

vall <- 1:ncol(fe_data$x)
v <- 1
n <- nll(v, data = fe_data)

And then we iterate our forward search for 15 iterations (to avoid numerical instabilities on this small data set for higher numbers of variables). In each iteration we select the additional variable that decreases the negative log-likelihood the most:

for(i in 1:15) {
  vi <- vall[-v]
  ni <- sapply(vi, function(vii) nll(v = c(v, vii), data = fe_data))
  v <- c(v, vi[which.min(ni)])
  n <- c(n, ni[which.min(ni)])
}

The sequence in which the variables are chosen is the following. Note that the real regressors are picked first, followed by the random noise regressors. (But try set.seed(1) above, which will include random regressors before the real ones...)

colnames(fe_data$x)[v]
##  [1] "(Intercept)" "income"      "persons"     "x28"         "x18"        
##  [6] "x29"         "x22"         "x11"         "x5"          "x8"         
## [11] "x38"         "x24"         "x13"         "x23"         "x36"        
## [16] "x16"        

The corresponding decrease in the negative log-likelihood and associated BIC can be visualized as:

m <- seq_along(v)
plot(m, n, type = "b",
  xlab = "Number of regressors", ylab = "Log-likelihood")
plot(m, n + log(nrow(fe_data$x)) * (m + 1), type = "b",
  xlab = "Number of regressors", ylab = "BIC")

So this would indeed pick the model with the three real regressors as the best BIC model.
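To extract that model programmatically rather than reading it off the plot, you can reuse the objects from above:

bic <- n + log(nrow(fe_data$x)) * (m + 1)
## regressors included in the BIC-minimizing model along the forward path
colnames(fe_data$x)[v[1:which.min(bic)]]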

Upvotes: 1

Adam Quek

Reputation: 7153

Here's an alternative solution without using betareg. The output is similar and may be worth considering for your problem.

Here's the dataset I used:

set.seed(12345)
dat <- data.frame(y=runif(50), x_1=runif(50), x_2=runif(50), x_3=runif(50))

Using the leaps library to create a list of all possible combinations:

library(leaps)
subs <- regsubsets(y ~ ., data = dat, nbest = 10, nvmax = 100, really.big = TRUE)
subs <- summary(subs)$which[, -1]
all.mods <- lapply(1:nrow(subs), function(x) paste("y", paste(names(which(subs[x, ])), collapse = "+"), sep = "~"))

all.mods

[[1]]
[1] "y~x_2"

[[2]]
[1] "y~x_1"

[[3]]
[1] "y~x_3"

[[4]]
[1] "y~x_2+x_3"

[[5]]
[1] "y~x_1+x_2"

[[6]]
[1] "y~x_1+x_3"

[[7]]
[1] "y~x_1+x_2+x_3"

Running a linear regression for each model:

all.lm <- lapply(all.mods, function(x) lm(as.formula(x), data = dat))

Check the log-likelihood of each model:

lapply(all.lm, logLik)

[[1]]
'log Lik.' -7.051835 (df=3)

[[2]]
'log Lik.' -9.288776 (df=3)

[[3]]
'log Lik.' -9.334048 (df=3)

[[4]]
'log Lik.' -6.904604 (df=4)

[[5]]
'log Lik.' -7.051584 (df=4)

[[6]]
'log Lik.' -9.215915 (df=4)

[[7]]
'log Lik.' -6.888849 (df=5)
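If you do want beta regressions after all, the same list of candidate formulas can be fed to betareg (the runif response here already lies strictly between 0 and 1, as betareg requires):

library(betareg)
all.br <- lapply(all.mods, function(x) betareg(as.formula(x), data = dat))
sapply(all.br, logLik)   ## log-likelihood of each candidate beta regression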

Upvotes: 0
