mkreisel
mkreisel

Reputation: 205

bestglm R package error using Poisson regression

I'm trying to use the bestglm package in R to fit a Poisson regression model. Here is my code:

bestmodel <- bestglm(Xy, family = poisson, IC ="BIC")

and the output is

Morgan-Tatar search since family is non-gaussian. Error in dimnames(Bestq) <- list(c("BICq1", "BICq2"), c("q1", "q2", "selected k")) : length of 'dimnames' [2] not equal to array extent In addition: There were 50 or more warnings (use warnings() to see the first 50)

Theres nothing special about the data frame Xy, except that it does not strictly contain integers, which is causing the final collection of warnings. Notably, the code

bestmodel <- bestglm(Xy, IC ="BIC")

runs just fine.

Is this a bug, or is there something I can do to get it to run?

Upvotes: 1

Views: 2005

Answers (1)

Ben Bolker
Ben Bolker

Reputation: 226861

tl;dr For technical reasons, you're going to have trouble doing likelihood-based or IC-based model selection for Poisson GLMs with non-integer data (the likelihoods aren't necessarily well-defined). Working around the problem via quasi-likelihoods is possible: I have a hack at a solution below, but if you value your conclusions you should definitely double-check that the quasi-AIC formula is correct!

I think non-integer values are indeed the problem. I can reproduce it as follows:

library("bestglm")

set.seed(101)
## y must be **LAST** column (not documented clearly!)
Xy <- data.frame(
    x1=rnorm(100),x2=rnorm(100),
    y=rpois(100,2))

Make the response variable non-integer:

Xy2 <- transform(Xy,y=y+0.5)

These all work:

bestglm(Xy)
bestglm(Xy,IC="BIC")
bestglm(Xy,IC="BIC", family=poisson)

This fails:

bestglm(Xy2,IC="BIC", family=poisson)
##  Error in dimnames(Bestq) <- 
## list(c("BICq1", "BICq2"), c("q1", "q2", "selected k")) : 
##    length of 'dimnames' [2] not equal to array extent

I'm not sure whether this should count as a bug, since Poisson GLMs aren't really intended to handle non-integer data (but I can appreciate that there are sometimes reasons to do so). I think the proximal problem might be that the BIC (and other information criteria based on the log-likelihood) is infinite, because a non-integer value is technically impossible under a Poisson model, so the likelihood is zero ...

BIC(glm(y~1,data=Xy2,family=poisson))
## Inf

In principle you could get around this by computing a quasi-likelihood/quasi-AIC, but R's built-in quasipoisson family was written by purists who don't believe that such a thing makes sense:

BIC(glm(y~1,data=Xy2,family=quasipoisson))
## NA

I think the best solution would be to write your own my_quasipoisson family that replaces

quasipoisson()$aic
function (y, n, mu, wt, dev) 
NA

with a sensible quasi-likelihood.

update: here's a workaround. It's a hack at a sensible quasi-AIC expression (which is also used by BIC(), I think). It comes with even less warranty than usual.

my_QP <- function(link="log") {
    QP <- quasipoisson(link=link)
    QP$aic <- function (y, n, mu, wt, dev) {
        nobs <- length(y)
        disp <- dev/n
        dev0 <- -2*sum((-mu + y*log(mu) -lfactorial(y))*wt/disp)
        dev0 + 2  ## extra penalty parameter
    }
    QP
}

bestglm(Xy2,family=my_QP)

It might make sense to cross-check to see what packages like MuMIn and AICcmodavg (which as I recall allow quasi-AIC-based model averaging and selection) do ...

Upvotes: 2

Related Questions