Kellan Baker
Kellan Baker

Reputation: 385

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the use of the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.

library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)

### DATA

id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1), 
          sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)

df <- data.frame(cbind(id, disease, age, gender, cost, time))

# create binary variable for non-zero costs

df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))

### HURDLE MODEL (MY VERSION)

# gamma component

hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)), 
                    data = subset(df, cost > 0),
                    family = Gamma(link = "log"))

model_parameters(hurdle_gamma, exponentiate = T)

# binomial component

hurdle_binomial <-  glm(cost_binary ~ disease + gender + age + time, 
                        data = df, family = "binomial")

model_parameters(hurdle_binomial, exponentiate = T)

# predicted probability of use

df$prob_use <- predict(hurdle_binomial, type = "response")

# predicted mean cost for people with any cost

df_bin <- subset(df, cost_binary == 1)

df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")

# combine data frames

df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")

# replace NA with 0

df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)

# calculate predicted cost for everyone

df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)

# mean predicted cost

mean(df2$cost_pred)

### glmmTMB with ziGamma

zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
                         family = ziGamma(link = "log"),
                         ziformula = ~ disease + gender + age + time,
                         data = df)

model_parameters(zigamma_model, exponentiate = T)

df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"

The coefficients from the gamma component of my hurdle model and the fixed effects components of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients on the zero-inflated model are different, and I also noticed that the z values in the binomial component are the negative inverse of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?

In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?

Upvotes: 7

Views: 8694

Answers (1)

Ben Bolker
Ben Bolker

Reputation: 226761

The glmmTMB package can do this:

glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)

ought to do it. Maybe something in VGAM as well?


To answer the questions about coefficients and standard errors:

  • the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
  • The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
round(1-abs(tidy(hurdle_g,component="zi")$statistic)/
      abs(tidy(hurdle_binomial)$statistic),3)
## [1] 0.057 0.001 0.000 0.000 0.295

6% for the intercept, up to 30% for the effect of age ...

  • the nearly twofold difference in the standard errors of the conditional (cost>0) component is definitely puzzling me; it holds up if we simply implement the Gamma/log-link in glmmTMB vs glm. It's hard to know how to check which is right/what the gold standard should be for this case. I might distrust Wald p-values in this case and try to get p-values with the likelihood ratio test instead (via drop1).

In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?

More recent versions of glmmTMB can also use a Tweedie distribution for the conditional distribution of the response; this distribution allows for a combination of zeros and continuous positive values.

Upvotes: 6

Related Questions