Reputation: 385
I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the use of the ziGamma
option in the glmmTMB
package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
df <- df %>% predict(zigamma_model, new data = df, type = "response") # doesn't work
# "no applicable method for "predict" applied to an object of class "data.frame"
The coefficients from the gamma component of my hurdle model and the fixed effects components of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients on the zero-inflated model are different, and I also noticed that the z values in the binomial component are the negative inverse of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
Upvotes: 7
Views: 8694
Reputation: 226761
The glmmTMB
package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM
as well?
To answer the questions about coefficients and standard errors:
broom.mixed::tidy
,round(1-abs(tidy(hurdle_g,component="zi")$statistic)/
abs(tidy(hurdle_binomial)$statistic),3)
## [1] 0.057 0.001 0.000 0.000 0.295
6% for the intercept, up to 30% for the effect of age ...
cost>0
) component is definitely puzzling me; it holds up if we simply implement the Gamma/log-link in glmmTMB vs glm. It's hard to know how to check which is right/what the gold standard should be for this case. I might distrust Wald p-values in this case and try to get p-values with the likelihood ratio test instead (via drop1
).In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?
More recent versions of glmmTMB
can also use a Tweedie distribution for the conditional distribution of the response; this distribution allows for a combination of zeros and continuous positive values.
Upvotes: 6