Reputation: 11
I have been attempting to run a zero-inflated poisson regression on a dataframe that I have used mice() to impute missing data. My code successfully runs the multiple imputations and pools the results. However, when I attempt to summarize the pooled estimates I am not able to get the full results for the model. The zero-inflated poisson model (zeroinfl()
) has two components: one for the count portion and one for excessive zeros in my data. I am only able to display a portion of the pooled model.
library(dplyr)
library(mice)
library(pscl)
library(poissonreg)
library(countimp)
# Set the seed for reproducibility
set.seed(123)
# Simulate data with one count outcome and three variables
n <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
x3 <- rpois(n, 2)
y <- rpois(n, 1 + exp(0.5 * x1 + 0.8 * x2 + 0.3 * x3))
# Introduce missing data to the three variables
prop_missing <- 0.2
missing_x1 <- sample(c(TRUE, FALSE), size = n,
prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
missing_x2 <- sample(c(TRUE, FALSE), size = n,
prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
missing_x3 <- sample(c(TRUE, FALSE), size = n,
prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
x1[missing_x1] <- NA
x2[missing_x2] <- NA
x3[missing_x3] <- NA
# Create a data frame with the simulated data
dat <- data.frame(y, x1, x2, x3)
#run intital imputation
ini <- mice( dat, m = 5, maxit = 0)
pred <- ini$predictorMatrix #set predictive matrix
pred[1, ] <- c(0, 2, 2, 3) #edit predictive matrix
imp.zip <- mice(dat, m = 5, maxit = 5, method = c("", "pmm", "pmm", "zip"),
pred , seed = 1234, print = T)
# run imputation with pred and specify methods
res.zinb <- with(imp.zip, zeroinfl( y ~ x1 + x2 | x3, dist = "poisson",
link = "logit" ) )
# run the zeroinflated poisson regression on the imputed data
summary(pool(res.zinb)) #summarize and pool
Upvotes: 1
Views: 730
Reputation: 11
The issue
So I think the issue has to do with how mice::pool()
is implemented. From what I can tell, it does the following:
pool.fitlist
. (github source)pool.fitlist
receives an object of class mira, and calls summary
on it. (github source)pool.fitlist
computes pooled estimates. (github source)Then when you call summary(pool(res.zinb))
, it's calling summary.mipo
(github source) since pool(res.zinb)
is class mipo
.
The summary
function called in step (2) doesn't know how to show all components of the zeroinfl
model, which is why the logit portion of the model isn't being displayed by summary(pool(...))
broom
and broom.mixed
also don't implement a tidy summary of zeroinfl
models - you can check by loading the broom.mixed
package and running broom.mixed::get_methods()
.
My solution: description
poissonreg::tidy()
solves our step (2) issue:
> fitlist <- mice::getfit(res.zinb)
> poissonreg::tidy(fitlist[[1]], type="all")
# A tibble: 5 × 6
term type estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) count 1.18 0.0256 46.2 0
2 x1 count 0.393 0.0152 25.9 2.36e-148
3 x2 count 0.541 0.0309 17.5 7.77e- 69
4 (Intercept) zero -2.95 0.505 -5.83 5.56e- 9
5 x3 zero -1.04 0.482 -2.15 3.13e- 2
The solution I came up with basically does pool()
manually with the following steps:
zeroinfl
model in the fitlist
I defined abovemice::pool.scalar
to compute the pooled estimates. This does the same thing as in pool.fitlist
but I think this method was provided for this particular use case (see pool Rdocumentation)pool.scalar
, compute the pooled estimate, standard error, statistic, and p-value according to how summary.mipo
does it.My solution: implementation
Below is the full implementation for the above 3 steps:
# Step 1
fitlist <- mice::getfit(res.zinb)
tidylist <- lapply(fitlist, function(fit) poissonreg::tidy(fit, type = "all"))
w <- bind_rows(tidylist)
# Step 2
# Convenience wrapper function around pool.scalar.
# pool.scalar also returns a "qhat" and "u" which are vectors,
# and we don't need them. Those vectors mess up the format of
# the summary that we want to compute later.
wrap.pool.scalar <- function(estimates, variances, n, k) {
pool_res <- pool.scalar(estimates, variances, n = n, k = k)
return(as_tibble(list(
qbar = pool_res$qbar,
ubar = pool_res$ubar,
b = pool_res$b,
t = pool_res$t,
df = pool_res$df,
r = pool_res$r,
fmi = pool_res$fmi)))
}
# For each (term,type) pair, compute pooled univariate estimates using
# wrap.pool.scalar
pooled <- w %>% group_by(term, type) %>%
# n is hard-coded here but you should probably replace it with
# your n from above.
reframe(wrap.pool.scalar(estimate, std.error^2, n=1000, k=1)) %>%
mutate(estimate = qbar)
pooled
# Step 3
# Copy the pooled estimate calculations from
# https://github.com/amices/mice/blob/master/R/mipo.R#L69-L71
pooled_summary <- pooled %>% mutate(
std.error = sqrt(t),
statistic = estimate / std.error,
p.value = 2 * (pt(abs(statistic), pmax(df, 0.001), lower.tail = FALSE))) %>%
dplyr::select(term, type, estimate, std.error, statistic, df, p.value)
pooled_summary
Sanity check
We can check that for estimates that are provided by summary(pool(res.zinb))
, our pooled_summary
has the same values
> summary(pool(res.zinb))
term estimate std.error statistic df p.value
1 (Intercept) 1.1867116 0.02927839 40.53200 68.26755 1.723856e-49
2 x1 0.3844206 0.01860580 20.66134 33.92570 8.515891e-21
3 x2 0.5229170 0.03402354 15.36927 126.41155 1.032634e-30
Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_summary
# A tibble: 5 × 7
term type estimate std.error statistic df p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) count 1.19 0.0293 40.5 62.6 1.20e-46
2 (Intercept) zero -2.79 0.464 -6.02 525. 3.25e- 9
3 x1 count 0.384 0.0186 20.7 32.3 3.62e-20
4 x2 count 0.523 0.0340 15.4 110. 4.20e-29
5 x3 zero -1.07 0.433 -2.47 594. 1.38e- 2
Upvotes: 1