tnoth
tnoth

Reputation: 11

Problem with pooling estimates from multiple imputed datasets using MICE in R- Zero-Inflated Poisson

I have been attempting to run a zero-inflated poisson regression on a dataframe that I have used mice() to impute missing data. My code successfully runs the multiple imputations and pools the results. However, when I attempt to summarize the pooled estimates I am not able to get the full results for the model. The zero-inflated poisson model (zeroinfl()) has two components: one for the count portion and one for excessive zeros in my data. I am only able to display a portion of the pooled model.


library(dplyr)
library(mice)
library(pscl)
library(poissonreg)
library(countimp)

# Set the seed for reproducibility
set.seed(123)

# Simulate data with one count outcome and three variables
n <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
x3 <- rpois(n, 2)
y <- rpois(n, 1 + exp(0.5 * x1 + 0.8 * x2 + 0.3 * x3))

# Introduce missing data to the three variables
prop_missing <- 0.2
missing_x1 <- sample(c(TRUE, FALSE), size = n, 
               prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
missing_x2 <- sample(c(TRUE, FALSE), size = n, 
               prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
missing_x3 <- sample(c(TRUE, FALSE), size = n, 
               prob = c(prop_missing, 1 - prop_missing), replace = TRUE)
x1[missing_x1] <- NA
x2[missing_x2] <- NA
x3[missing_x3] <- NA

# Create a data frame with the simulated data
dat <- data.frame(y, x1, x2, x3)

#run intital imputation
ini <- mice( dat, m = 5, maxit = 0)
pred <- ini$predictorMatrix #set predictive matrix
pred[1, ] <- c(0, 2, 2, 3) #edit predictive matrix

imp.zip <- mice(dat, m = 5, maxit = 5, method = c("", "pmm", "pmm", "zip"), 
                pred , seed = 1234, print = T) 
  # run imputation with pred and specify methods

res.zinb <- with(imp.zip, zeroinfl( y ~ x1 + x2 | x3, dist = "poisson", 
                 link = "logit" ) )  
  # run the zeroinflated poisson regression on the imputed data
summary(pool(res.zinb)) #summarize and pool

Upvotes: 1

Views: 730

Answers (1)

mgyliu
mgyliu

Reputation: 11

The issue

So I think the issue has to do with how mice::pool() is implemented. From what I can tell, it does the following:

  1. Calls an internal function called pool.fitlist. (github source)
  2. pool.fitlist receives an object of class mira, and calls summary on it. (github source)
  3. pool.fitlist computes pooled estimates. (github source)

Then when you call summary(pool(res.zinb)), it's calling summary.mipo (github source) since pool(res.zinb) is class mipo.

The summary function called in step (2) doesn't know how to show all components of the zeroinfl model, which is why the logit portion of the model isn't being displayed by summary(pool(...))

broom and broom.mixed also don't implement a tidy summary of zeroinfl models - you can check by loading the broom.mixed package and running broom.mixed::get_methods().

My solution: description

poissonreg::tidy() solves our step (2) issue:

> fitlist <- mice::getfit(res.zinb)
> poissonreg::tidy(fitlist[[1]], type="all")
# A tibble: 5 × 6
  term        type  estimate std.error statistic   p.value
  <chr>       <chr>    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) count    1.18     0.0256     46.2  0        
2 x1          count    0.393    0.0152     25.9  2.36e-148
3 x2          count    0.541    0.0309     17.5  7.77e- 69
4 (Intercept) zero    -2.95     0.505      -5.83 5.56e-  9
5 x3          zero    -1.04     0.482      -2.15 3.13e-  2

The solution I came up with basically does pool() manually with the following steps:

  1. Collect tidy tibbles for each zeroinfl model in the fitlist I defined above
  2. For each (term, type) group, call mice::pool.scalar to compute the pooled estimates. This does the same thing as in pool.fitlist but I think this method was provided for this particular use case (see pool Rdocumentation)
  3. Using the results from pool.scalar, compute the pooled estimate, standard error, statistic, and p-value according to how summary.mipo does it.

My solution: implementation

Below is the full implementation for the above 3 steps:

# Step 1
fitlist <- mice::getfit(res.zinb)
tidylist <- lapply(fitlist, function(fit) poissonreg::tidy(fit, type = "all"))
w <- bind_rows(tidylist)

# Step 2
# Convenience wrapper function around pool.scalar.
# pool.scalar also returns a "qhat" and "u" which are vectors, 
# and we don't need them. Those vectors mess up the format of
# the summary that we want to compute later.
wrap.pool.scalar <- function(estimates, variances, n, k) {
  pool_res <- pool.scalar(estimates, variances, n = n, k = k)

  return(as_tibble(list(
    qbar = pool_res$qbar, 
    ubar = pool_res$ubar, 
    b = pool_res$b, 
    t = pool_res$t, 
    df = pool_res$df, 
    r = pool_res$r, 
    fmi = pool_res$fmi)))
}

# For each (term,type) pair, compute pooled univariate estimates using 
# wrap.pool.scalar 
pooled <- w %>% group_by(term, type) %>% 
  # n is hard-coded here but you should probably replace it with 
  # your n from above.
  reframe(wrap.pool.scalar(estimate, std.error^2, n=1000, k=1)) %>% 
  mutate(estimate = qbar)
pooled

# Step 3
# Copy the pooled estimate calculations from
# https://github.com/amices/mice/blob/master/R/mipo.R#L69-L71
pooled_summary <- pooled %>% mutate(
  std.error = sqrt(t), 
  statistic = estimate / std.error,
  p.value = 2 * (pt(abs(statistic), pmax(df, 0.001), lower.tail = FALSE))) %>% 
  dplyr::select(term, type, estimate, std.error, statistic, df, p.value)
pooled_summary

Sanity check

We can check that for estimates that are provided by summary(pool(res.zinb)), our pooled_summary has the same values

> summary(pool(res.zinb))
         term  estimate  std.error statistic        df      p.value
1 (Intercept) 1.1867116 0.02927839  40.53200  68.26755 1.723856e-49
2          x1 0.3844206 0.01860580  20.66134  33.92570 8.515891e-21
3          x2 0.5229170 0.03402354  15.36927 126.41155 1.032634e-30
Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_summary
# A tibble: 5 × 7
  term        type  estimate std.error statistic    df  p.value
  <chr>       <chr>    <dbl>     <dbl>     <dbl> <dbl>    <dbl>
1 (Intercept) count    1.19     0.0293     40.5   62.6 1.20e-46
2 (Intercept) zero    -2.79     0.464      -6.02 525.  3.25e- 9
3 x1          count    0.384    0.0186     20.7   32.3 3.62e-20
4 x2          count    0.523    0.0340     15.4  110.  4.20e-29
5 x3          zero    -1.07     0.433      -2.47 594.  1.38e- 2

Upvotes: 1

Related Questions