joemienko
joemienko

Reputation: 2290

Can I pool imputed random effect model estimates using the mi package?

It appears that the mi package has had a pretty big rewrite at some point within the past couple of years.

The "old" way of doing things is well-outlined in the following tutorial: http://thomasleeper.com/Rcourse/Tutorials/mi.html

The "new" way of doing things (sticking with Leeper's simulation demo) looks something like this:

#load mi
library(mi)
#set seed
set.seed(10)
#simulate some data (with some observations missing)
x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
y <- 2*x1 + 20*x2 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, y)
mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA

# Convert to a missing_data.frame
mydf_mdf <- missing_data.frame(mydf)

# impute
mydf_imp <- mi(mydf_mdf)

Although function names have changed, this is actually pretty similar to the "old" way of doing things.

The biggest change (from my vantage) is the replacement of the following "old" functions

lm.mi(formula, mi.object, ...)

glm.mi(formula, mi.object, family = gaussian, ...)

bayesglm.mi(formula, mi.object, family = gaussian, ...)

polr.mi(formula, mi.object, ...)

bayespolr.mi(formula, mi.object, ...)

lmer.mi(formula, mi.object, rescale=FALSE, ...)

glmer.mi(formula, mi.object, family = gaussian, rescale=FALSE, ...).

Previously, a user could compute a model for each imputed dataset using one of these functions and then pool the results using mi.pooled() (or coef.mi() if we are following the Leeper example).

In the current version of mi (I have v1.0 installed), these last steps appear to have been combined into a single function, pool(). The pool() function appears to read the family and link function that was assigned to a variable during the imputation process above and then estimate a model with bayesglm using the specified formula as shown below.

# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2, mydf_imp))

## 
## Call:
## pool(formula = y ~ x1 + x2, data = mydf_imp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.98754  -0.40923   0.03393   0.46734   2.13848  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.34711    0.25979  -1.336    0.215    
## x1           2.07806    0.08738  23.783 1.46e-13 ***
## x2          19.90544    0.11068 179.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7896688)
## 
##     Null deviance: 38594.916  on 99  degrees of freedom
## Residual deviance:    76.598  on 97  degrees of freedom
## AIC: 264.74
## 
## Number of Fisher Scoring iterations: 7

This looks like we are coming close to recovering our simulated beta values (2 and 20). In other words, it is behaving as expected.

Let's take a slightly larger set of data with a naively simulated random effect just for the sake of getting a grouping variable.

mydf2 <- data.frame(x1 = rep(runif(100, 0, 5), 20)
                   ,x2 = rep(rnorm(100, 0, 2.5), 20)
                   ,group_var = rep(1:20, each = 100)
                   ,noise = rep(rnorm(100), 20))

mydf2$y <- 2*mydf2$x1 + 20*mydf2$x2 + mydf2$noise

mydf2$x1[sample(1:nrow(mydf2), 200, FALSE)] <- NA
mydf2$x2[sample(1:nrow(mydf2), 100, FALSE)] <- NA

# Convert to a missing_data.frame
mydf2_mdf <- missing_data.frame(mydf2)

show(mydf2_mdf)

## Object of class missing_data.frame with 2000 observations on 5 variables
## 
## There are 4 missing data patterns
## 
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
## 
##                 type missing method  model
## x1        continuous     200    ppd linear
## x2        continuous     100    ppd linear
## group_var continuous       0   <NA>   <NA>
## noise     continuous       0   <NA>   <NA>
## y         continuous       0   <NA>   <NA>
## 
##             family     link transformation
## x1        gaussian identity    standardize
## x2        gaussian identity    standardize
## group_var     <NA>     <NA>    standardize
## noise         <NA>     <NA>    standardize
## y             <NA>     <NA>    standardize

Since missing_data.frame() appears to be intepreting group_var as continuous, I use the change() function from mi to reassign to "un" for "unordered categorical" and then proceed as above.

mydf2_mdf <- change(mydf2_mdf, y = "group_var", what = "type", to = "un"  )

# impute
mydf2_imp <- mi(mydf2_mdf)

Now, unless version 1.0 of mi has removed the functionality of previous versions (i.e. functionality available with lmer.mi and glmer.mi), I would assume that the addition of a random effect in the formula should point pool() to the appropriate lme4 function. However, the initial error message suggests that this is not the case.

# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2 + (1|group_var), mydf2_imp))
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Error in if (prior.scale[j] < min.prior.scale) {: missing value where TRUE/FALSE needed

Following my warning message and extracting the integers out of my factor does get me an estimate, but the results suggest that pool() is still estimating a fixed-effect model with bayesglm and holding my attempted random-effect constant.

summary(pool(y ~ x1 + x2 + (1|as.numeric(as.character(group_var))), mydf2_imp))

## 
## Call:
## pool(formula = y ~ x1 + x2 + (1 | as.numeric(as.character(group_var))), 
##     data = mydf2_imp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93633  -0.69923   0.01073   0.56752   2.12167  
## 
## Coefficients:
##                                               Estimate Std. Error  t value
## (Intercept)                                  1.383e-01  2.596e+02    0.001
## x1                                           1.995e+00  1.463e-02  136.288
## x2                                           2.000e+01  8.004e-03 2499.077
## 1 | as.numeric(as.character(group_var))TRUE -3.105e-08  2.596e+02    0.000
##                                             Pr(>|t|)    
## (Intercept)                                        1    
## x1                                            <2e-16 ***
## x2                                            <2e-16 ***
## 1 | as.numeric(as.character(group_var))TRUE        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8586836)
## 
##     Null deviance: 5384205.2  on 1999  degrees of freedom
## Residual deviance:    1713.9  on 1996  degrees of freedom
## AIC: 5377
## 
## Number of Fisher Scoring iterations: 4

My questions are:

  1. Is it possible to easily generate pooled random effects estimates using mi?, and
  2. If yes, how?

Upvotes: 4

Views: 3514

Answers (2)

SimonG
SimonG

Reputation: 4881

Just to provide an alternative, there's a package that focuses quite a bit on MI for mixed-effects models as well as pooling the results obtained from it (mitml, find it here).

Using the package is pretty straightforward. It relies on the packages pan and jomo for imputation, but it can also handle input from other MI packages (?as.mitml.list).

Pooling estimates from a mixed-effects model is mostly automatized and included in the testEstimates function.

require(mitml)
require(lme4)

data(studentratings)

# impute example data using 'pan'
fml <- ReadDis + SES ~ ReadAchiev + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)

implist <- mitmlComplete(imp, print=1:5)

# fit model using lme4
fit.lmer <- with(implist, lmer(SES ~ (1|ID)))

# pool results using 'Rubin's rules'
testEstimates(fit.lmer, var.comp=TRUE)

Output:

# Call:

# testEstimates(model = fit.lmer, var.comp = TRUE)

# Final parameter estimates and inferences obtained from 5 imputed data sets.

#              Estimate Std.Error   t.value        df   p.value       RIV       FMI 
# (Intercept)    46.988     1.119    41.997   801.800     0.000     0.076     0.073 

#                         Estimate 
# Intercept~~Intercept|ID   38.272 
# Residual~~Residual       298.446 
# ICC|ID                     0.114 

# Unadjusted hypothesis test as appropriate in larger samples. 

Upvotes: 6

Ben Goodrich
Ben Goodrich

Reputation: 4990

You can specify the FUN argument to the pool() function to change the estimator. In your case, it would be summary(pool(y ~ x1 + x2 + (1|as.numeric(as.character(group_var))), data = mydf2_imp, FUN = lmer)). That may or may not actually work, but it is legal syntax. If that fails, then you can use the complete function to created completed data.frames, call lmer on each, and average the results yourself, which would be something like dfs <- complete(mydf2_imp) estimates <- lapply(dfs, FUN = lme4, formula = y ~ x1 + x2 + (1|as.numeric(as.character(group_var)))) rowMeans(sapply(estimates, FUN = fixef))

Upvotes: 1

Related Questions