It appears that the mi package has had a pretty big rewrite at some point within the past couple of years.
The "old" way of doing things is well-outlined in the following tutorial: http://thomasleeper.com/Rcourse/Tutorials/mi.html
The "new" way of doing things (sticking with Leeper's simulation demo) looks something like this:
#load mi
library(mi)
#set seed
set.seed(10)
#simulate some data (with some observations missing)
x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
y <- 2*x1 + 20*x2 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, y)
mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA
# Convert to a missing_data.frame
mydf_mdf <- missing_data.frame(mydf)
# impute
mydf_imp <- mi(mydf_mdf)
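Before pooling, it can be worth a quick look at the imputation object itself. A minimal sketch (my addition, not part of the original walkthrough), assuming the show() and Rhats() diagnostics of the rewritten mi package:

# inspect the imputation object (sketch; assumes Rhats() from mi >= 1.0)
show(mydf_imp)   # summary of the imputation run
Rhats(mydf_imp)  # R-hat convergence diagnostics for the imputation chains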
Although function names have changed, this is actually pretty similar to the "old" way of doing things.
The biggest change (from my vantage) is the replacement of the following "old" functions:
lm.mi(formula, mi.object, ...)
glm.mi(formula, mi.object, family = gaussian, ...)
bayesglm.mi(formula, mi.object, family = gaussian, ...)
polr.mi(formula, mi.object, ...)
bayespolr.mi(formula, mi.object, ...)
lmer.mi(formula, mi.object, rescale=FALSE, ...)
glmer.mi(formula, mi.object, family = gaussian, rescale=FALSE, ...)
Previously, a user could compute a model for each imputed dataset using one of these functions and then pool the results using mi.pooled() (or coef.mi() if we are following the Leeper example).
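For reference, a rough sketch of that legacy workflow, reconstructed from the signatures above and Leeper's tutorial (it will not run against the rewritten package, so it is shown as comments):

# legacy workflow (pre-rewrite mi; sketch only, will not run under mi >= 1.0)
# fit <- lm.mi(y ~ x1 + x2, imp)  # one lm per imputed dataset, imp from old mi()
# coef.mi(fit)                    # per-dataset coefficients (Leeper's example)
# mi.pooled(...)                  # pooled estimates across imputations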
In the current version of mi (I have v1.0 installed), these last steps appear to have been combined into a single function, pool(). The pool() function appears to read the family and link function that was assigned to each variable during the imputation process above and then estimates a model with bayesglm using the specified formula, as shown below.
# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2, mydf_imp))
##
## Call:
## pool(formula = y ~ x1 + x2, data = mydf_imp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.98754 -0.40923 0.03393 0.46734 2.13848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34711 0.25979 -1.336 0.215
## x1 2.07806 0.08738 23.783 1.46e-13 ***
## x2 19.90544 0.11068 179.844 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7896688)
##
## Null deviance: 38594.916 on 99 degrees of freedom
## Residual deviance: 76.598 on 97 degrees of freedom
## AIC: 264.74
##
## Number of Fisher Scoring iterations: 7
This looks like we are coming close to recovering our simulated beta values (2 and 20). In other words, it is behaving as expected.
Let's take a slightly larger set of data with a naively simulated random effect just for the sake of getting a grouping variable.
mydf2 <- data.frame(x1 = rep(runif(100, 0, 5), 20)
,x2 = rep(rnorm(100, 0, 2.5), 20)
,group_var = rep(1:20, each = 100)
,noise = rep(rnorm(100), 20))
mydf2$y <- 2*mydf2$x1 + 20*mydf2$x2 + mydf2$noise
mydf2$x1[sample(1:nrow(mydf2), 200, FALSE)] <- NA
mydf2$x2[sample(1:nrow(mydf2), 100, FALSE)] <- NA
# Convert to a missing_data.frame
mydf2_mdf <- missing_data.frame(mydf2)
show(mydf2_mdf)
## Object of class missing_data.frame with 2000 observations on 5 variables
##
## There are 4 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## x1 continuous 200 ppd linear
## x2 continuous 100 ppd linear
## group_var continuous 0 <NA> <NA>
## noise continuous 0 <NA> <NA>
## y continuous 0 <NA> <NA>
##
## family link transformation
## x1 gaussian identity standardize
## x2 gaussian identity standardize
## group_var <NA> <NA> standardize
## noise <NA> <NA> standardize
## y <NA> <NA> standardize
Since missing_data.frame() appears to be interpreting group_var as continuous, I use the change() function from mi to reassign its type to "un" (unordered categorical) and then proceed as above.
mydf2_mdf <- change(mydf2_mdf, y = "group_var", what = "type", to = "un" )
# impute
mydf2_imp <- mi(mydf2_mdf)
Now, unless version 1.0 of mi has removed the functionality of previous versions (i.e., the functionality available with lmer.mi and glmer.mi), I would assume that the addition of a random effect in the formula should point pool() to the appropriate lme4 function. However, the initial error message suggests that this is not the case.
# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2 + (1|group_var), mydf2_imp))
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Error in if (prior.scale[j] < min.prior.scale) {: missing value where TRUE/FALSE needed
Following my warning message and extracting the integers out of my factor does get me an estimate, but the results suggest that pool() is still estimating a fixed-effects model with bayesglm and holding my attempted random effect constant.
summary(pool(y ~ x1 + x2 + (1|as.numeric(as.character(group_var))), mydf2_imp))
##
## Call:
## pool(formula = y ~ x1 + x2 + (1 | as.numeric(as.character(group_var))),
## data = mydf2_imp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.93633 -0.69923 0.01073 0.56752 2.12167
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.383e-01 2.596e+02 0.001
## x1 1.995e+00 1.463e-02 136.288
## x2 2.000e+01 8.004e-03 2499.077
## 1 | as.numeric(as.character(group_var))TRUE -3.105e-08 2.596e+02 0.000
## Pr(>|t|)
## (Intercept) 1
## x1 <2e-16 ***
## x2 <2e-16 ***
## 1 | as.numeric(as.character(group_var))TRUE 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8586836)
##
## Null deviance: 5384205.2 on 1999 degrees of freedom
## Residual deviance: 1713.9 on 1996 degrees of freedom
## AIC: 5377
##
## Number of Fisher Scoring iterations: 4
My questions are: has this mixed-effects functionality (what lmer.mi and glmer.mi used to provide) been removed from the current version of mi, and if so, is there a recommended way to estimate and pool mixed-effects models on data imputed with mi?
Just to provide an alternative: there's a package, mitml (available on CRAN), that focuses quite a bit on MI for mixed-effects models as well as on pooling the results obtained from them.
Using the package is pretty straightforward. It relies on the packages pan and jomo for imputation, but it can also handle input from other MI packages (see ?as.mitml.list).
Pooling estimates from a mixed-effects model is mostly automated and is handled by the testEstimates function.
require(mitml)
require(lme4)
data(studentratings)
# impute example data using 'pan'
fml <- ReadDis + SES ~ ReadAchiev + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)
implist <- mitmlComplete(imp, print=1:5)
# fit model using lme4
fit.lmer <- with(implist, lmer(SES ~ (1|ID)))
# pool results using 'Rubin's rules'
testEstimates(fit.lmer, var.comp=TRUE)
Output:
# Call:
# testEstimates(model = fit.lmer, var.comp = TRUE)
# Final parameter estimates and inferences obtained from 5 imputed data sets.
# Estimate Std.Error t.value df p.value RIV FMI
# (Intercept) 46.988 1.119 41.997 801.800 0.000 0.076 0.073
# Estimate
# Intercept~~Intercept|ID 38.272
# Residual~~Residual 298.446
# ICC|ID 0.114
# Unadjusted hypothesis test as appropriate in larger samples.
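For data already imputed with mi, as in the question, the completed data sets can be handed to mitml in the same way. A sketch, assuming complete() from mi 1.0 returns a list of completed data frames (object names taken from the question's example):

# sketch: pool lmer fits on the question's mi-imputed data via mitml
library(mi)
library(mitml)
library(lme4)
implist <- as.mitml.list(complete(mydf2_imp, m = 5))
fit <- with(implist, lmer(y ~ x1 + x2 + (1|group_var)))
testEstimates(fit, var.comp = TRUE)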
You can specify the FUN argument to the pool() function to change the estimator. In your case, it would be summary(pool(y ~ x1 + x2 + (1|as.numeric(as.character(group_var))), data = mydf2_imp, FUN = lmer)). That may or may not actually work, but it is legal syntax. If that fails, then you can use the complete function to create completed data.frames, call lmer on each, and average the results yourself, which would be something like
dfs <- complete(mydf2_imp)
# fit lmer to each completed data frame (each frame is passed as lmer's data argument)
estimates <- lapply(dfs, FUN = lmer, formula = y ~ x1 + x2 +
(1|as.numeric(as.character(group_var))))
# average the fixed-effect estimates across imputations
rowMeans(sapply(estimates, FUN = fixef))
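Note that rowMeans pools only the point estimates. A sketch (my addition, not part of the original answer) of also pooling the standard errors with Rubin's rules, where the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance:

# pool point estimates and standard errors by Rubin's rules (sketch)
m <- length(estimates)
coefs <- sapply(estimates, FUN = fixef)  # p x m matrix of estimates
ses <- sapply(estimates, function(f) sqrt(diag(as.matrix(vcov(f)))))
qbar <- rowMeans(coefs)                  # pooled point estimates
W <- rowMeans(ses^2)                     # within-imputation variance
B <- apply(coefs, 1, var)                # between-imputation variance
cbind(estimate = qbar, se = sqrt(W + (1 + 1/m) * B))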