Reputation: 67
I have used mice/miceadds to carry out multiple imputation. I am interested in getting a number of descriptive stats on a "pooled dataset".
Question: 1) I want to know the % of values that are above a specific value in the imputed variable. For example, how many cases have values above 5 (on a scale of 0-10), when all of the imputed datasets are aggregated. Is this feasible with MI data?
2) If #1 is not possible, is there a close alternative?
Upvotes: 0
Views: 2195
Reputation: 102
Aggregating multiply imputed datasets is never feasible.
Instead, the estimates (e.g. regression coefficients) are to be aggregated. In your case, the estimand of interest is a proportion, which may be averaged across imputations or reported for each imputation separately.
To compute a proportion for each imputation, see the reprex below. To aggregate over these estimates, use a simple mean() call on the prop column.
# setup env
library(dplyr, warn.conflicts = FALSE)
library(mice, warn.conflicts = FALSE)
# impute missing data
imp <- mice(nhanes, print = FALSE)
# return completed data
# split by imputation
# and compute proportion of rows where BMI exceeds 25
complete(imp, "long") %>%
  group_by(.imp) %>%
  summarize(
    prop = mean(bmi > 25)
  )
#> # A tibble: 5 × 2
#>    .imp  prop
#>   <int> <dbl>
#> 1     1  0.6
#> 2     2  0.68
#> 3     3  0.72
#> 4     4  0.64
#> 5     5  0.56
Created on 2023-08-22 with reprex v2.0.2
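As a minimal sketch of the averaging step described above, the per-imputation proportions can be collapsed into one pooled estimate. The seed value and the object names (long, props, pooled_prop) are assumptions for illustration and are not part of the reprex.

```r
library(mice)

# Assumption: seed = 123 is arbitrary and only makes the run repeatable
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# One proportion per completed dataset (bmi > 25, as in the reprex)
long  <- complete(imp, "long")
props <- tapply(long$bmi > 25, long$.imp, mean)

# Pooled point estimate: plain average of the per-imputation proportions
pooled_prop <- mean(props)
pooled_prop
```

This gives a point estimate only; it says nothing about the uncertainty of the pooled proportion.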
Upvotes: 0
Reputation: 33
Adding to Niek's answer I would do:
impL <- complete(imp, "long")
library(stargazer)
stargazer(impL)
This gives you the mean, standard deviation (across all the datasets), min and max. I am not completely sure how to calculate the std. dev. properly.
Upvotes: 0
Reputation: 1624
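Another simple way would be to create a 'long format' complete dataset and compute the mean or proportion over all imputed datasets at once. Under Rubin's rules the pooled point estimate is the average of the per-imputation estimates, and because every completed dataset has the same number of rows, the mean or proportion of the stacked data equals that average. This equality does not hold for the median in general, so to apply the same rule to a median, compute it within each imputation and average the results. The only downside is that you will not get an estimate of the standard error of these statistics.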
impL <- complete(imp, "long", include = FALSE) # long format without the original dataset
mean(impL$x) # mean of variable x over all datasets
sum(impL$y > 5) / length(impL$y) # proportion of variable y higher than 5 over all datasets
Note that if you want an estimate of the frequency (i.e. number of cases) instead of a proportion, you will need to divide by the number of imputed datasets (e.g. 5):
sum(impL$y > 5)/5
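If a standard error is wanted, one option is to apply Rubin's rules by hand to the per-imputation proportions. The following is a rough sketch, not a definitive method: it uses bmi > 25 on the nhanes data instead of the placeholder variables x and y, an arbitrary seed, and the simple binomial variance p(1 - p)/n as the within-imputation variance, all of which are assumptions made for illustration.

```r
library(mice)

# Assumption: nhanes and seed = 123 stand in for your own data and settings
imp  <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
impL <- complete(imp, "long", include = FALSE)

m <- imp$m            # number of imputed datasets
n <- nrow(imp$data)   # rows in each completed dataset

# Per-imputation proportion and its (binomial) sampling variance
q <- as.numeric(tapply(impL$bmi > 25, impL$.imp, mean))
u <- q * (1 - q) / n

q_bar <- mean(q)                 # pooled proportion
w     <- mean(u)                 # within-imputation variance
b     <- var(q)                  # between-imputation variance
t_var <- w + (1 + 1 / m) * b     # total variance under Rubin's rules
se    <- sqrt(t_var)

c(estimate = q_bar, se = se)
```

Pooling a proportion on its raw scale is the simplest choice; with proportions near 0 or 1, pooling on a transformed scale (e.g. logit) may behave better.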
Upvotes: 1
Reputation: 7730
What you probably did is something similar to this:
# create imputed datasets
library(mice)
imp <- mice(nhanes, m = 5)
# perform lm on all imputed datasets
fit <- with(data = imp, expr = lm(bmi ~ hyp + chl))
# pool results
summary(pool(fit))
So you have your pooled results of the lm model. I guess you want to know what the imputed data that went into the model looked like.
The imputed data is stored in the 'imp' object. With imp$imp you get the values that were imputed for each of the m imputations. You can then perform the analysis you need on them.
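For illustration, a short sketch of inspecting the imputed values for a single variable. The seed and the name bmi_imputed are assumptions; bmi and the threshold of 25 are just examples from the nhanes data.

```r
library(mice)

# Assumption: seed = 123 is arbitrary, chosen only for repeatability
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Imputed values for bmi: one row per originally missing case,
# one column per imputation
bmi_imputed <- imp$imp$bmi
head(bmi_imputed)

# Share of imputed bmi values above 25, per imputation
colMeans(bmi_imputed > 25)
```

Note that this covers only the cells that were missing, not the observed values.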
If you need completed data sets (and not only the imputed values), you would run
complete(imp, action = "all")
or, if you only want a specific completed dataset m:
complete(imp, action = 2)
E.g. you could then type
summary(complete(imp, action = 2))
to get some summary statistics about the second imputed dataset.
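As a sketch of how this could answer the original question, the list returned by action = "all" can be looped over to get one proportion per completed dataset. The seed, the names completed and per_dataset, and the bmi > 25 threshold are assumptions for illustration.

```r
library(mice)

# Assumption: seed = 123 is arbitrary, chosen only for repeatability
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# List with one completed data frame per imputation
completed <- complete(imp, action = "all")

# Proportion of rows with bmi > 25 in each completed dataset
per_dataset <- sapply(completed, function(d) mean(d$bmi > 25))
per_dataset

# Average across imputations
mean(per_dataset)
```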
Upvotes: 0