Marie-Lu

Reputation: 41

Multiple imputation in R (mice) - How do I test imputation runs?

I work with a data set of 171 observations of 55 variables; 35 of those variables contain NAs that I want to impute with the mice function:

imp_Data <- mice(Data, m = 5, maxit = 50, method = 'pmm', seed = 500)
imp_Data$imp

Now that I have the 5 imputation runs, I don't know how to test them and decide which of the 5 imputations is the best one to use for my data set.

Researching this topic, I found script after script using the with() function with a linear model, followed by the pool() function:

fit <- with(imp_Data, lm(a ~ b + c + d + e))
combine <- pool(fit)

But I don't understand what this linear model is needed for, or how it helps me find the best imputation run.

Can someone please explain, in simple terms, how I can test the 5 imputations and decide which one to choose?

Thanks for helping!

Upvotes: 4

Views: 4113

Answers (2)

Steffen Moritz

Reputation: 7730

mice is a multiple imputation package. Multiple imputation itself is not really an imputation algorithm; it is rather a framework for imputing data while also accounting for the uncertainty that comes along with the imputation.

If you just want one imputed dataset, you can use single imputation packages like VIM (e.g. the functions irmi() or kNN()). The packages imputeR and missForest are also good for single imputation. They output one single imputed dataset.
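
For example, a minimal single-imputation sketch (assuming your data frame is called Data, as in the question):

library(VIM)
Data_knn <- kNN(Data, k = 5, imp_var = FALSE)  # k-nearest-neighbour imputation

library(missForest)
Data_rf <- missForest(Data)$ximp               # random-forest imputation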

If you still want to use mice but just want to end up with 1 imputed dataset, you can take one of these 5 datasets.
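
mice's complete() function extracts a single completed dataset for you, e.g. the first of the five:

Data_completed <- complete(imp_Data, 1)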

There is a deeper reason why multiple imputation creates multiple imputed datasets. The idea behind this is that the imputation itself introduces bias: you cannot really claim that an NA value you impute is, e.g., exactly 5. The more correct answer, from a Bayesian point of view, would be that the missing value likely lies somewhere between 3 and 7. So if you just set it to 5, you introduce bias.

Multiple imputation solves this problem by sampling from different probability distributions, and in the end it comes up with multiple imputed datasets, which are basically all plausible solutions.
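
You can see this directly in your imp_Data object: for each variable with missing values, imp_Data$imp stores one column per imputed dataset, and the columns differ because each one is a separate draw (b here is just a placeholder variable name from your question):

head(imp_Data$imp$b)  # rows = the missing cases of b, columns = the 5 draws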

The main idea of multiple imputation is to take these five datasets, treat each one as a possible solution, and perform your analysis on each of them!

Afterwards your analysis results (and not the imputed datasets!) would be pooled together.

So the with() and pool() parts have nothing to do with creating one dataset; they are needed for combining the five analysis results back together.
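
Put together, the whole workflow looks like this (a sketch reusing the placeholder variable names a to e from your question):

library(mice)
imp_Data <- mice(Data, m = 5, maxit = 50, method = 'pmm', seed = 500)

fit    <- with(imp_Data, lm(a ~ b + c + d + e))  # run the analysis on each of the 5 datasets
pooled <- pool(fit)                              # combine the 5 results via Rubin's rules
summary(pooled)                                  # pooled estimates and standard errors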

The linear model is one form of analysis that a lot of people apply to their data (they want to analyze how some variables relate to a response variable). In order to get unbiased results, this analysis is done 5 times and the results are then combined.

So if you don't want to use a linear model anyway, you don't need this part, because it has to do with the analysis of the data and not with the imputation.

Upvotes: 4

Closed Limelike Curves

Reputation: 190

Now that I have the 5 imputation runs, I don't know how to test them and decide which of the 5 imputations is the best one to use for my data set.

The answer is that you can't: if you try, everything will explode (i.e. your estimates won't converge to the correct values) and you will get bad answers.

The reason you need multiple imputation (the reason you can't just pick a single value) is that multiple imputation basically simulates random data points to create several plausible versions of your data. Then, by looking at all of the simulations together, you get an idea of what the distribution of the missing data looks like.

Taking the average simulated value and plugging it into your regression is not the same as running the regression separately on each simulation. When you average the imputed datasets, you no longer model the uncertainty inherent in the imputation process. If you average enough imputed datasets, that uncertainty goes to 0; even averaging just 5 of them is enough to reduce the uncertainty by ~55%.
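
A quick check of that figure: averaging m independent imputations shrinks the between-imputation standard deviation by a factor of 1/sqrt(m).

m <- 5
1 - 1 / sqrt(m)  # ~0.553, i.e. roughly 55% of the between-imputation SD disappears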

The value of your missing variable is not the imputed value. The value is unknown, so we have to think about this in terms of probabilities--e.g. there's a 10% chance the real value is less than y_1, a 20% chance it's less than y_2, etc. Multiple imputation accounts for this uncertainty.

On the other hand, using a single imputation (what you get when you average several answers) is basically pretending that you know the real value. If you impute a single value y_1, that's like saying "the value is y_1"--not "the value could be y_1."

Asymptotically, multiple imputation gives correct answers. The procedure of averaging several imputations, or just using a single imputation, doesn't.
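
A minimal simulation sketch of this point (the data and variable names here are made up purely for illustration):

library(mice)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
x[sample(n, 80)] <- NA               # make 40% of x missing at random
d <- data.frame(x, y)

imp <- mice(d, m = 5, method = "pmm", printFlag = FALSE)

# Proper multiple imputation: fit on each completed dataset, then pool
pooled <- pool(with(imp, lm(y ~ x)))
summary(pooled)                      # honest (larger) standard errors

# Improper 'averaged' single imputation: collapse the 5 datasets into one
avg <- Reduce(`+`, lapply(1:5, function(i) complete(imp, i))) / 5
summary(lm(y ~ x, data = avg))       # standard errors that are too small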

Upvotes: 0
