Juan

Reputation: 311

Imputation methods in mice - correlation in data set. R

I'm struggling with an imputation using mice. The main objective is to impute NAs (if possible by group). As the sample is a bit too large to simply post here, it can be downloaded: https://drive.google.com/open?id=1InGJ_M7r5jwQZZRdXBO1MEbKB48gafbP

My questions are:

  1. How big of an issue is correlated data in general? What can I do to still impute the data? The data is part of an empirical research question and I don't yet know which variables to include, thus it'd be best to keep as many as possible for the time being.

  2. What methods would be more suitable than "cart" & "pmm" ? I'd like not to simply impute the mean/median....

  3. Can I somehow impute the data by "ID"?

  4. Tips for debugging?

Here is my code:

#Start
library(mice)
library(Hmisc)     # provides rcorr()
library(corrplot)  # provides corrplot()
# setwd(...)
test.df <- read.csv(...)
str(test.df)

Check for correlation. The first two columns contain the identifier and the year, so there is no need to include them.

test.df.rcorr<-rcorr(as.matrix(test.df[,-c(1:2)]))
test.df.coeff<-test.df.rcorr$r
corrplot(test.df.coeff)  # plot only; do not overwrite the coefficient matrix

As can be seen, some variables are strongly correlated. As a simple first test, I omit all columns with strong correlation.

#Simple example

test.df2<-test.df[,-c(4,7,10,11)]
test.df2
sum(is.na(test.df2))

Now, let's impute test.df2 without specifying the method:

imputation.df2<-mice(test.df2, m=1, seed=123456)
imputation.df2$method
test.df2.imp<-mice::complete(imputation.df2)

Warning message:
Number of logged events: 1 


sum(is.na(test.df2.imp))

As can be seen, all the NAs are imputed, and the only method used is "pmm".

Using the full data set, I get the following error message almost immediately:

imputation.df<-mice(test.df,m=1,seed = 66666)

 iter imp variable
  1   1  x1Error in solve.default(xtx + diag(pen)) : 
  system is computationally singular: reciprocal condition number = 1.49712e-16

Is this merely due to the correlation in the data?

Finally, my code for imputation by ID, which runs a little longer before showing this error:

test123<- lapply(split(test.df, test.df$ID), function(x) mice::complete(mice(x, m = 1 ,seed = 987654)))
Error in edit.setup(data, setup, ...) : nothing left to impute
In addition: There were 19 warnings (use warnings() to see them)
Called from: edit.setup(data, setup, ...)

I know this is a long question, and I'm grateful for every little tip or hint!

Thanks a bunch!

Upvotes: 4

Views: 3036

Answers (1)

Niek

Reputation: 1624

I think the problem arises because you are dealing with longitudinal data and mice is treating the observations as independent. Longitudinal data is clustered by ID and one way to deal with this is by using a multilevel (i.e. mixed) model as your imputation model. mice has numerous options to deal with this kind of data, which you can specify in your predictor matrix and imputation method.

library(mice)
setwd("X:/My Downloads")

test.df <- read.csv("Impute.csv")

You need to specify that ID is your grouping or class variable. Unfortunately mice can only handle integer values for this variable, so you need to change it to an integer (you can always change this back after imputation).

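# Going through a factor also works when ID is read as character
# (the read.csv default since R 4.0), where as.integer() alone would give NA
test.df$ID <- as.integer(as.factor(test.df$ID))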
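To change the ID back after imputation, one option is to keep the factor around as a lookup table. This is only a minimal sketch in base R; the `ids` vector and the other names are made up for illustration and are not taken from the posted data.

```r
# Hypothetical character IDs, standing in for test.df$ID
ids <- c("A12", "B07", "A12", "C33")

# Keep the factor so its levels can serve as a lookup table later
id_factor <- factor(ids)
id_int    <- as.integer(id_factor)   # integer codes usable as class variable

# ... imputation would happen here, using id_int as the ID column ...

# Map the integer codes back to the original labels
restored <- levels(id_factor)[id_int]
identical(restored, ids)
```

The same lookup can be applied to the ID column of each completed data set.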

You can get your predictor matrix and imputation method easily with a dry run of mice (i.e. imputation with 0 iterations).

ini<-mice(test.df,maxit=0)

pred1<-ini$predictorMatrix
pred1[,"ID"]<- -2 # set ID as class variable for the 2l.* methods (2l.pan, 2l.norm)
pred1[,"year"]<- 2 # set year as a random effect, slopes differ between individuals

A value of 1 in the predictor matrix indicates that the column variable is used as a fixed effect predictor to impute the target (row) variable, and a 0 means that it is not used. -2 indicates that the variable is a class variable (your ID) and a value of 2 indicates that the variable is to be used as a random effect. For the details you need to read up on multilevel modeling, but basically you can use year as a fixed effect to specify that each individual shows the same general growth (same effect of year for each individual on any other variable) or as a random effect to model the more complicated assumption that individuals differ in growth. You can look at your data to see if the simple model sufficiently fits your observed data or if a more complicated model is necessary (i.e. do individuals grow at roughly the same rate or not).
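To see what these codes look like in practice, here is a small sketch on a made-up panel. The data frame `toy` and its columns are assumptions for illustration only; the real predictor matrix for your data will be larger.

```r
library(mice)

# Hypothetical toy panel: 3 individuals observed in 4 years each
toy <- data.frame(
  ID   = rep(1:3, each = 4),
  year = rep(2001:2004, times = 3),
  x1   = c(1.0, NA, 3.1, 4.2, 2.0, 2.9, NA, 5.1, 0.5, 1.4, 2.6, NA),
  x2   = c(0.3, 1.1, 0.8, 1.9, 2.2, 1.7, 2.8, 3.0, 0.1, 0.9, 1.2, 1.6)
)

# Dry run: sets up the defaults without imputing anything
ini  <- mice(toy, maxit = 0)
pred <- ini$predictorMatrix

pred[, "ID"]   <- -2  # class (grouping) variable
pred[, "year"] <-  2  # random effect: slope of year may differ per ID

# Row = variable being imputed, columns = how each predictor enters its model
pred["x1", ]
```

Leaving the entry for year at 1 instead of 2 would give the simpler model in which every individual shares the same effect of year.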

Next, change your method to a mixed model. You have two general options: 2l.pan assumes variance is homogeneous within class, 2l.norm allows heterogeneous variance. Again, you need to read up and check your data (e.g. run a mixed model and see if residuals are roughly homogeneous). 2l.pan is the simpler model.
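A rough way to check whether the within-ID variance looks homogeneous is sketched below. Note the assumptions: the data are simulated, and a plain regression with ID as a fixed effect stands in for a real mixed model, so treat it as a quick screening step rather than a proper test of the imputation model.

```r
set.seed(1)

# Simulated toy panel: 5 individuals, 10 years each (not the posted data)
toy <- data.frame(ID = rep(1:5, each = 10), year = rep(1:10, times = 5))
toy$x1 <- 2 + 0.5 * toy$year + rnorm(nrow(toy))

# Stand-in for a mixed model: common slope for year, one intercept per ID
fit <- lm(x1 ~ year + factor(ID), data = toy)

# Residual variance within each ID; similar values suggest homogeneity
res_var <- tapply(residuals(fit), toy$ID, var)
res_var

# Bartlett's test of equal variances across IDs
bartlett.test(x = residuals(fit), g = toy$ID)
```

If the per-ID variances differ a lot, that would be an argument for 2l.norm over 2l.pan.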

https://www.rdocumentation.org/packages/mice/versions/3.6.0/topics/mice.impute.2l.pan
https://www.rdocumentation.org/packages/mice/versions/3.6.0/topics/mice.impute.2l.norm

# 2l.norm: mixed model with heterogeneous within-group variance
# 2l.pan:  mixed model with homogeneous within-group variance (requires the pan package)
# Switch every variable currently imputed with pmm to 2l.pan
meth1<-ini$method
meth1[which(meth1 == "pmm")] <- "2l.pan"

imputation.df<-mice(test.df,m=5,seed = 66666, method = meth1, predictorMatrix = pred1)

The higher correlation between observations within an individual is taken into account with this method. Total variance is split into variance at the ID or person level and variance at the year or observation level.

Notice that I also changed the number of datasets from m = 1 to m = 5. mice is meant for computing multiple imputations, resulting in multiple datasets. Each dataset will be slightly different, and the variance between imputations is used to reflect uncertainty about the true value underlying the missing data. If you only impute one dataset you don't get this advantage.
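To illustrate how the multiple datasets are used afterwards, here is a minimal sketch with the `nhanes` example data that ships with mice. The regression `bmi ~ age` is just a placeholder analysis and not a suggestion for your data.

```r
library(mice)

# nhanes is a small example data set included in the mice package
imp <- mice(nhanes, m = 5, seed = 66666, printFlag = FALSE)

# Fit the same analysis model on each of the 5 completed data sets
fit <- with(imp, lm(bmi ~ age))

# Combine the 5 sets of estimates; the pooled standard errors include
# the between-imputation variance
pooled <- pool(fit)
summary(pooled)
```

With m = 1 there is nothing to pool, so the uncertainty due to the missing data does not show up in the standard errors.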

Since the imputation models are more complicated, they take longer to run, but the error no longer occurs and your imputation method represents your data structure better (hopefully leading to more accurate imputations).

 iter imp variable
  1   1  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   2  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   3  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   4  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   5  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  2   1  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  2   2  x1  x2  x3  x4  x5

For multilevel modelling I'd suggest the book Multilevel Analysis by Snijders and Bosker. The mice manual also contains some information: https://www.jstatsoft.org/article/view/v045i03

Upvotes: 7
