TobiasS

Reputation: 31

Imputing based on specific columns

I'm about to impute missing values with the mice package, and I need each imputation to draw only on specific columns. I have 24 columns that measure 4 latent variables (modelled with the plspm package), 6 columns per latent variable. For columns 1-6, I want the NAs to be imputed using only the content of those 6 columns, and likewise for columns 7-12, 13-18 and 19-24.

I hope that makes sense.

My data structure is:

p1  p2  p3  p4  p5  p6  l1  l2  l3  l4  l5  l6
4   3   5   4   5   N/A 2   1   4   5   1   N/A
4   4   1   3   1   2   1   1   1   1   1   1
5   4   5   4   4   4   4   4   5   5   4   4
5   4   5   5   4   5   4   4   N/A 5   4   4
5   5   5   5   5   5   3   2   5   5   2   2
4   3   4   3   3   3   3   2   3   4   3   2
5   4   5   5   3   4   4   1   5   5   5   4
5   5   5   5   5   5   5   3   4   5   3   4
4   4   4   4   3   N/A 4   4   5   4   3   3
5   4   4   4   3   2   1   3   2   5   1   1
4   4   4   4   5   5   3   4   5   5   3   3
4   3   2   N/A 1   2   N/A 1   2   N/A 1   N/A
3   3   4   4   3   2   1   3   3   3   1   3
5   3   4   4   4   2   3   4   4   4   3   3
4   4   4   5   2   2   2   2   2   2   3   3
5   4   4   4   4   4   4   4   5   5   4   3
4   3   3   3   5   2   2   2   4   4   1   1
5   4   5   4   5   3   1   1   5   5   2   3
4   3   1   3   4   4   2   1   4   3   2   3
4   3   1   4   3   1   2   1   4   4   3   2
3   3   5   4   5   1   2   2   4   5   3   2
4   4   5   3   5   5   2   2   3   4   2   3
4   4   2   3   2   3   2   2   3   4   2   2
5   5   5   5   5   5   4   3   3   3   3   3
5   5   5   5   5   4   4   N/A 5   5   N/A N/A

So I guess it essentially comes down to splitting the data into 4 blocks and then imputing each one. I read about the blocks argument in help(mice), but I'm not sure I can actually use it for this specific task.

The code I've been using so far is:

temp_pmm <- mice(data_predict,
                  m = 3,
                  maxit = 10,
                  method = "pmm", 
                  seed = 2374)

But as I understand the package, by default it imputes each column using all the other columns as predictors, so my latent variable constructs end up informing each other, which is what I am trying to avoid.

Hope you can help me out and I appreciate any help. Thanks in advance!

Tobias

Upvotes: 1

Views: 1158

Answers (1)

TobiasS

Reputation: 31

So Dominix' suggestion of simply running separate imputations seems to be the right way to go. Thanks a lot!

For any future reference, this is how I worked it out:

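# Impute each block of six indicator columns on its own, so that only
# columns within the same block are used as predictors
test_pmm_firstv <- mice(data_predict[,c(1:6)],
                      m = 10,
                      maxit = 20,
                      method = "pmm",
                      seed = 127493)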

test_pmm_secondv <- mice(data_predict[,c(7:12)],
                      m = 10,
                      maxit = 20,
                      method = "pmm",
                      seed = 1239754111)

test_pmm_thirdv <- mice(data_predict[,c(13:18)],
                      m = 10,
                      maxit = 20,
                      method = "pmm",
                      seed = 1238603)

test_pmm_fourthv <- mice(data_predict[,c(19:24)],
                      m = 10,
                      maxit = 20,
                      method = "pmm",
                      seed = 356811)

# Extract the first of the m = 10 completed datasets from each block
data_pmm_firstv <- mice::complete(test_pmm_firstv, 1)
data_pmm_secondv <- mice::complete(test_pmm_secondv, 1)
data_pmm_thirdv <- mice::complete(test_pmm_thirdv, 1)
data_pmm_fourthv <- mice::complete(test_pmm_fourthv, 1)

# Recombine the four blocks in their original column order
data_fixed <- as.data.frame(cbind(data_pmm_firstv, data_pmm_secondv, data_pmm_thirdv, data_pmm_fourthv))

anyNA(data_fixed)
[1] FALSE
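
For anyone who would rather keep everything in one mice object: as far as I can tell, the same restriction can also be expressed in a single mice() call through its predictorMatrix argument, where a 1 in row i, column j means that column j is used as a predictor when imputing column i. The following is only a minimal sketch and not what I actually ran. The data frame is simulated stand-in data (my real data_predict is not reproduced here), and block_id and pred are hypothetical helper names introduced for illustration.

```r
library(mice)

# Assumption: simulated stand-in for data_predict
# (100 rows, 24 Likert-type columns named V1..V24)
set.seed(1)
data_predict <- as.data.frame(
  matrix(sample(1:5, 100 * 24, replace = TRUE), ncol = 24)
)

# Punch a few holes into every column
for (j in seq_along(data_predict)) {
  data_predict[sample(nrow(data_predict), 5), j] <- NA
}

# Hypothetical helper: which of the 4 blocks each column belongs to
block_id <- rep(1:4, each = 6)

# Block-diagonal predictor matrix: 1 if two columns share a block, else 0.
# Rows are the columns being imputed, columns are the predictors.
pred <- outer(block_id, block_id, "==") * 1
diag(pred) <- 0  # a column is never its own predictor
dimnames(pred) <- list(names(data_predict), names(data_predict))

# One mice() call; each column is imputed from its own block only
imp <- mice(data_predict,
            m = 3,
            maxit = 10,
            method = "pmm",
            predictorMatrix = pred,
            seed = 2374,
            printFlag = FALSE)

data_fixed <- mice::complete(imp, 1)
anyNA(data_fixed)
```

The upside of this variant is that all m imputations of all 24 columns live in one object, so the blocks do not have to be glued back together with cbind() afterwards. Whether it gives results comparable to the four separate runs on real data is something I have not checked.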

Upvotes: 1
