Reputation: 194
I've run a regression to replace missing data in a dataset and want to compare the result with what the 'mice' package by Stef van Buuren produces.
I'm following this post on Cross Validated: Link to Post
I'm also reading This, which uses similar syntax.
My code is:
library(mice)
# Impute data
imp <- mice(without_response, method = "norm.predict", m = 1)
# Store data
imp_with_mice <- complete(imp)
When I output:
imp_with_mice[impute_here,]
to get the rows that need imputing, none of the values are replaced. I originally had '?' where the data was missing. I have since tried the string 'NA' and then a real NA (without quote marks), to match the CV post.
In no case can I get mice to replace the 16 missing values in column 7 with anything at all.
Please help me with usage.
These are examples of rows where I would expect a variable to be replaced:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
24 1057013 8 4 5 1 2 NA 7 3 1
41 1096800 6 6 6 9 6 NA 7 8 1
Also, I get this display when it runs.
iter imp variable
1 1
2 1
3 1
4 1
5 1
Warning message:
“Number of logged events: 1”
Additional info:
str(without_response[impute_here,])
'data.frame': 16 obs. of 10 variables:
$ V1 : int 1057013 1096800 1183246 1184840 1193683 1197510 1241232 169356 432809 563649 ...
$ V2 : int 8 6 1 1 1 5 3 3 3 8 ...
$ V3 : int 4 6 1 1 1 1 1 1 1 8 ...
$ V4 : int 5 6 1 3 2 1 4 1 3 8 ...
$ V5 : int 1 9 1 1 1 1 1 1 1 1 ...
$ V6 : int 2 6 1 2 3 2 2 2 2 2 ...
$ V7 : chr NA NA NA NA ...
$ V8 : int 7 7 2 2 1 3 3 3 2 6 ...
$ V9 : int 3 8 1 1 1 1 1 1 1 10 ...
$ V10: int 1 1 1 1 1 1 1 1 1 1 ...
summary(without_response[impute_here,])
V1 V2 V3 V4
Min. : 61634 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.: 595517 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median :1057040 Median :3.000 Median :1.000 Median :2.500
Mean : 857578 Mean :3.375 Mean :2.438 Mean :2.875
3rd Qu.:1187051 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.250
Max. :1241232 Max. :8.000 Max. :8.000 Max. :8.000
V5 V6 V7 V8
Min. :1.000 Min. :1.000 Length:16 Min. :1.000
1st Qu.:1.000 1st Qu.:2.000 Class :character 1st Qu.:2.000
Median :1.000 Median :2.000 Mode :character Median :2.500
Mean :1.812 Mean :2.438 Mean :3.125
3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:3.250
Max. :9.000 Max. :7.000 Max. :7.000
V9 V10
Min. : 1.00 Min. :1
1st Qu.: 1.00 1st Qu.:1
Median : 1.00 Median :1
Mean : 2.75 Mean :1
3rd Qu.: 3.00 3rd Qu.:1
Max. :10.00 Max. :1
is.na(without_response[impute_here,])
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
24 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
41 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
140 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
146 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
159 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
165 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
236 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
250 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
276 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
293 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
295 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
298 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
316 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
322 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
412 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
618 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Upvotes: 0
Views: 10947
Reputation: 1624
As I understand your question and dataset (as I said before, a reproducible example would help), I suspect the problem is that V7 contains only NA and a single constant value. That is what the logged event is warning you about: mice cannot impute such a variable because it has no basis for predicting what the missing values should be.
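To check this suspicion, one option is to look at how V7 is stored and how many distinct observed values it has. The snippet below is only a sketch in base R: `raw_text` is a made-up three-column excerpt, not your data, and it assumes the file was read with `read.csv`. It also shows that a '?' placeholder makes the column come in as character (your `str()` output reports V7 as chr), whereas declaring it via `na.strings` gives real NA values. On your actual run, the logged event itself should be readable from the `loggedEvents` component of the object returned by `mice()`.

```r
# Hypothetical excerpt; the real file and its contents are not shown in the question
raw_text <- "V6,V7,V8\n2,?,7\n6,1,7\n1,1,2\n2,?,2"

# Default read: the "?" placeholder forces V7 to be stored as character
as_read <- read.csv(text = raw_text, stringsAsFactors = FALSE)

# Declaring "?" as a missing-value marker lets V7 be parsed as a number with real NAs
fixed <- read.csv(text = raw_text, na.strings = "?")

# Distinct observed (non-missing) values; a single value means the column is constant
v7_observed <- unique(fixed$V7[!is.na(fixed$V7)])
v7_is_constant <- length(v7_observed) <= 1

print(class(as_read$V7))
print(class(fixed$V7))
print(v7_observed)
print(v7_is_constant)

# On the real object from the question (assumes `imp` exists in your session):
# imp$loggedEvents
```

If the check on your own data reports more than one distinct observed value, the constant-column explanation would not apply and the storage type would be the next thing to look at.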
mice(..., method = "norm.predict") imputes plausible values from a linear regression of the variable with missing values on the other variables in your dataset, so it relies on the observed data to make its predictions. However, since V7 is constant, it has no variance and no covariance with the other variables, so no meaningful prediction is possible. Multiple imputation cannot be used in this situation. The only reasonable imputation is to assume that every value in V7 equals that constant (which is what mean imputation would give you). Be aware that this has major downsides if the assumption is invalid. Your other best option is pairwise deletion.
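To make the argument concrete, here is a rough base R sketch of regression imputation: fit a linear model on the rows where the target is observed, then predict the missing rows. It is an illustration of the idea only, not the code mice runs internally, and the data frame `d` and the helper `impute_by_regression` are hypothetical names invented for this example.

```r
# Hypothetical toy data (not the asker's dataset)
d <- data.frame(
  x          = c(1, 2, 3, 4, 5, 6),
  y_varying  = c(2.1, 3.9, 6.2, NA, 9.8, NA),
  y_constant = c(1, 1, 1, NA, 1, NA)
)

# Mean of the observed values of the constant column, for comparison later
observed_mean <- mean(d$y_constant, na.rm = TRUE)

# Regression imputation: fit on observed rows, predict the missing ones
impute_by_regression <- function(data, target, predictor) {
  missing <- is.na(data[[target]])
  fit <- lm(reformulate(predictor, response = target), data = data[!missing, ])
  data[[target]][missing] <- predict(fit, newdata = data[missing, ])
  data
}

d <- impute_by_regression(d, "y_varying", "x")
d <- impute_by_regression(d, "y_constant", "x")

print(d)
print(observed_mean)
```

For the varying column the predictor carries information, so the filled-in values follow the fitted line. For the constant column the fitted slope is zero up to rounding, so every prediction is just the constant again, which is the same thing mean imputation would produce.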
Upvotes: 6