I have a dataset with a lot of missing values, and I want to split the data by missing-value pattern. I found the package 'mice', which is very handy for summarizing the missing-value patterns. However, when I select the rows with a certain missing pattern, I get far fewer rows than the count in the missing-pattern matrix suggests.
My code is as follows.
To get the missing pattern:
library(mice)
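# md.pattern returns a matrix whose row names are the pattern frequencies;
# its last row and last column are summary counts. Convert it to a data frame
# with the frequency stored as a column.
pattern = md.pattern(data)
n_pat = nrow(pattern) - 1
# row names are character, so convert them before sorting numerically
freq = as.numeric(rownames(pattern)[seq_len(n_pat)])
pattern = data.frame(pattern[seq_len(n_pat), seq_len(ncol(pattern) - 1), drop = FALSE], row.names = NULL)
pattern$freq = freq
pattern = pattern[order(pattern$freq, decreasing = TRUE), ]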
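Two R details are easy to trip over when reshaping a matrix like the one md.pattern returns: `:` binds tighter than `-`, and row names are character strings. The sketch below uses made-up values (n, m, freq_chr are illustrative only) to show both effects.

```r
n <- 4

# `:` is evaluated before `-`, so this is (1:n) - 1, i.e. 0 1 2 3.
# A 0 index is silently dropped, so it selects the same rows as 1:(n - 1).
idx_accidental <- 1:n - 1
idx_explicit <- seq_len(n - 1)   # 1 2 3, stated directly

m <- matrix(1:8, nrow = 4)
same_rows <- identical(m[idx_accidental, ], m[idx_explicit, ])

# Sorting character frequencies is lexicographic, not numeric.
freq_chr <- c("9", "10", "100")
ord_chr <- order(freq_chr, decreasing = TRUE)              # puts "9" first
ord_num <- order(as.numeric(freq_chr), decreasing = TRUE)  # puts 100 first
```

Writing the range as seq_len(n - 1) and converting the frequencies with as.numeric() before ordering avoids relying on either behavior.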
However, when I manually count the rows that match a specific pattern from pattern, the count is much smaller.
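count = 0
# test is not defined in this snippet; it is assumed to hold the pattern of
# interest (e.g. one row of pattern)
vars = seq_len(ncol(pattern) - 1)
for (i in 1:nrow(data)) {
  # match the missingness by the entire row
  if (all(!is.na(data[i, names(data)[vars]]) == test[1, vars])) {
    count = count + 1
  }
}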
Does anyone have an idea what goes wrong? Thanks!
The data has a lot of variables (107 in total) and 70000+ observations. This code works well on the sample data nhanes in the mice package, but it goes wrong on my data file.
For example:
V1 V2 V3 V4 V5
1 NA 3 5 2
NA 3 23 2 9
NA 3 90 7 5
3 3 2 34 NA
3 NA 2 1 3
4 NA 7 3 1
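For the six example rows above, the pattern counts can also be worked out directly in base R, without mice. This is only a sketch: the names key, pattern_freq and by_pattern are made up for illustration, and the keys use 1 for missing, which is the opposite of md.pattern's convention (1 = observed).

```r
# The six example rows from the question
data <- data.frame(
  V1 = c(1, NA, NA, 3, 3, 4),
  V2 = c(NA, 3, 3, 3, NA, NA),
  V3 = c(3, 23, 90, 2, 2, 7),
  V4 = c(5, 2, 7, 34, 1, 3),
  V5 = c(2, 9, 5, NA, 3, 1)
)

# One key per row: "1" where the value is missing, "0" where it is observed,
# in the data's own column order
key <- apply(is.na(data), 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency of each pattern, most common first
pattern_freq <- sort(table(key), decreasing = TRUE)

# One data frame per pattern; names(by_pattern) are the keys
by_pattern <- split(data, key)
```

On these rows the most frequent key is 01000 (only V2 missing, three rows), and by_pattern[["01000"]] returns those rows.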
Upvotes: 0
Anyway, I checked the original code for md.pattern in the mice package. It is based on Schafer's prelim.norm function rather than on checking the missing-value pattern row by row.
I found that the count function in the plyr package really does the trick. I wrote this function to return the top n missing patterns in the dataset, where x is the data frame. It works well in my case.
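library(plyr)
miss.pattern <- function(x, topn) {
  # find missingness patterns, 1 represents missing
  r <- 1 * data.frame(is.na(x))
  # count() returns each distinct row of r with its frequency in a freq column
  pattern <- data.frame(count(r))
  pattern <- pattern[order(-pattern$freq), ]
  # head() avoids NA rows when there are fewer than topn patterns
  return(head(pattern, topn))
}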
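To get back to the original goal of separating the data by pattern, the rows behind a given pattern still have to be pulled out. The mice documentation notes that md.pattern sorts its rows and columns by amount of missing information, so matching by column position against the original data can go out of step; matching by column name avoids that. Below is a rough base R sketch along those lines. The function names miss.pattern.base and rows.with.pattern are hypothetical, and the code assumes x has no column already called freq.

```r
# Base R counterpart of the function above (hypothetical name), 1 represents missing
miss.pattern.base <- function(x, topn) {
  r <- as.data.frame(1 * is.na(x))
  r$freq <- 1
  # sum the helper column within each distinct combination of 0/1 flags
  pattern <- aggregate(freq ~ ., data = r, FUN = sum)
  pattern <- pattern[order(-pattern$freq), ]
  head(pattern, topn)
}

# Rows of x whose missingness matches one row of the pattern table,
# compared by column name rather than by position
rows.with.pattern <- function(x, pattern_row) {
  vars <- names(x)
  target <- as.logical(unlist(pattern_row[vars]))
  keep <- apply(is.na(x), 1, function(r) all(r == target))
  x[keep, , drop = FALSE]
}
```

For example, rows.with.pattern(x, miss.pattern.base(x, 1)) should return the rows that share the most frequent pattern, and its row count should equal that pattern's freq.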
Upvotes: 3