Reputation: 13123
I have determined how to identify all unique patterns of missing observations in a data set. Now I would like to select all rows in that data set with a given pattern of missing observations. I would like to do this iteratively so that if there are n patterns of missing observations in the data set I end up with n data sets each containing only 1 pattern of missing observations.
I know how to do this, but my method is not very efficient and is not general. I am hoping to learn a more efficient and more general approach because my real data sets are much larger and more variable than in the example below.
Here is an example data set and the code I am using. I have not included the code I used to create the matrix zzz from the matrix dd, but I can add it if that helps.
dd <- matrix(c(
1, 0, 1, 1,
NA, 1, 1, 0,
NA, 0, 0, 0,
NA, 1,NA, 1,
NA, 1, 1, 1,
0, 0, 1, 0,
NA, 0, 0, 0,
0,NA,NA,NA,
1,NA,NA,NA,
1, 1, 1, 1,
NA, 1, 1, 0),
nrow=11, byrow=TRUE)
zzz <- matrix(c(
1, 1, 1, 1,
NA, 1, 1, 1,
NA, 1,NA, 1,
1,NA,NA,NA
), nrow=4, byrow=TRUE)
for(jj in 1:dim(zzz)[1]) {
  ddd <-
    dd[
      ((dd[, 1]%in%c(0,1) & zzz[jj, 1]%in%c(0,1)) |
       (is.na(dd[, 1]) & is.na(zzz[jj, 1]))) &
      ((dd[, 2]%in%c(0,1) & zzz[jj, 2]%in%c(0,1)) |
       (is.na(dd[, 2]) & is.na(zzz[jj, 2]))) &
      ((dd[, 3]%in%c(0,1) & zzz[jj, 3]%in%c(0,1)) |
       (is.na(dd[, 3]) & is.na(zzz[jj, 3]))) &
      ((dd[, 4]%in%c(0,1) & zzz[jj, 4]%in%c(0,1)) |
       (is.na(dd[, 4]) & is.na(zzz[jj, 4]))),]
  print(ddd)
}
The 4 resulting data sets in this example are:
a)
1 0 1 1
0 0 1 0
1 1 1 1
b)
NA 1 1 0
NA 0 0 0
NA 1 1 1
NA 0 0 0
NA 1 1 0
c)
NA 1 NA 1
d)
0 NA NA NA
1 NA NA NA
Is there a more general and more efficient method of doing the same thing? In the example above the 4 resulting data sets are not saved, but I do save them with my real data.
Thank you for any advice.
Mark Miller
Upvotes: 2
Views: 2036
Reputation: 17527
Not completely sure I understand the question, but here's a stab at it...
The first thing you want to do is figure out which elements are NA, and which aren't. For that, you can use the is.na() function.
is.na(dd)
will generate a matrix of the same size as dd containing TRUE where the value was NA, and FALSE elsewhere.
You then want to find the unique patterns in your matrix. For that, you want the unique() function, which accepts a MARGIN argument (note the capitals; 1 means rows and is the default), allowing you to find only the unique rows of a matrix.
zzz <- unique(is.na(dd), MARGIN=1)
creates a matrix similar to your zzz matrix, except that it holds TRUE where a value is missing and FALSE elsewhere. You could, of course, replace each TRUE with NA and each FALSE with 1 so that it would be identical to your matrix.
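The substitution mentioned above could be sketched roughly like this. The names pat and zzz_num are placeholders of mine, not something from the question, and the sketch assumes dd is the example matrix from the question:

```r
# Example data from the question
dd <- matrix(c(
   1,  0,  1,  1,
  NA,  1,  1,  0,
  NA,  0,  0,  0,
  NA,  1, NA,  1,
  NA,  1,  1,  1,
   0,  0,  1,  0,
  NA,  0,  0,  0,
   0, NA, NA, NA,
   1, NA, NA, NA,
   1,  1,  1,  1,
  NA,  1,  1,  0),
  nrow = 11, byrow = TRUE)

# Logical pattern matrix: TRUE = missing, FALSE = present
pat <- unique(is.na(dd))

# Numeric version: NA where missing, 1 where present.
# ifelse() keeps the dimensions of its first argument, so this stays a matrix.
zzz_num <- ifelse(pat, NA, 1)
```

For the rest of the approach the logical version is the more convenient one to compare against, so this conversion is only needed if you want the NA/1 layout for display.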
You can then go a few directions from here to try to sort these into different datasets. Unfortunately, I think you're going to need one loop here.
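results <- list()
for (r in seq_len(nrow(dd))) {
  # Index of the row of zzz whose NA pattern matches row r of dd
  ind <- which(apply(zzz, 1, function(x) all(x == is.na(dd[r, ]))))
  if (ind %in% names(results)) {
    # Pattern already seen: append this row to its matrix
    results[[ind]] <- rbind(results[[ind]], dd[r, ])
  } else {
    # First row with this pattern: start a new one-row matrix
    results[[ind]] <- dd[r, , drop = FALSE]
    names(results)[ind] <- ind
  }
}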
At that point, you have a list which contains all of the rows of dd, grouped by pattern of NAs. The pattern in row 1 of zzz corresponds to element 1 of results, and likewise for the remaining rows of zzz.
Upvotes: 1
Reputation: 32401
# Missing value patterns (TRUE=missing, FALSE=present)
patterns <- unique( is.na(dd) )
result <- list()
for( i in seq_len(nrow(patterns))) {
  # Rows with this pattern
  rows <- apply( dd, 1, function(u) all( is.na(u) == patterns[i,] ) )
  # drop=FALSE keeps a matrix even when only one row matches
  result <- append( result, list(dd[rows, , drop=FALSE]) )
}
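If calling apply() once per pattern becomes slow on larger data, one possible alternative is to build a string key for each row from its missingness pattern and split the row indices on that key. This is only a sketch and has not been benchmarked on real data. The names key, groups and by_pattern are placeholders of mine, and dd is assumed to be the example matrix from the question:

```r
# Example data from the question
dd <- matrix(c(
   1,  0,  1,  1,
  NA,  1,  1,  0,
  NA,  0,  0,  0,
  NA,  1, NA,  1,
  NA,  1,  1,  1,
   0,  0,  1,  0,
  NA,  0,  0,  0,
   0, NA, NA, NA,
   1, NA, NA, NA,
   1,  1,  1,  1,
  NA,  1,  1,  0),
  nrow = 11, byrow = TRUE)

# One key per row: "1" marks a missing value, "0" a present one
key <- apply(is.na(dd), 1, function(u) paste(as.integer(u), collapse = ""))

# Group row indices by key, keeping patterns in order of first appearance
groups <- split(seq_len(nrow(dd)), factor(key, levels = unique(key)))

# One matrix per pattern; drop = FALSE keeps single-row groups as matrices
by_pattern <- lapply(groups, function(idx) dd[idx, , drop = FALSE])
```

Here by_pattern[["1000"]] would hold the rows in which only the first column is missing, and the list can be saved or looped over like any other. The number of columns is not hard-coded anywhere, so the same lines should carry over to wider matrices.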
Upvotes: 4