oercim
oercim

Reputation: 1848

Grouping columns with same missing value patterns in R

Let I have such a data frame(df) with missing values(NA)

df:

head1    head2  head3   head4  head5
-----    -----  -----   -----  -----
65       25     12      65     76
78       5      NA      12     NA
NA       NA     12      5      51
76       32     6       94     11
67       32     NA      1      NA

I want to cretae a list(list1) each element consists of data frames with same NA patterns.

For this example:

How can I create such a list using R? I will be very glad for any help. Thanks a lot.

@akrun, I realized that your code works fine for data frames where NA's are not common for each column. But does not work for the below data frame.

df1<-data.frame(head1=c(65,78,NA,76,67),
                head2=c(25,5,NA,32,32),
                head3=c(12,12,NA,6,NA),
                head4=c(65,12,5,94,1),
                head5=c(76,NA,51,11,NA)
)



i1 <- which(is.na(df1), arr.ind=TRUE)
l1 <- unique(split(i1[,2], i1[,1]))
i2 <- c(l1, setdiff(seq_along(df1), unlist(l1)))
l2 <- lapply(i2, function(i) df1[i]) 
l2[order(sapply(l2, function(x) colnames(x)[1]))] 

The result is:

[[1]]
  head1 head2 head3
1    65    25    12
2    78     5    12
3    NA    NA    NA
4    76    32     6
5    67    32    NA

[[2]]
  head3 head5
1    12    76
2    12    NA
3    NA    51
4     6    11
5    NA    NA

[[3]]
  head4
1    65
2    12
3     5
4    94
5     1

[[4]]
  head5
1    76
2    NA
3    51
4    11
5    NA

Upvotes: 3

Views: 142

Answers (2)

alexis_laz
alexis_laz

Reputation: 13132

Using the indices of NA values in each column, you could map each column to a "character" value:

map = sapply(df1, function(X) paste(which(is.na(X)), collapse = ";"))
map
#head1 head2 head3 head4 head5 
#  "3"   "3" "3;5"    "" "2;5"

And, then, split the columns accordingly:

split.default(df1, match(map, unique(map)))
#> str(.Last.value)
#List of 4
# $ 1:'data.frame':      5 obs. of  2 variables:
#  ..$ head1: num [1:5] 65 78 NA 76 67
#  ..$ head2: num [1:5] 25 5 NA 32 32
# $ 2:'data.frame':      5 obs. of  1 variable:
#  ..$ head3: num [1:5] 12 12 NA 6 NA
# $ 3:'data.frame':      5 obs. of  1 variable:
#  ..$ head4: num [1:5] 65 12 5 94 1
# $ 4:'data.frame':      5 obs. of  1 variable:
#  ..$ head5: num [1:5] 76 NA 51 11 NA

And performance seems tolerable for data of your actual size:

set.seed(666)
DF = as.data.frame(matrix(sample(c(NA, 1:10), 115000 * 100, TRUE), 115000, 100))
DF = DF[, sample(ncol(DF), 140, TRUE)]

system.time({
    map = sapply(DF, function(X) paste(which(is.na(X)), collapse = ";"))  
    split.default(DF, match(map, unique(map)))
})
#   user  system elapsed 
#   1.64    0.00    1.67

...unless you have ~60% NAs in each column:

set.seed(911)
DF2 = as.data.frame(replicate(100, sample(c(NA, 1:2), 115000, TRUE, c(0.6, 0.2, 0.2)), simplify = FALSE))
DF2 = DF2[, sample(ncol(DF2), 140, TRUE)]

system.time({
    map = sapply(DF2, function(X) paste(which(is.na(X)), collapse = ";"))  
    split.default(DF2, match(map, unique(map)))
})
#   user  system elapsed 
#   8.70    0.09    8.99

Upvotes: 1

akrun
akrun

Reputation: 887991

We get the row/column index of NA elements with which and specifying the arr.ind=TRUE. We split the "col" by "row", get the unique elements of index, if there are some columns missing i.e. that have no NA values, we can concatenate (c) that to the end of the list. Then, subset the dataset using the index by looping over the list (lapply(i2,..), and we can order the output list ('l2') by the first column name in each list element.

i1 <- which(is.na(df1), arr.ind=TRUE)
l1 <- unique(split(i1[,2], i1[,1]))
i2 <- c(l1, setdiff(seq_along(df1), unlist(l1)))
l2 <- lapply(i2, function(i) df1[i]) 
l2[order(sapply(l2, function(x) colnames(x)[1]))]
#[[1]]
# head1 head2
#1    65    25
#2    78     5
#3    NA    NA
#4    76    32
#5    67    32

#[[2]]
#  head3 head5
#1    12    76
#2    NA    NA
#3    12    51
#4     6    11
#5    NA    NA

#[[3]]
#  head4
#1    65
#2    12
#3     5
#4    94
#5     1

Upvotes: 1

Related Questions