Reputation: 1848
Let I have such a data frame(df) with missing values(NA)
df:
head1 head2 head3 head4 head5
----- ----- ----- ----- -----
65 25 12 65 76
78 5 NA 12 NA
NA NA 12 5 51
76 32 6 94 11
67 32 NA 1 NA
I want to cretae a list(list1) each element consists of data frames with same NA patterns.
For this example:
How can I create such a list using R? I will be very glad for any help. Thanks a lot.
@akrun, I realized that your code works fine for data frames where NA's are not common for each column. But does not work for the below data frame.
df1<-data.frame(head1=c(65,78,NA,76,67),
head2=c(25,5,NA,32,32),
head3=c(12,12,NA,6,NA),
head4=c(65,12,5,94,1),
head5=c(76,NA,51,11,NA)
)
i1 <- which(is.na(df1), arr.ind=TRUE)
l1 <- unique(split(i1[,2], i1[,1]))
i2 <- c(l1, setdiff(seq_along(df1), unlist(l1)))
l2 <- lapply(i2, function(i) df1[i])
l2[order(sapply(l2, function(x) colnames(x)[1]))]
The result is:
[[1]]
head1 head2 head3
1 65 25 12
2 78 5 12
3 NA NA NA
4 76 32 6
5 67 32 NA
[[2]]
head3 head5
1 12 76
2 12 NA
3 NA 51
4 6 11
5 NA NA
[[3]]
head4
1 65
2 12
3 5
4 94
5 1
[[4]]
head5
1 76
2 NA
3 51
4 11
5 NA
Upvotes: 3
Views: 142
Reputation: 13132
Using the indices of NA
values in each column, you could map each column to a "character" value:
map = sapply(df1, function(X) paste(which(is.na(X)), collapse = ";"))
map
#head1 head2 head3 head4 head5
# "3" "3" "3;5" "" "2;5"
And, then, split
the columns accordingly:
split.default(df1, match(map, unique(map)))
#> str(.Last.value)
#List of 4
# $ 1:'data.frame': 5 obs. of 2 variables:
# ..$ head1: num [1:5] 65 78 NA 76 67
# ..$ head2: num [1:5] 25 5 NA 32 32
# $ 2:'data.frame': 5 obs. of 1 variable:
# ..$ head3: num [1:5] 12 12 NA 6 NA
# $ 3:'data.frame': 5 obs. of 1 variable:
# ..$ head4: num [1:5] 65 12 5 94 1
# $ 4:'data.frame': 5 obs. of 1 variable:
# ..$ head5: num [1:5] 76 NA 51 11 NA
And performance seems tolerable for data of your actual size:
set.seed(666)
DF = as.data.frame(matrix(sample(c(NA, 1:10), 115000 * 100, TRUE), 115000, 100))
DF = DF[, sample(ncol(DF), 140, TRUE)]
system.time({
map = sapply(DF, function(X) paste(which(is.na(X)), collapse = ";"))
split.default(DF, match(map, unique(map)))
})
# user system elapsed
# 1.64 0.00 1.67
...unless you have ~60% NA
s in each column:
set.seed(911)
DF2 = as.data.frame(replicate(100, sample(c(NA, 1:2), 115000, TRUE, c(0.6, 0.2, 0.2)), simplify = FALSE))
DF2 = DF2[, sample(ncol(DF2), 140, TRUE)]
system.time({
map = sapply(DF2, function(X) paste(which(is.na(X)), collapse = ";"))
split.default(DF2, match(map, unique(map)))
})
# user system elapsed
# 8.70 0.09 8.99
Upvotes: 1
Reputation: 887991
We get the row/column index of NA elements with which
and specifying the arr.ind=TRUE
. We split
the "col" by "row", get the unique
elements of index, if there are some columns missing i.e. that have no NA values, we can concatenate (c
) that to the end of the list
. Then, subset
the dataset using the index by looping over the list
(lapply(i2,..
), and we can order
the output list
('l2') by the first column name in each list
element.
i1 <- which(is.na(df1), arr.ind=TRUE)
l1 <- unique(split(i1[,2], i1[,1]))
i2 <- c(l1, setdiff(seq_along(df1), unlist(l1)))
l2 <- lapply(i2, function(i) df1[i])
l2[order(sapply(l2, function(x) colnames(x)[1]))]
#[[1]]
# head1 head2
#1 65 25
#2 78 5
#3 NA NA
#4 76 32
#5 67 32
#[[2]]
# head3 head5
#1 12 76
#2 NA NA
#3 12 51
#4 6 11
#5 NA NA
#[[3]]
# head4
#1 65
#2 12
#3 5
#4 94
#5 1
Upvotes: 1