Reputation: 37
I have a problem and would appreciate a hand. I tried to come up with a solution, but I have no idea how to work it out.
Please use this to recreate my dataframe.
df1 <- structure(list(A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L,
14L, 46L, 19L, 93L, 5L, 94L), A2 = c(50L, NA, 73L, 58L, 47L,
74L, 39L, NA, NA, NA, NA, NA, NA, NA), A3 = c(NA, 38L, 10L, 41L,
NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)), .Names = c("A1",
"A2", "A3"), class = "data.frame", row.names = c(NA, -14L))
I have this dataframe:
A1 A2 A3
87 50 NA
67 NA 38
80 73 10
36 58 41
71 47 NA
6 74 66
26 39 NA
15 NA 7
14 NA 29
46 NA NA
19 NA 70
93 NA 23
5 NA 46
94 NA 55
How can I subset the data frame so that it keeps only the columns with at least 7 non-missing observations? The desired output looks like this (every remaining column has >= 7 observations):
A1 A3
87 NA
67 38
80 10
36 41
71 NA
6 66
26 NA
15 7
14 29
46 NA
19 70
93 23
5 46
94 55
I welcome any solution that can generalize to more columns.
Upvotes: 2
Views: 399
Reputation: 26343
Try
df1[, colSums(!is.na(df1)) >= 7]
# A1 A3
#1 87 NA
#2 67 38
#3 80 10
#4 36 41
#5 71 NA
#6 6 66
#7 26 NA
#8 15 7
#9 14 29
#10 46 NA
#11 19 70
#12 93 23
#13 5 46
#14 94 55
Step by step
What you need to do first is to find out which values of your data are not missing.
!is.na(df1)
This returns a logical matrix
# A1 A2 A3
# [1,] TRUE TRUE FALSE
# [2,] TRUE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE
# [5,] TRUE TRUE FALSE
# [6,] TRUE TRUE TRUE
# [7,] TRUE TRUE FALSE
# [8,] TRUE FALSE TRUE
# [9,] TRUE FALSE TRUE
#[10,] TRUE FALSE FALSE
#[11,] TRUE FALSE TRUE
#[12,] TRUE FALSE TRUE
#[13,] TRUE FALSE TRUE
#[14,] TRUE FALSE TRUE
Use colSums to find out how many observations per column are not missing
colSums(!is.na(df1))
#A1 A2 A3
#14 6 10
Apply your condition (at least 7 non-missing observations per column)
colSums(!is.na(df1)) >= 7
# A1 A2 A3
# TRUE FALSE TRUE
Finally, you need to use this vector to subset your data
df1[, colSums(!is.na(df1)) >= 7]
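One caveat worth sketching: if only a single column passes the test, `[` simplifies the result to a plain vector unless you pass drop = FALSE. The threshold of 11 below is a made-up value for illustration (it is not from the question), chosen so that only A1 qualifies; the data frame is rebuilt with data.frame() just to keep the snippet self-contained.

```r
# Recreate the example data from the question.
df1 <- data.frame(
  A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L, 14L, 46L, 19L, 93L, 5L, 94L),
  A2 = c(50L, NA, 73L, 58L, 47L, 74L, 39L, NA, NA, NA, NA, NA, NA, NA),
  A3 = c(NA, 38L, 10L, 41L, NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)
)

# Hypothetical stricter threshold: only A1 (14 observations) passes.
keep <- colSums(!is.na(df1)) >= 11

as_vector <- df1[, keep]                # collapses to a plain integer vector
as_frame  <- df1[, keep, drop = FALSE]  # stays a one-column data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
```

If downstream code expects a data frame regardless of how many columns survive, drop = FALSE is probably the safer default.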
Turn this into a function if you need it regularly
almost_complete_cols <- function(data, min_obs) {
  data[, colSums(!is.na(data)) >= min_obs, drop = FALSE]
}
almost_complete_cols(df1, 7)
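Two variations that might be useful, offered as a rough sketch rather than the one right way. The first uses base R's Filter(), which treats the data frame as a list of columns. The second expresses the cutoff as a share of rows via colMeans(); the 0.5 cutoff is my own assumption, picked because 7 out of 14 rows is exactly one half. The names by_count and by_share are only illustrative.

```r
# Recreate the example data from the question.
df1 <- data.frame(
  A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L, 14L, 46L, 19L, 93L, 5L, 94L),
  A2 = c(50L, NA, 73L, 58L, 47L, 74L, 39L, NA, NA, NA, NA, NA, NA, NA),
  A3 = c(NA, 38L, 10L, 41L, NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)
)

# Filter() keeps the columns for which the predicate returns TRUE.
by_count <- Filter(function(col) sum(!is.na(col)) >= 7, df1)

# colMeans() of the logical matrix is the share of non-missing values
# per column; 0.5 is an assumed cutoff, not part of the question.
by_share <- df1[, colMeans(!is.na(df1)) >= 0.5, drop = FALSE]

names(by_count)  # "A1" "A3"
names(by_share)  # "A1" "A3"
```

A share-based cutoff may generalize better when the number of rows varies between data sets, while the count-based version matches the question as asked.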
Upvotes: 6