Reputation: 883
How do I filter a dataset by a specific value that can occur anywhere in the data frame and not necessarily under any one column or row ?
Suppose I have a data frame that is like this.
id gender group Student_Math_1 Student_Math_2 Student_Read_1 Student_Read_2
46 M Red 23 45 37 56
46 M Red 34 36 33 78
46 M Red 56 63 58 NA
62 F Blue 59 NA NA 68
62 F Blue NA 68 87 73
38 M Red 78 57 NA 65
38 M Red NA 75 54 NA
17 F Blue 74 NA 56 72
17 F Blue 75 61 NA 79
17 F Blue NA 74 43 81
And I am trying to subset this data frame such that I retain all the rows and columns that contain the value 68
regardless of where it occurs within the data frame.
The final output would be
id gender group Student_Math_1 Student_Math_2 Student_Read_1 Student_Read_2
62 F Blue 59 NA NA 68
62 F Blue NA 68 87 73
Any tips or suggestions are welcome. Thanks in advance.
df = structure(list(id = c(46, 46, 46, 62, 62, 38, 38, 17, 17, 17),
gender = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
1L), .Label = c("F", "M"), class = "factor"), group = structure(c(2L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("Blue", "Red"
), class = "factor"), Student_Math_1 = c(23, 34, 56, 59,
NA, 78, NA, 74, 75, NA), Student_Math_2 = c(45, 36, 63, NA,
68, 57, 75, NA, 61, 74), Student_Read_1 = c(37, 33, 58, NA,
87, NA, 54, 56, NA, 43), Student_Read_2 = c(56, 78, NA, 68,
73, 65, NA, 72, 79, 81)), .Names = c("id", "gender", "group",
"Student_Math_1", "Student_Math_2", "Student_Read_1", "Student_Read_2"
), row.names = c(NA, -10L), class = "data.frame")
Upvotes: 1
Views: 149
Reputation: 73265
How about:
## use data from "Student_Math_1" column to "Student_Read_2" column
df[rowSums(df[4:7] == 68, na.rm = TRUE) > 0, ]
# id gender group Student_Math_1 Student_Math_2 Student_Read_1 Student_Read_2
#4 62 F Blue 59 NA NA 68
#5 62 F Blue NA 68 87 73
Note, df[4:7] == 68
returns a logical matrix (with NA
), and we take rowSums
with na.rm = TRUE
. During such arithmetic operation, TRUE
is seen 1 and FALSE
is seen 0.
Follow-up
Thanks to Ben Bolker for reminding this more readable solution, and you certainly need to grasp it, if you are learning R:
df[apply(df[4:7] == 68, 1L, any, na.rm = TRUE), ]
which applies rowwise any
(with na.rm = TRUE
). I can't remember where I compared these two in terms of performance. But I wouldn't bother making a quick experiment:
library(microbenchmark)
## For simplicity / neatness, I generate a logical matrix `X` without `NA`
X <- matrix(sample(c(TRUE, FALSE), 2000 * 10, replace = TRUE), 2000)
## also measuring 989's solution
microbenchmark(ZL = rowSums(X) > 0,
Ben = apply(X, 1L, any),
"989" = unique(which(X, arr.ind = T)[,1]))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ZL 144.24 149.76 183.3516 164.86 172.48 2077.80 100 a
# Ben 5610.08 5730.78 6003.0660 5779.20 5861.46 8021.72 100 c
# 989 1571.72 1639.58 2033.4224 1664.78 1721.18 5339.48 100 b
Upvotes: 3
Reputation: 12937
Alternatively,
df[unique(which(df==68, arr.ind = T)[,1]),]
# id gender group Student_Math_1 Student_Math_2 Student_Read_1 Student_Read_2
#5 62 F Blue NA 68 87 73
#4 62 F Blue 59 NA NA 68
In this case, you don't need to care about the position of the columns or where NA
s are appeared. unique
is used in case 68 is appeared more than once per row.
Upvotes: 2