Reputation: 369
I would like to filter a data frame by a set of user defined columns and values.
Example data: test
col1 col2 col3
1 1 a 4
2 2 b 5
3 3 c 6
4 3 c 7
If the user specified to use c("col1", "col2")
to filter value c(3, "c")
, I would expect the returned results to be the last two rows.
I've tried the following:
test[test[c("col1", "col2")]==c(3,"c"),]
However it gives the following results:
col1 col2 col3
3 3 c 6
NA NA NA NA
Any idea why the last row is all NA? And how can I fix this?
Upvotes: 2
Views: 83
Reputation: 5915
This is to extend akrun answer to a more real world example where data in the searched columns may contain NA
. In those case, one will get "NA" rows in the final output which is messy and may confuse downstream analysis. In this example, I spiked columns num
and col2
with NA
values. Only the NA
in col2
has effect because it is used in search. The root of this problem is anything that operates with NA
gives NA
(e.g. NA == 1 = NA
), except is.na
mydf <- data.frame(num = c(1:3, NA, 5, 6), col1 = c('a', 'b', 'a', 'b', 'c', 'd'), col2 = c('A', 'b', 'A', 'B', NA, 'D'), col3 = as.character(letters)[1:6])
getrows = mydf[c("col2", "col1")] == list("A", "a") # Mixing up column orders for fun
getrows = rowSums(getrows) == 2
getrows
mydf[getrows, ] # This gives undesired NA rows
# Use which to get exact row number
which(getrows)
mydf[which(getrows),]
Compare the output below:
> mydf <- data.frame(num = c(1:3, NA, 5, 6), col1 = c('a', 'b', 'a', 'b', 'c', 'd'), col2 = c('A', 'b', 'A', 'B', NA, 'D'), col3 = as.character(letters)[1:6])
> getrows = mydf[c("col2", "col1")] == list("A", "a") # Mixing up column orders for fun
> getrows = rowSums(getrows) == 2
> getrows
[1] TRUE FALSE TRUE FALSE NA FALSE
> mydf[getrows, ] # This gives undesired NA rows
num col1 col2 col3
1 1 a A a
3 3 a A c
NA NA <NA> <NA> <NA>
> # Use which to get exact row number
> which(getrows)
[1] 1 3
> mydf[which(getrows),]
num col1 col2 col3
1 1 a A a
3 3 a A c
Upvotes: 0
Reputation: 371
Another dplyr
version with tidyeval
library(dplyr)
column_equals <- function(df, ...) {
conditions <- quos(...)
df %>%
filter(!!!conditions)
}
test %>%
column_equals(col1 == 3 & col2 == "c")
Upvotes: 0
Reputation: 10372
Here my dplyr
solution:
library(dplyr)
test %>%
rowwise() %>%
mutate(con = if_else(col1 %in% 3 & col2 %in% "c",TRUE, FALSE)) %>%
filter(con == TRUE) %>%
select(-con) %>%
ungroup
Upvotes: 0
Reputation: 887301
We need to get the rowindex correctly. It could be either a vector
of logical
or numeric
index. In the comparison (it is better to use list
instead of c
as we don't want to mix classes) we get a logical matrix
that needs to be reduced to a vector. One option is rowSums
and check whether the sum of each row is equal to 2 i.e. the number of columns used for comparison and subset the rows
test[rowSums(test[c("col1", "col2")] == list(3, 'c'))==2,]
# col1 col2 col3
#3 3 c 6
#4 3 c 7
Now, we look at the difference in approach using c
and list
test[c("col1", "col2")]==c(3,"c")
# col1 col2
#1 FALSE FALSE
#2 FALSE FALSE
#3 TRUE FALSE
#4 FALSE TRUE
Here, it is comparing elements by recycling the elements one after the another i.e. for 'col1', 1 is compared with 3, then 2 with 'c', followed by recycling of the vector i.e. 3 with 3 and again the next 3 with 'c'. It follows the same way with the next column.
test[c("col1", "col2")]== list(3,"c")
# col1 col2
#1 FALSE FALSE
#2 FALSE FALSE
#3 TRUE TRUE #note the change
#4 TRUE TRUE
While, here, it compares the first column elements with the first element of list
i.e. the list
element is replicated or recycled followed by comparing the second column with second list
element
Note that there are 8 elements i.e. 4 per each column. So, it happens that there are 2 TRUE elements in the first case and 4 TRUE in the second case and there are 8 elements, however we have only 4 rows, so when the second column of logical matrix is not having rows, thus it creates the NA row for the TRUE value
test[test[c("col1", "col2")]==c(3,"c"),]
# col1 col2 col3
#3 3 c 6
#NA NA <NA> NA
Similarly, there are 2 TRUE in each column, creating again double the number of NA rows
test[test[c("col1", "col2")]==list(3,"c"),]
# col1 col2 col3
#3 3 c 6
#4 3 c 7
#NA NA <NA> NA
#NA.1 NA <NA> NA
Suppose, we compare also the 3rd column, then there will be an extra NA row
test[test==list(3,"c", 5),]
# col1 col2 col3
#3 3 c 6
#4 3 c 7
#NA NA <NA> NA
#NA.1 NA <NA> NA
#NA.2 NA <NA> NA
Upvotes: 4