noname
noname

Reputation: 369

Data frame dynamic filtering

I would like to filter a data frame by a set of user defined columns and values.

Example data: test

  col1 col2 col3
1  1    a    4
2  2    b    5
3  3    c    6
4  3    c    7

If the user specified to use c("col1", "col2") to filter value c(3, "c"), I would expect the returned results to be the last two rows.

I've tried the following:

test[test[c("col1", "col2")]==c(3,"c"),]

However it gives the following results:

  col1 col2 col3
3  3    c    6
NA NA   NA   NA

Any idea why the last row is all NA? And how can I fix this?

Upvotes: 2

Views: 83

Answers (4)

biocyberman
biocyberman

Reputation: 5915

This is to extend akrun answer to a more real world example where data in the searched columns may contain NA. In those case, one will get "NA" rows in the final output which is messy and may confuse downstream analysis. In this example, I spiked columns num and col2 with NA values. Only the NA in col2 has effect because it is used in search. The root of this problem is anything that operates with NA gives NA (e.g. NA == 1 = NA), except is.na

mydf <- data.frame(num = c(1:3, NA, 5, 6), col1 = c('a', 'b', 'a', 'b', 'c', 'd'), col2 = c('A', 'b', 'A', 'B', NA, 'D'), col3 = as.character(letters)[1:6])
getrows = mydf[c("col2", "col1")] == list("A", "a") # Mixing up column orders for fun
getrows = rowSums(getrows) == 2
getrows
mydf[getrows, ] # This gives undesired NA rows
# Use which to get exact row number
which(getrows)
mydf[which(getrows),] 

Compare the output below:

> mydf <- data.frame(num = c(1:3, NA, 5, 6), col1 = c('a', 'b', 'a', 'b', 'c', 'd'), col2 = c('A', 'b', 'A', 'B', NA, 'D'), col3 = as.character(letters)[1:6])
> getrows = mydf[c("col2", "col1")] == list("A", "a") # Mixing up column orders for fun
> getrows = rowSums(getrows) == 2
> getrows
[1]  TRUE FALSE  TRUE FALSE    NA FALSE
> mydf[getrows, ] # This gives undesired NA rows
   num col1 col2 col3
1    1    a    A    a
3    3    a    A    c
NA  NA <NA> <NA> <NA>
> # Use which to get exact row number
> which(getrows)
[1] 1 3
> mydf[which(getrows),]
  num col1 col2 col3
1   1    a    A    a
3   3    a    A    c

Upvotes: 0

Flo.P
Flo.P

Reputation: 371

Another dplyrversion with tidyeval

library(dplyr)
column_equals <- function(df, ...) {
  conditions <- quos(...)
  df %>%  
    filter(!!!conditions)
}

test %>%  
  column_equals(col1 == 3 & col2 == "c")

Upvotes: 0

J_F
J_F

Reputation: 10372

Here my dplyr solution:

library(dplyr)
test %>% 
  rowwise() %>% 
  mutate(con = if_else(col1 %in% 3 & col2 %in% "c",TRUE, FALSE)) %>% 
  filter(con == TRUE) %>% 
  select(-con)  %>% 
  ungroup

Upvotes: 0

akrun
akrun

Reputation: 887301

We need to get the rowindex correctly. It could be either a vector of logical or numeric index. In the comparison (it is better to use list instead of c as we don't want to mix classes) we get a logical matrix that needs to be reduced to a vector. One option is rowSums and check whether the sum of each row is equal to 2 i.e. the number of columns used for comparison and subset the rows

test[rowSums(test[c("col1", "col2")] == list(3, 'c'))==2,]
#  col1 col2 col3
#3    3    c    6
#4    3    c    7

Now, we look at the difference in approach using c and list

test[c("col1", "col2")]==c(3,"c")
#  col1  col2
#1 FALSE FALSE
#2 FALSE FALSE
#3  TRUE FALSE
#4 FALSE  TRUE

Here, it is comparing elements by recycling the elements one after the another i.e. for 'col1', 1 is compared with 3, then 2 with 'c', followed by recycling of the vector i.e. 3 with 3 and again the next 3 with 'c'. It follows the same way with the next column.

test[c("col1", "col2")]== list(3,"c")
#   col1  col2
#1 FALSE FALSE
#2 FALSE FALSE
#3  TRUE  TRUE  #note the change
#4  TRUE  TRUE

While, here, it compares the first column elements with the first element of list i.e. the list element is replicated or recycled followed by comparing the second column with second list element

Note that there are 8 elements i.e. 4 per each column. So, it happens that there are 2 TRUE elements in the first case and 4 TRUE in the second case and there are 8 elements, however we have only 4 rows, so when the second column of logical matrix is not having rows, thus it creates the NA row for the TRUE value

test[test[c("col1", "col2")]==c(3,"c"),]
#   col1 col2 col3
#3     3    c    6
#NA   NA <NA>   NA

Similarly, there are 2 TRUE in each column, creating again double the number of NA rows

test[test[c("col1", "col2")]==list(3,"c"),]   
#   col1 col2 col3
#3       3    c    6
#4       3    c    7 
#NA     NA <NA>   NA
#NA.1   NA <NA>   NA

Suppose, we compare also the 3rd column, then there will be an extra NA row

test[test==list(3,"c", 5),]
#     col1 col2 col3
#3       3    c    6
#4       3    c    7
#NA     NA <NA>   NA
#NA.1   NA <NA>   NA
#NA.2   NA <NA>   NA

Upvotes: 4

Related Questions