KKW
KKW

Reputation: 385

How to filter with different conditions in different rows?

Data structure:

library(tidyverse)

df <- tribble(
  ~"group", ~"word",
  1,"apple",
  1,"orange",
  1,"apple cider",
  1,"orange juice",
  1,"pear",
  1,"pear",
  2,"apple",
  2,"pear",
  3,"orange juice",
  3,"apple",
  4,"pear",
  4,"guava"
  )

I would like to filter with str_detect of the words "apple" AND "orange" on column "word". Return only observations if the "group" contains both words "apple" and "orange".

Desired output:

# A tibble: 6 x 2
  group word        
  <dbl> <chr>       
1     1 apple       
2     1 orange      
3     1 apple cider 
4     1 orange juice
5     3 orange juice
6     3 apple       

Thank you much!

Upvotes: 1

Views: 122

Answers (4)

akrun
akrun

Reputation: 886938

An option with str_extract and n_distinct.

library(dplyr)
library(stringr)
df %>% 
    group_by(group) %>% 
    filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1) &
          str_detect(word, 'apple|orange'))
# A tibble: 6 x 2
# Groups:   group [2]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     3 orange juice
#6     3 apple       

Explanation

After grouping by 'group', we extract all the 'word' that are either 'apple' or 'orange' with str_extract_all (the output will be a list by default), unlist the list, and count the number of distinct elements (n_distinct), check if it is greater than 1 as one condition which is joined with another condition that checks whether the 'word' column includes 'apple' or 'orange' (str_detect). Basically, it will only get those groups having both the cases and remove any other element in the process as well i.e. if we use only the first expression

df %>% 
     group_by(group) %>% 
     filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1))
# A tibble: 8 x 2
# Groups:   group [2]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     1 pear      # // not needed, but it was kept 
#6     1 pear      # // because it is checking on distinct element  
#7     3 orange juice
#8     3 apple     

With second expression alone

df %>% 
     group_by(group) %>% filter(str_detect(word, 'apple|orange'))
# A tibble: 7 x 2
# Groups:   group [3]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     2 apple    # // also keeps group 2 that includes only apple    
#6     3 orange juice
#7     3 apple      

By doing the &, the group 2 will be removed as well as the elements such as 'pear' in 'word' column

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101044

Here is a base R option using subset + ave

subset(
  df,
  as.logical(
    ave(
      word,
      group,
      FUN = function(x) any(grepl("apple", x)) & any(grepl("orange", x)) & grepl("apple|orange", x)
    )
  )
)

which gives

# A tibble: 6 x 2
  group word
  <dbl> <chr>
1     1 apple
2     1 orange
3     1 apple cider
4     1 orange juice
5     3 orange juice
6     3 apple

Upvotes: 0

Fatih Ekici
Fatih Ekici

Reputation: 133

the easiest way:

library(dplyr)
library(stringr)
df %>%
  filter(str_detect(word, 'apple|orange'))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388807

We can filter those groups which have at least one value of "apple" as well as "orange" and only those rows in that group which has either "apple" or "orange" in word.

library(dplyr)
library(stringr)

df %>%
  group_by(group) %>%
  filter(any(str_detect(word, 'apple')) & any(str_detect(word, 'orange')) & 
            str_detect(word, 'orange|apple'))

# group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice     
#7     3 orange juice
#8     3 apple       

Upvotes: 1

Related Questions