Reputation: 385
Data structure:
library(tidyverse)
df <- tribble(
~"group", ~"word",
1,"apple",
1,"orange",
1,"apple cider",
1,"orange juice",
1,"pear",
1,"pear",
2,"apple",
2,"pear",
3,"orange juice",
3,"apple",
4,"pear",
4,"guava"
)
I would like to filter with str_detect of the words "apple" AND "orange" on column "word". Return only observations if the "group" contains both words "apple" and "orange".
Desired output:
# A tibble: 6 x 2
group word
<dbl> <chr>
1 1 apple
2 1 orange
3 1 apple cider
4 1 orange juice
5 3 orange juice
6 3 apple
Thank you much!
Upvotes: 1
Views: 122
Reputation: 886938
An option with str_extract
and n_distinct
.
library(dplyr)
library(stringr)
df %>%
group_by(group) %>%
filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1) &
str_detect(word, 'apple|orange'))
# A tibble: 6 x 2
# Groups: group [2]
# group word
# <dbl> <chr>
#1 1 apple
#2 1 orange
#3 1 apple cider
#4 1 orange juice
#5 3 orange juice
#6 3 apple
Explanation
After grouping by 'group', we extract all the 'word' that are either 'apple' or 'orange' with str_extract_all
(the output will be a list
by default), unlist
the list
, and count the number of distinct elements (n_distinct
), check if it is greater than 1 as one condition which is joined with another condition that checks whether the 'word' column includes 'apple' or 'orange' (str_detect
). Basically, it will only get those groups having both the cases and remove any other element in the process as well i.e. if we use only the first expression
df %>%
group_by(group) %>%
filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1))
# A tibble: 8 x 2
# Groups: group [2]
# group word
# <dbl> <chr>
#1 1 apple
#2 1 orange
#3 1 apple cider
#4 1 orange juice
#5 1 pear # // not needed, but it was kept
#6 1 pear # // because it is checking on distinct element
#7 3 orange juice
#8 3 apple
With second expression alone
df %>%
group_by(group) %>% filter(str_detect(word, 'apple|orange'))
# A tibble: 7 x 2
# Groups: group [3]
# group word
# <dbl> <chr>
#1 1 apple
#2 1 orange
#3 1 apple cider
#4 1 orange juice
#5 2 apple # // also keeps group 2 that includes only apple
#6 3 orange juice
#7 3 apple
By doing the &
, the group 2 will be removed as well as the elements such as 'pear' in 'word' column
Upvotes: 1
Reputation: 101044
Here is a base R option using subset
+ ave
subset(
df,
as.logical(
ave(
word,
group,
FUN = function(x) any(grepl("apple", x)) & any(grepl("orange", x)) & grepl("apple|orange", x)
)
)
)
which gives
# A tibble: 6 x 2
group word
<dbl> <chr>
1 1 apple
2 1 orange
3 1 apple cider
4 1 orange juice
5 3 orange juice
6 3 apple
Upvotes: 0
Reputation: 133
the easiest way:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(word, 'apple|orange'))
Upvotes: 1
Reputation: 388807
We can filter those groups which have at least one value of "apple"
as well as "orange"
and only those rows in that group which has either "apple"
or "orange"
in word
.
library(dplyr)
library(stringr)
df %>%
group_by(group) %>%
filter(any(str_detect(word, 'apple')) & any(str_detect(word, 'orange')) &
str_detect(word, 'orange|apple'))
# group word
# <dbl> <chr>
#1 1 apple
#2 1 orange
#3 1 apple cider
#4 1 orange juice
#7 3 orange juice
#8 3 apple
Upvotes: 1