Reputation: 5424
I'm trying to combine dplyr and stringr to detect multiple patterns in a data frame. I want to use dplyr because I need to test several different columns.
Here's some sample data:
test.data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"))
fruit <- c("Apple", "Orange", "Pear")
test.data
item
1 Apple
2 Bear
3 Orange
4 Pear
5 Two Apples
What I would like to use is something like:
test.data <- test.data %>% mutate(is.fruit = str_detect(item, fruit))
and receive
item is.fruit
1 Apple 1
2 Bear 0
3 Orange 1
4 Pear 1
5 Two Apples 1
A very simple test works:
> str_detect("Apple", fruit)
[1] TRUE FALSE FALSE
> str_detect("Bear", fruit)
[1] FALSE FALSE FALSE
But I can't get this to work over the column of the dataframe, even without dplyr:
> test.data$is.fruit <- str_detect(test.data$item, fruit)
Error in check_pattern(pattern, string) :
Lengths of string and pattern not compatible
Does anyone know how to do this?
Upvotes: 16
Views: 26847
Reputation: 1148
Alternatively, if you only want to keep the rows that contain one of those strings (the fruits, in your case), you can filter on a single regex:
test.data %>%
  filter(str_detect(item, "Apple|Orange|Pear"))
The output will be
item
Apple
Orange
Pear
Two Apples
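If the list of fruits changes, the pattern need not be typed out by hand. The following is a minimal sketch that builds the same regex from the `fruit` vector in the question; the names `fruit_pattern` and `fruit_rows` are only illustrative, and `stringsAsFactors = FALSE` is set so that `item` is a character column on older versions of R.

```r
library(dplyr)
library(stringr)

test.data <- data.frame(
  item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  stringsAsFactors = FALSE
)
fruit <- c("Apple", "Orange", "Pear")

# Build "Apple|Orange|Pear" from the vector instead of hard-coding it
fruit_pattern <- str_c(fruit, collapse = "|")

# Keep only the rows whose item matches at least one fruit
fruit_rows <- test.data %>%
  filter(str_detect(item, fruit_pattern))

fruit_rows
```

Each entry of `fruit` is treated as a regular expression here, so entries containing characters such as `.` or `(` would need escaping first.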
Upvotes: 0
Reputation: 2210
The map functions from purrr make this convenient to use in a pipe and let you control the output type: map_int returns an integer vector, map_lgl returns a logical vector.
library(purrr)
test.data %>%
  mutate(is.fruit = map_int(item, ~any(str_detect(., fruit))))
item is.fruit
1 Apple 1
2 Bear 0
3 Orange 1
4 Pear 1
5 Two Apples 1
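For comparison, here is a sketch of the `map_lgl` variant mentioned above, which yields TRUE/FALSE instead of 1/0. It assumes `item` is a character column (hence `stringsAsFactors = FALSE`); the name `with_flag` is only illustrative.

```r
library(dplyr)
library(purrr)
library(stringr)

test.data <- data.frame(
  item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  stringsAsFactors = FALSE
)
fruit <- c("Apple", "Orange", "Pear")

# For each item, test it against every fruit and keep a single logical:
# TRUE if at least one pattern matched
with_flag <- test.data %>%
  mutate(is.fruit = map_lgl(item, ~ any(str_detect(.x, fruit))))

with_flag
```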
Upvotes: 0
Reputation: 1151
This simple approach works fine for EXACT matches:
test.data %>% mutate(is.fruit = item %in% fruit)
# A tibble: 5 x 2
item is.fruit
<chr> <lgl>
1 Apple TRUE
2 Bear FALSE
3 Orange TRUE
4 Pear TRUE
5 Two Apples FALSE
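This approach works for partial matching (which is what the question asks for). Note that sum() returns the number of patterns matched, so the value can exceed 1 if an item contains more than one of them: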
test.data %>%
  rowwise() %>%
  mutate(is.fruit = sum(str_detect(item, fruit)))
Source: local data frame [5 x 2]
Groups: <by row>
# A tibble: 5 x 2
item is.fruit
<chr> <int>
1 Apple 1
2 Bear 0
3 Orange 1
4 Pear 1
5 Two Apples 1
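To illustrate the difference between counting matches and flagging them, here is a small sketch using a hypothetical data frame `basket` (not from the question) in which one item contains two fruits. `sum()` gives the count, while `any()` gives a plain yes/no flag.

```r
library(dplyr)
library(stringr)

fruit <- c("Apple", "Orange", "Pear")

# Hypothetical data: the last item contains two of the patterns
basket <- data.frame(
  item = c("Apple", "Bear", "Apple and Pear"),
  stringsAsFactors = FALSE
)

result <- basket %>%
  rowwise() %>%
  mutate(
    n.fruit  = sum(str_detect(item, fruit)),  # number of patterns matched
    is.fruit = any(str_detect(item, fruit))   # TRUE if any pattern matched
  ) %>%
  ungroup()  # drop the row-wise grouping once it is no longer needed

result
```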
Upvotes: 14
Reputation: 9344
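str_detect is vectorised over both string and pattern, so the pattern has to be length 1 or the same length as the string vector; here they have lengths 3 and 5, hence the error. Either collapse the patterns into one regex using paste(..., collapse = '|'), or test each item against every pattern and combine the results with any: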
sapply(test.data$item, function(x) any(sapply(fruit, str_detect, string = x)))
# Apple Bear Orange Pear Two Apples
# TRUE FALSE TRUE TRUE TRUE
str_detect(test.data$item, paste(fruit, collapse = '|'))
# [1] TRUE FALSE TRUE TRUE TRUE
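Since the question mentions testing several columns, the collapsed pattern can also be applied inside `mutate()` with `across()`. This is only a sketch: it assumes dplyr 1.0.0 or later (for `across()`), and the `other` column is made up for illustration.

```r
library(dplyr)
library(stringr)

# `other` is a hypothetical second column, not part of the original data
test.data <- data.frame(
  item  = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  other = c("Chair", "Pear tree", "Desk", "Lamp", "Orange juice"),
  stringsAsFactors = FALSE
)
fruit <- c("Apple", "Orange", "Pear")

# One regex that matches any of the fruits
fruit_pattern <- paste(fruit, collapse = "|")

# Add a 0/1 flag for each tested column, named <column>.is.fruit
flagged <- test.data %>%
  mutate(across(
    c(item, other),
    ~ as.integer(str_detect(.x, fruit_pattern)),
    .names = "{.col}.is.fruit"
  ))

flagged
```

Wrapping the result in `as.integer()` gives the 1/0 coding shown in the question; dropping it leaves the flags as TRUE/FALSE.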
Upvotes: 27