Reputation: 172
I'm trying to use str_detect and case_when to recode strings based on multiple patterns, and paste each occurrence of the recoded value(s) into a new column. The Correct column is the output I'm trying to achieve.
This is similar to this question and this question. If it can't be done with case_when (which I think is limited to one pattern per row), is there a better way to achieve this while still using the tidyverse?
library(dplyr)
library(stringr)

Fruit=c("Apples","apples, maybe bananas","Oranges","grapes w apples","pears")
Num=c(1,2,3,4,5)
data=data.frame(Num,Fruit)
df= data %>% mutate(Incorrect=
  paste(case_when(
    str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes | oranges", ignore_case=TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
    TRUE ~ "other"
  ),sep=","))
Num Fruit                  Incorrect
1   Apples                 good
2   apples, maybe bananas  good
3   Oranges                other
4   grapes w apples        good
5   pears                  other

Num Fruit                  Correct
1   Apples                 good
2   apples, maybe bananas  good,gross
3   Oranges                ok
4   grapes w apples        ok,good
5   pears                  other
Upvotes: 3
Views: 4975
Reputation: 388862
case_when stops at the first condition that is satisfied for a row and does not check any further conditions, so each row can only ever receive one value. In such cases it is usually better to put every entry in a separate row, which makes it easier to assign a value to each one, and then summarise them back together. Here, however, the Fruit column does not have a clear separator: some fruits are separated by a comma (,), some by whitespace, and there are additional words between them. To handle all of these cases, we assign NA to the words that do not match and then remove them when summarising.
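The first-match behaviour is easy to see in isolation. The following is a minimal sketch on a hypothetical vector x (not part of the question's code), using only two of the patterns:

```r
library(dplyr)
library(stringr)

# Hypothetical input, mirroring the first three rows of the question's data
x <- c("Apples", "apples, maybe bananas", "Oranges")

# Each element receives the value of the FIRST condition that is TRUE;
# the "bananas" condition is never used for the second element because
# "apples" has already matched it.
labels <- case_when(
  str_detect(x, regex("apples", ignore_case = TRUE))  ~ "good",
  str_detect(x, regex("bananas", ignore_case = TRUE)) ~ "gross",
  TRUE                                                ~ "other"
)

labels
# "good" "good" "other"
```

Since every element gets exactly one label, a string containing two fruits has to be split into separate pieces before each piece can be labelled on its own.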
library(dplyr)
library(stringr)

data %>%
  # one row per word: split on commas and on runs of whitespace
  tidyr::separate_rows(Fruit, sep = ",|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
    TRUE ~ NA_character_)) %>%   # non-matching words become NA
  group_by(Num) %>%
  # drop the NAs and collapse the remaining labels into one string per Num
  summarise(Correct = toString(na.omit(Correct))) %>%
  # bring back the original, unsplit Fruit column
  left_join(data, by = "Num")
#    Num Correct     Fruit
#  <dbl> <chr>       <fct>
#1     1 good        Apples
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges
#4     4 ok, good    grapes w apples
#5     5 sour        Lemons
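To see why the NA step is needed, it can help to look at what separate_rows produces before case_when runs. This is a minimal sketch on a hypothetical one-row data frame toy that mirrors row 2 of the question's data:

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row example (not the full data from the question)
toy <- data.frame(Num = 2, Fruit = "apples, maybe bananas",
                  stringsAsFactors = FALSE)

# Split on commas and on runs of whitespace: one row per piece,
# with Num repeated so the pieces can be grouped back together later
pieces <- toy %>% separate_rows(Fruit, sep = ",|\\s+")

pieces$Fruit
```

Besides "apples" and "bananas", the pieces include the filler word "maybe" and may include an empty string where the comma is followed by a space. None of these extra pieces match a pattern, so they fall through to NA and are dropped by na.omit.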
For the updated data, we can first remove the extra words that occur (maybe and w) and then label anything that still does not match as "other":
data %>%
  # remove the filler words; \\b keeps "w" from being stripped out of other words
  mutate(Fruit = gsub("\\b(maybe|w)\\b", "", Fruit)) %>%
  tidyr::separate_rows(Fruit, sep = ",\\s+|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case=TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case=TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case=TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case=TRUE)) ~ "sour",
    TRUE ~ "other")) %>%
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Correct))) %>%
  # bring back the original, unmodified Fruit column
  left_join(data, by = "Num")
#    Num Correct     Fruit
#  <dbl> <chr>       <fct>
#1     1 good        Apples
#2     2 good, gross apples, maybe bananas
#3     3 ok          Oranges
#4     4 ok, good    grapes w apples
#5     5 other       pears
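If the filler words are not known in advance, one possible variation (a sketch, not something tested beyond this sample data) is to keep the NA approach and only fall back to "other" when a row ends up with no labels at all. The intermediate column name Label is an arbitrary choice here:

```r
library(dplyr)
library(stringr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  Num = c(1, 2, 3, 4, 5),
  Fruit = c("Apples", "apples, maybe bananas", "Oranges",
            "grapes w apples", "pears"),
  stringsAsFactors = FALSE
)

result <- data %>%
  # split on any run of commas and/or whitespace
  separate_rows(Fruit, sep = "[,\\s]+") %>%
  mutate(Label = case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE))         ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE))        ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE))          ~ "sour",
    TRUE ~ NA_character_)) %>%       # filler words and unknown fruits become NA
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Label))) %>%
  # a row with no matching piece collapses to "", which is relabelled "other"
  mutate(Correct = if_else(Correct == "", "other", Correct)) %>%
  left_join(data, by = "Num")

result$Correct
# "good" "good, gross" "ok" "ok, good" "other"
```

toString separates the labels with ", ". To get "good,gross" exactly as in the question, paste(na.omit(Label), collapse = ",") could be used in its place.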
Upvotes: 6