emv7
emv7

Reputation: 43

case_when() issue with evaluating multiple conditions

I am trying to detect if specific keywords and phrases are present in a string, and if they are I want to post a specific number in a new column. My issue is that some strings have multiple keywords, but case_when only returns the first match. Is there way to fix this, or an alternative to case_when that I should be using?

ID<-c(1,2,3,4,5)
fruits<-c("banana apple orange", "apple orange", "orange", "orange apple", "nothing")
df<-data_frame(ID,fruits)
#I need to assign a random number to each fruit type

df %>% 
  mutate("Fruit Type"=case_when(
    grepl("banana",fruits)~34,
    grepl("apple",fruits)~45,
    grepl("orange",fruits)~88,
))

ID       fruits                  Fruit Type
1      banana apple orange           34
2      apple orange                  45
3      orange                        88
4      orange apple                  45
5      nothing                       NA

I expected it to come out like this.

ID        fruits       fruit_type         
1   banana apple orange    34       
1   banana apple orange    45       
1   banana apple orange    88       
2   apple orange           45       
2   apple orange           88       
3   orange                 88       
4   orange apple           88       
4   orange apple           45       
5   nothing                NA

Additionally, is there a way to convert this to long format in order to make it appear more like this?

ID        fruits       fruit_type  fruit_type2  fruit_type3     
1   banana apple orange    34        45              88                                     
2   apple orange           45        88              NA                                     
3   orange                 88        NA              NA                             
4   orange apple           88        45              NA                                                     
5   nothing                NA        NA              NA         

Upvotes: 4

Views: 2172

Answers (5)

Uwe
Uwe

Reputation: 42544

Here is a completely different, database-like approach which uses a lookup table of fruit and fruit types . This approach can handle an arbitrary number of fruits and fruit types.

# create or read lookup table
lut <- readr::read_table(
"fruit    fruit_type
banana           34
apple            45
orange           88")

library(dplyr)
library(tidyr)
df %>% 
  mutate(fruit = fruits) %>% 
  separate_rows(fruit, sep = "\\s+") %>% 
  left_join(lut, by = "fruit") %>% 
  group_by(ID) %>% 
  mutate(rowid = row_number(ID)) %>% 
  pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type, 
              names_prefix = "fruit_type", names_from = rowid)
     ID fruits              fruit_type1 fruit_type2 fruit_type3
  <dbl> <chr>                     <dbl>       <dbl>       <dbl>
1     1 banana apple orange          34          45          88
2     2 apple orange                 45          88          NA
3     3 orange                       88          NA          NA
4     4 orange apple                 88          45          NA
5     5 nothing                      NA          NA          NA

The fruits column is copied and then split. Now, column fruit contains a single fruit on separate rows. These are joined with the lookup table lut to get the matching fruit_type value. Before this result can be reshaped to wide format, the new columns need to be numbered. This is achieved by numbering the rows within each ID.

Edit:

According to OP's comment the production dataset contains paragraphs where the keywords aren't separated by white space but also by punctuation marks like commas or appear in their plural form with trailing s. In addition, the keywords may be written in upper case or may appear multiple times in a paragraph.

Instead of separating all words we can try to extract the keywords from the paragraphs. This can be achieved by combining all keywords into one regular expression with alternation |. So, the regular expression banana|apple|orange will match either of the fruits.

For testing we need a more complex use case:

df <- tibble(fruits = readr::read_lines(
"There are bananas, oranges, and also apples here
One Orange and another orange make two Oranges 
apples and pineapples go together
But pineapples alone must not be counted
banana apple orange
apple orange
orange
orange apple
nothing")
) %>% 
  mutate(ID = row_number())

With the modified code

df %>% 
  mutate(fruit = fruits %>% 
           tolower() %>% 
           stringr::str_extract_all(paste(lut$fruit, collapse = "|")) %>% 
           lapply(unique)) %>% 
  unnest(fruit, keep_empty = TRUE) %>% 
  left_join(lut, by = "fruit") %>% 
  group_by(ID) %>% 
  mutate(rowid = row_number(ID)) %>% 
  pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type, 
              names_prefix = "fruit_type", names_from = rowid)

we get

     ID fruits                                             fruit_type1 fruit_type2 fruit_type3
  <int> <chr>                                                    <dbl>       <dbl>       <dbl>
1     1 "There are bananas, oranges, and also apples here"          34          88          45
2     2 "One Orange and another Orange make two Oranges "           88          NA          NA
3     3 "apples and pineapples go together"                         45          NA          NA
4     4 "But pineapples alone must not be counted"                  45          NA          NA
5     5 "banana apple orange"                                       34          45          88
6     6 "apple orange"                                              45          88          NA
7     7 "orange"                                                    88          NA          NA
8     8 "orange apple"                                              88          45          NA
9     9 "nothing"                                                   NA          NA          NA

This approach has detected keywords in plural form and independent of upper/lower case.

Note that I have deliberately chosen to count multiple occurrences of a keyword in a paragraph only once by lapply(unique). If each occurrence is to be counted separately then just remove that line of code.

However, there is one (at least) drawback of this approach: The word pineapple is counted as apple because it contains apple as substring.

Upvotes: 1

TarJae
TarJae

Reputation: 78917

Here is another solution with separate, str_detect and across:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  separate(fruits, c("fruit_type1", "fruit_type2", "fruit_type3"), remove = FALSE) %>% 
  mutate(across(contains("fruit_type"), ~case_when(
    str_detect(., "banana") ~ 34,
    str_detect(., "apple") ~ 45,
    str_detect(., "orange") ~ 88)
  ))

Output:

     ID fruits              fruit_type1 fruit_type2 fruit_type3
  <dbl> <chr>                     <dbl>       <dbl>       <dbl>
1     1 banana apple orange          34          45          88
2     2 apple orange                 45          88          NA
3     3 orange                       88          NA          NA
4     4 orange apple                 88          45          NA
5     5 nothing                      NA          NA          NA

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388807

You can first bring different fruits in separate rows and then use case_when -

library(dplyr)
library(tidyr)

res <- df %>%
  separate_rows(fruits, sep = '\\s+') %>%
  mutate(Fruit_Type =case_when(
    grepl("banana",fruits)~34,
    grepl("apple",fruits)~45,
    grepl("orange",fruits)~88,
  ))
  
res

#     ID fruits  Fruit_Type
#  <dbl> <chr>        <dbl>
#1     1 banana          34
#2     1 apple           45
#3     1 orange          88
#4     2 apple           45
#5     2 orange          88
#6     3 orange          88
#7     4 orange          88
#8     4 apple           45
#9     5 nothing         NA

To get the data in wide format you can do -

res %>%
  group_by(ID) %>%
  mutate(row = paste0('Fruit', row_number()), 
         fruits = paste0(fruits, collapse = ' ')) %>%
  ungroup %>%
  pivot_wider(names_from = row, values_from = Fruit_Type)

#    ID fruits              Fruit1 Fruit2 Fruit3
#  <dbl> <chr>                <dbl>  <dbl>  <dbl>
#1     1 banana apple orange     34     45     88
#2     2 apple orange            45     88     NA
#3     3 orange                  88     NA     NA
#4     4 orange apple            88     45     NA
#5     5 nothing                 NA     NA     NA

Upvotes: 2

Eric Leung
Eric Leung

Reputation: 2594

You can make use of some useful functions in {tidyr}, like pivot_longer() once you have a wide data frame. (There is also a function pivot_wider() to do the opposite.)

The solution below creates a wider data frame first, and then narrows it to a longer data frame. So it generates the data frames in the opposite order that you've listed.

library(dplyr)
library(tidyr)

ID <- c(1, 2, 3, 4, 5)
fruits <- c("banana apple orange",
            "apple orange",
            "orange",
            "orange apple",
            "nothing")
df <- tibble(ID, fruits)

new_df <- 
  df %>%
  mutate(fruit_type = if_else(grepl("banana", fruits), 34, NA_real_),
         fruit_type2 = if_else(grepl("apple", fruits), 45, NA_real_),
         fruit_type3 = if_else(grepl("orange", fruits), 88, NA_real_))
new_df
#> # A tibble: 5 x 5
#>      ID fruits              fruit_type fruit_type2 fruit_type3
#>   <dbl> <chr>                    <dbl>       <dbl>       <dbl>
#> 1     1 banana apple orange         34          45          88
#> 2     2 apple orange                NA          45          88
#> 3     3 orange                      NA          NA          88
#> 4     4 orange apple                NA          45          88
#> 5     5 nothing                     NA          NA          NA

long_df <-
  new_df %>%
  pivot_longer(cols = starts_with("fruit_type"), names_to = "fruit_type") %>%
  select(-fruit_type) %>%
  rename(fruit_type = value) %>%
  distinct() %>%  # Remove duplicates
  group_by(ID, fruits) %>%
  mutate(n = n()) %>%
  filter(!is.na(fruit_type) | n == 1) %>%
  select(-n)
long_df
#> # A tibble: 9 x 3
#> # Groups:   ID, fruits [5]
#>      ID fruits              fruit_type
#>   <dbl> <chr>                    <dbl>
#> 1     1 banana apple orange         34
#> 2     1 banana apple orange         45
#> 3     1 banana apple orange         88
#> 4     2 apple orange                45
#> 5     2 apple orange                88
#> 6     3 orange                      88
#> 7     4 orange apple                45
#> 8     4 orange apple                88
#> 9     5 nothing                     NA

Created on 2021-07-23 by the reprex package (v2.0.0)

Upvotes: 0

akrun
akrun

Reputation: 886938

We get the first output by using str_replace_all to change the string values with a named vector, then split the 'fruit_type' column at whitespace to expand the data (separate_rows), and change the type of 'fruit_type' to numeric class

library(dplyr)
library(tidyr)
library(stringr)
out <- df %>% 
    mutate(fruit_type = str_replace_all(fruits, 
     setNames(as.character(c(34, 45, 88)), c("banana", "apple", "orange")))) %>% 
    separate_rows(fruit_type) %>%
    mutate(fruit_type = as.numeric(fruit_type))

-output

out
# A tibble: 9 x 3
     ID fruits              fruit_type
  <dbl> <chr>                    <dbl>
1     1 banana apple orange         34
2     1 banana apple orange         45
3     1 banana apple orange         88
4     2 apple orange                45
5     2 apple orange                88
6     3 orange                      88
7     4 orange apple                88
8     4 orange apple                45
9     5 nothing                     NA

With this output, we can reshape to 'wide' format with pivot_wider

library(data.table)
out %>% 
    mutate(rn = str_c('fruit_type', rowid(ID))) %>% 
    pivot_wider(names_from = rn, values_from = fruit_type)

-output

# A tibble: 5 x 5
     ID fruits              fruit_type1 fruit_type2 fruit_type3
  <dbl> <chr>                     <dbl>       <dbl>       <dbl>
1     1 banana apple orange          34          45          88
2     2 apple orange                 45          88          NA
3     3 orange                       88          NA          NA
4     4 orange apple                 88          45          NA
5     5 nothing                      NA          NA          NA

Upvotes: 2

Related Questions