Reputation: 43
I am trying to detect if specific keywords and phrases are present in a string, and if they are I want to post a specific number in a new column. My issue is that some strings have multiple keywords, but case_when only returns the first match. Is there way to fix this, or an alternative to case_when that I should be using?
ID<-c(1,2,3,4,5)
fruits<-c("banana apple orange", "apple orange", "orange", "orange apple", "nothing")
df<-data_frame(ID,fruits)
#I need to assign a random number to each fruit type
df %>%
mutate("Fruit Type"=case_when(
grepl("banana",fruits)~34,
grepl("apple",fruits)~45,
grepl("orange",fruits)~88,
))
ID fruits Fruit Type
1 banana apple orange 34
2 apple orange 45
3 orange 88
4 orange apple 45
5 nothing NA
I expected it to come out like this.
ID fruits fruit_type
1 banana apple orange 34
1 banana apple orange 45
1 banana apple orange 88
2 apple orange 45
2 apple orange 88
3 orange 88
4 orange apple 88
4 orange apple 45
5 nothing NA
Additionally, is there a way to convert this to long format in order to make it appear more like this?
ID fruits fruit_type fruit_type2 fruit_type3
1 banana apple orange 34 45 88
2 apple orange 45 88 NA
3 orange 88 NA NA
4 orange apple 88 45 NA
5 nothing NA NA NA
Upvotes: 4
Views: 2172
Reputation: 42544
Here is a completely different, database-like approach which uses a lookup table of fruit and fruit types . This approach can handle an arbitrary number of fruits and fruit types.
# create or read lookup table
lut <- readr::read_table(
"fruit fruit_type
banana 34
apple 45
orange 88")
library(dplyr)
library(tidyr)
df %>%
mutate(fruit = fruits) %>%
separate_rows(fruit, sep = "\\s+") %>%
left_join(lut, by = "fruit") %>%
group_by(ID) %>%
mutate(rowid = row_number(ID)) %>%
pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
names_prefix = "fruit_type", names_from = rowid)
ID fruits fruit_type1 fruit_type2 fruit_type3 <dbl> <chr> <dbl> <dbl> <dbl> 1 1 banana apple orange 34 45 88 2 2 apple orange 45 88 NA 3 3 orange 88 NA NA 4 4 orange apple 88 45 NA 5 5 nothing NA NA NA
The fruits
column is copied and then split. Now, column fruit
contains a single fruit on separate rows. These are joined with the lookup table lut
to get the matching fruit_type
value. Before this result can be reshaped to wide format, the new columns need to be numbered. This is achieved by numbering the rows within each ID
.
According to OP's comment the production dataset contains paragraphs where the keywords aren't separated by white space but also by punctuation marks like commas or appear in their plural form with trailing s. In addition, the keywords may be written in upper case or may appear multiple times in a paragraph.
Instead of separating all words we can try to extract the keywords from the paragraphs. This can be achieved by combining all keywords into one regular expression with alternation |
. So, the regular expression banana|apple|orange
will match either of the fruits.
For testing we need a more complex use case:
df <- tibble(fruits = readr::read_lines(
"There are bananas, oranges, and also apples here
One Orange and another orange make two Oranges
apples and pineapples go together
But pineapples alone must not be counted
banana apple orange
apple orange
orange
orange apple
nothing")
) %>%
mutate(ID = row_number())
With the modified code
df %>%
mutate(fruit = fruits %>%
tolower() %>%
stringr::str_extract_all(paste(lut$fruit, collapse = "|")) %>%
lapply(unique)) %>%
unnest(fruit, keep_empty = TRUE) %>%
left_join(lut, by = "fruit") %>%
group_by(ID) %>%
mutate(rowid = row_number(ID)) %>%
pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
names_prefix = "fruit_type", names_from = rowid)
we get
ID fruits fruit_type1 fruit_type2 fruit_type3 <int> <chr> <dbl> <dbl> <dbl> 1 1 "There are bananas, oranges, and also apples here" 34 88 45 2 2 "One Orange and another Orange make two Oranges " 88 NA NA 3 3 "apples and pineapples go together" 45 NA NA 4 4 "But pineapples alone must not be counted" 45 NA NA 5 5 "banana apple orange" 34 45 88 6 6 "apple orange" 45 88 NA 7 7 "orange" 88 NA NA 8 8 "orange apple" 88 45 NA 9 9 "nothing" NA NA NA
This approach has detected keywords in plural form and independent of upper/lower case.
Note that I have deliberately chosen to count multiple occurrences of a keyword in a paragraph only once by lapply(unique)
. If each occurrence is to be counted separately then just remove that line of code.
However, there is one (at least) drawback of this approach: The word pineapple
is counted as apple
because it contains apple
as substring.
Upvotes: 1
Reputation: 78917
Here is another solution with separate
, str_detect
and across
:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate(fruits, c("fruit_type1", "fruit_type2", "fruit_type3"), remove = FALSE) %>%
mutate(across(contains("fruit_type"), ~case_when(
str_detect(., "banana") ~ 34,
str_detect(., "apple") ~ 45,
str_detect(., "orange") ~ 88)
))
Output:
ID fruits fruit_type1 fruit_type2 fruit_type3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 banana apple orange 34 45 88
2 2 apple orange 45 88 NA
3 3 orange 88 NA NA
4 4 orange apple 88 45 NA
5 5 nothing NA NA NA
Upvotes: 3
Reputation: 388807
You can first bring different fruits in separate rows and then use case_when
-
library(dplyr)
library(tidyr)
res <- df %>%
separate_rows(fruits, sep = '\\s+') %>%
mutate(Fruit_Type =case_when(
grepl("banana",fruits)~34,
grepl("apple",fruits)~45,
grepl("orange",fruits)~88,
))
res
# ID fruits Fruit_Type
# <dbl> <chr> <dbl>
#1 1 banana 34
#2 1 apple 45
#3 1 orange 88
#4 2 apple 45
#5 2 orange 88
#6 3 orange 88
#7 4 orange 88
#8 4 apple 45
#9 5 nothing NA
To get the data in wide format you can do -
res %>%
group_by(ID) %>%
mutate(row = paste0('Fruit', row_number()),
fruits = paste0(fruits, collapse = ' ')) %>%
ungroup %>%
pivot_wider(names_from = row, values_from = Fruit_Type)
# ID fruits Fruit1 Fruit2 Fruit3
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 banana apple orange 34 45 88
#2 2 apple orange 45 88 NA
#3 3 orange 88 NA NA
#4 4 orange apple 88 45 NA
#5 5 nothing NA NA NA
Upvotes: 2
Reputation: 2594
You can make use of some useful functions in {tidyr}, like pivot_longer()
once you have a wide data frame. (There is also a function pivot_wider()
to do the opposite.)
The solution below creates a wider data frame first, and then narrows it to a longer data frame. So it generates the data frames in the opposite order that you've listed.
library(dplyr)
library(tidyr)
ID <- c(1, 2, 3, 4, 5)
fruits <- c("banana apple orange",
"apple orange",
"orange",
"orange apple",
"nothing")
df <- tibble(ID, fruits)
new_df <-
df %>%
mutate(fruit_type = if_else(grepl("banana", fruits), 34, NA_real_),
fruit_type2 = if_else(grepl("apple", fruits), 45, NA_real_),
fruit_type3 = if_else(grepl("orange", fruits), 88, NA_real_))
new_df
#> # A tibble: 5 x 5
#> ID fruits fruit_type fruit_type2 fruit_type3
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 banana apple orange 34 45 88
#> 2 2 apple orange NA 45 88
#> 3 3 orange NA NA 88
#> 4 4 orange apple NA 45 88
#> 5 5 nothing NA NA NA
long_df <-
new_df %>%
pivot_longer(cols = starts_with("fruit_type"), names_to = "fruit_type") %>%
select(-fruit_type) %>%
rename(fruit_type = value) %>%
distinct() %>% # Remove duplicates
group_by(ID, fruits) %>%
mutate(n = n()) %>%
filter(!is.na(fruit_type) | n == 1) %>%
select(-n)
long_df
#> # A tibble: 9 x 3
#> # Groups: ID, fruits [5]
#> ID fruits fruit_type
#> <dbl> <chr> <dbl>
#> 1 1 banana apple orange 34
#> 2 1 banana apple orange 45
#> 3 1 banana apple orange 88
#> 4 2 apple orange 45
#> 5 2 apple orange 88
#> 6 3 orange 88
#> 7 4 orange apple 45
#> 8 4 orange apple 88
#> 9 5 nothing NA
Created on 2021-07-23 by the reprex package (v2.0.0)
Upvotes: 0
Reputation: 886938
We get the first output by using str_replace_all
to change the string values with a named vector, then split the 'fruit_type' column at whitespace to expand the data (separate_rows
), and change the type of 'fruit_type' to numeric
class
library(dplyr)
library(tidyr)
library(stringr)
out <- df %>%
mutate(fruit_type = str_replace_all(fruits,
setNames(as.character(c(34, 45, 88)), c("banana", "apple", "orange")))) %>%
separate_rows(fruit_type) %>%
mutate(fruit_type = as.numeric(fruit_type))
-output
out
# A tibble: 9 x 3
ID fruits fruit_type
<dbl> <chr> <dbl>
1 1 banana apple orange 34
2 1 banana apple orange 45
3 1 banana apple orange 88
4 2 apple orange 45
5 2 apple orange 88
6 3 orange 88
7 4 orange apple 88
8 4 orange apple 45
9 5 nothing NA
With this output, we can reshape to 'wide' format with pivot_wider
library(data.table)
out %>%
mutate(rn = str_c('fruit_type', rowid(ID))) %>%
pivot_wider(names_from = rn, values_from = fruit_type)
-output
# A tibble: 5 x 5
ID fruits fruit_type1 fruit_type2 fruit_type3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 banana apple orange 34 45 88
2 2 apple orange 45 88 NA
3 3 orange 88 NA NA
4 4 orange apple 88 45 NA
5 5 nothing NA NA NA
Upvotes: 2