Reputation: 389
I have a data frame containing a set of short texts, and a vector containing a list of keywords. I want to add a new column that holds, for each text, the keywords matched in that text.
The code below creates a demo version of my data frame.
id <- c(1,2,4,5,6,7)
full_text <- c("I like banana", "I ate an apple", "I prefer bananas and apples", "Grapes", "My applepie is tasty", "Fruitsalad")
df <- data.frame(id = id,full_text = full_text)
This gives the following data frame:
  id                   full_text
1  1               I like banana
2  2              I ate an apple
3  4 I prefer bananas and apples
4  5                      Grapes
5  6        My applepie is tasty
6  7                  Fruitsalad
I then have a vector containing some words. See below:
keywords <- c("banana", "apple", "grape")
In practical terms, I want to identify the observations that have one or more keywords in their df$full_text. If df$full_text contains one or more of the words, I want to add those keywords to a new column called key_word. This should give a data frame similar to the one below:
  id                   full_text      key_word
1  1               I like banana        banana
2  2              I ate an apple         apple
3  4 I prefer bananas and apples banana, apple
4  5                      Grapes         grape
5  6        My applepie is tasty         apple
6  7                  Fruitsalad
My initial strategy was to use ifelse with grepl, but I couldn't get it to work.
Upvotes: 2
Views: 691
Reputation: 1972
Using dplyr and stringr you can do as follows.
library(dplyr)
library(stringr)
as_tibble(df) %>%
  mutate(full_text = tolower(full_text),
         match = str_c(keywords, collapse = '|'),
         key_word = str_extract_all(full_text, match)) %>%
  select(-match)
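Note that str_extract_all returns a list, so the pipeline above leaves key_word as a list column and also lowercases full_text. The following is a minimal sketch, assuming a single comma-separated string per row is wanted (as in the question's expected output) and that full_text should keep its original casing. The names pattern and result are introduced here for illustration only.

```r
library(dplyr)
library(stringr)

# Demo data from the question
df <- data.frame(
  id = c(1, 2, 4, 5, 6, 7),
  full_text = c("I like banana", "I ate an apple", "I prefer bananas and apples",
                "Grapes", "My applepie is tasty", "Fruitsalad")
)
keywords <- c("banana", "apple", "grape")

# One regex that matches any of the keywords
pattern <- str_c(keywords, collapse = "|")

result <- df %>%
  mutate(
    # Lowercase only for matching, then paste each list element into one string;
    # rows without a match become an empty string
    key_word = vapply(
      str_extract_all(tolower(full_text), pattern),
      paste, character(1), collapse = ", "
    )
  )
```

Here result$key_word is an ordinary character vector, so it prints and exports like any other text column.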
Upvotes: 1
Reputation: 5673
Using stringr and str_extract_all you could do:
library(stringr)  # also provides the %>% pipe
df$key_word <- str_extract_all(tolower(df$full_text), paste(keywords, collapse = "|")) %>%
  lapply(function(x) paste(x, collapse = ", ")) %>%
  unlist()
paste(keywords, collapse = "|") builds a regex that means "find any word of my vector": in a regex, | means "or".
paste(keywords,collapse = "|")
[1] "banana|apple|grape"
str_extract_all gives you back a list with the matches it finds for each element of your vector:
str_extract_all(tolower(df$full_text),paste(keywords,collapse = "|"))
[[1]]
[1] "banana"
[[2]]
[1] "apple"
[[3]]
[1] "banana" "apple"
[[4]]
[1] "grape"
[[5]]
[1] "apple"
[[6]]
character(0)
So if you collapse them together with function(x) paste(x, collapse = ", ") and unlist the list, you obtain what you wanted. I added the tolower because you want grape to match Grapes.
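For comparison, the grepl idea from the question can also be made to work in base R. This is a rough sketch, not part of the approach above: it tests each keyword against every text and pastes together the keywords that hit. The name hits is introduced here for illustration. Two differences are worth knowing: a keyword that occurs several times in one text is listed only once, and keywords appear in the order of the keywords vector, not in the order they occur in the text.

```r
# Demo data from the question
df <- data.frame(
  id = c(1, 2, 4, 5, 6, 7),
  full_text = c("I like banana", "I ate an apple", "I prefer bananas and apples",
                "Grapes", "My applepie is tasty", "Fruitsalad")
)
keywords <- c("banana", "apple", "grape")

# Logical matrix: one row per text, one column per keyword.
# fixed = TRUE treats each keyword as a literal string, not a regex.
hits <- sapply(keywords, grepl, x = tolower(df$full_text), fixed = TRUE)

# For each row, keep the keywords whose column is TRUE and join them
df$key_word <- apply(hits, 1, function(row) paste(keywords[row], collapse = ", "))
```

Because matching is done per keyword with fixed = TRUE, this version does not need the keywords to be regex-safe.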
Upvotes: 4