ecl

Reputation: 389

Find matching words in a text from a vector

I have a data frame containing a set of short texts, and a vector containing a list of keywords. I want to add a new column that holds, for each text, the keywords that were matched in it.

Here is code to create a demo version of my data frame.

id <- c(1,2,4,5,6,7)
full_text <- c("I like banana", "I ate an apple", "I prefer bananas and apples", "Grapes", "My applepie is tasty", "Fruitsalad")

df <- data.frame(id = id,full_text = full_text)

This gives the following data frame:

  id                   full_text
1  1               I like banana
2  2              I ate an apple
3  4 I prefer bananas and apples
4  5                      Grapes
5  6        My applepie is tasty
6  7                  Fruitsalad

I then have a vector containing some words. See below:

keywords <- c("banana", "apple", "grape")

In practical terms, I want to identify the observations that have one or more keywords in their df$full_text. If df$full_text contains one or more of the words, I want to add those keywords to a new column called key_word. This should give a data frame similar to the one below:

  id                   full_text      key_word
1  1               I like banana        banana
2  2              I ate an apple         apple
3  4 I prefer bananas and apples banana, apple
4  5                      Grapes         grape
5  6        My applepie is tasty         apple
6  7                  Fruitsalad              

My initial strategy was to use ifelse with grepl, but I couldn't get it to work.

Upvotes: 2

Views: 691

Answers (2)

rjen

Reputation: 1972

Using dplyr and stringr you can do as follows.

library(dplyr)
library(stringr)

as_tibble(df) %>%
  mutate(full_text = tolower(full_text),
         match = str_c(keywords, collapse = '|'),
         key_word = str_extract_all(full_text, match)) %>%
  select(-match)
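
Note that key_word in the code above is a list column, and full_text is overwritten with its lowercase version. If you want the original text kept and a single comma-separated string per row, as in the question's expected output, here is a minimal sketch. It rebuilds the demo data so it runs on its own; the names pattern and result are only illustrative.

```r
library(dplyr)
library(stringr)

# Demo data from the question
df <- data.frame(
  id = c(1, 2, 4, 5, 6, 7),
  full_text = c("I like banana", "I ate an apple", "I prefer bananas and apples",
                "Grapes", "My applepie is tasty", "Fruitsalad"),
  stringsAsFactors = FALSE
)
keywords <- c("banana", "apple", "grape")

# Build the "banana|apple|grape" pattern once, outside the data frame
pattern <- str_c(keywords, collapse = "|")

result <- df %>%
  mutate(
    # Match on a lowercased copy so full_text keeps its original case
    key_word = str_extract_all(tolower(full_text), pattern),
    # Collapse each list element into one string; no matches gives ""
    key_word = vapply(key_word, paste, character(1), collapse = ", ")
  )
```

Building the pattern outside mutate also avoids creating and then dropping the temporary match column.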

Upvotes: 1

denis

Reputation: 5673

Using stringr and str_extract_all you could do:

library(stringr)  # also provides the %>% pipe

# Extract all keyword matches per text, then collapse each set into one string
df$keyword <- str_extract_all(tolower(df$full_text), paste(keywords, collapse = "|")) %>%
  lapply(., function(x) paste(x, collapse = ", ")) %>%
  unlist()

paste(keywords, collapse = "|") builds a regex that means "match any word in my vector"; | is the regex "or" operator:

paste(keywords,collapse  = "|")
[1] "banana|apple|grape"

str_extract_all returns a list with one element per entry of your vector, holding all the matches it found in that entry:

str_extract_all(tolower(df$full_text),paste(keywords,collapse  = "|"))
[[1]]
[1] "banana"

[[2]]
[1] "apple"

[[3]]
[1] "banana" "apple" 

[[4]]
[1] "grape"

[[5]]
[1] "apple"

[[6]]
character(0)

So if you collapse each element with function(x) paste(x, collapse = ", ") and unlist the list, you obtain what you wanted. I added tolower because you want "Grapes" to be matched by "grape".
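
Since the question mentions trying grepl, here is a base R sketch of that route, without stringr. The helper name match_keywords is hypothetical, and it assumes the keywords should be treated as literal lowercase strings (fixed = TRUE) rather than as regular expressions.

```r
# Demo data from the question
df <- data.frame(
  id = c(1, 2, 4, 5, 6, 7),
  full_text = c("I like banana", "I ate an apple", "I prefer bananas and apples",
                "Grapes", "My applepie is tasty", "Fruitsalad"),
  stringsAsFactors = FALSE
)
keywords <- c("banana", "apple", "grape")

# Hypothetical helper: for one text, keep the keywords that occur in it
match_keywords <- function(txt, keywords) {
  # One TRUE/FALSE per keyword; the text is lowercased before matching
  hits <- vapply(keywords, grepl, logical(1), x = tolower(txt), fixed = TRUE)
  paste(keywords[hits], collapse = ", ")
}

# Apply the helper to every text; rows without matches get ""
df$key_word <- vapply(df$full_text, match_keywords, character(1),
                      keywords = keywords, USE.NAMES = FALSE)
```

One difference from str_extract_all: this lists each keyword at most once per text and in the order of the keywords vector, whereas str_extract_all returns matches in order of appearance and repeats a keyword each time it occurs.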

Upvotes: 4
