Fraxxx

Reputation: 114

Finding which words from a list are present in a column of a data frame using grepl in R

I have a dataframe df:

df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"), 
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")

I also have a list of words:

wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")

I checked whether the words from wordlist are present in the column text by running grepl for each word and unlisting the result:

library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]

The problem is that I want to find only exact (whole-word) matches of the words in wordlist in the column text. grepl also returns partial matches: for example, AudiA6 in text is matched by the word Audi in wordlist. My dataframe is also very big, and grepl takes a long time to run. If possible, please recommend another approach. I want something like this:

df <- structure(list(page = c(12, 6, 9, 65), 
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
 "Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement", 
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L), 
class = c("data.table", "data.frame"))

Upvotes: 2

Views: 1013

Answers (1)

Cath

Reputation: 24074

You can use str_extract_all from stringr after adding word boundaries (\\b) around each of the words you want to extract, so that only whole-word matches are considered. You also need to collapse all the words with "|" to indicate an "or":

sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange"               "engine,replacement,BMW" "brand"                  "Volkswagen,company,BMW"
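
If you would rather not depend on stringr, roughly the same thing can be sketched in base R with gregexpr and regmatches. This is only an illustration: the names texts, pattern, hits and matches are made up here, and the text is written out as a plain character vector rather than the factor column from the question.

```r
# Example data from the question, as a plain character vector
texts <- c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# One alternation wrapped in word boundaries: \b(Audi|BMW|...)\b
pattern <- paste0("\\b(", paste(wordlist, collapse = "|"), ")\\b")

# gregexpr finds every match position, regmatches extracts the matched text
hits <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))

# Collapse the matches of each row into a comma-separated string
matches <- vapply(hits, paste, character(1), collapse = ",")
matches
# [1] "exchange"               "engine,replacement,BMW" "brand"                  "Volkswagen,company,BMW"
```

As with str_extract_all, the words come out in the order they appear in each text, and the match is case-sensitive.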

If you want to put it in your data.table:

df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
#   page                                              text                  match
#1:   12 ToyotaCorolla is offering new car exchange offers               exchange
#2:    6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3:    9                  I just bought a brand new AudiA6                  brand
#4:   65           Volkswagen is the parent company of BMW Volkswagen,company,BMW
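
Note that the words above are listed in the order they appear in each text, while the expected output in the question lists them in wordlist order (for example "BMW,engine,replacement"). If that order matters, one possible sketch is to split each text into words and intersect them with wordlist. The names texts, tokens and matches are only illustrative, and this assumes case-sensitive matching and that every entry in wordlist is a single word with no spaces or punctuation:

```r
# Example data from the question, as a plain character vector
texts <- c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Split each text on runs of non-alphanumeric characters
tokens <- strsplit(texts, "[^[:alnum:]]+")

# intersect() keeps the order of its first argument, i.e. wordlist order,
# and only whole tokens can match, so "AudiA6" does not count as "Audi"
matches <- vapply(tokens,
                  function(tok) paste(intersect(wordlist, tok), collapse = ","),
                  character(1), USE.NAMES = FALSE)
matches
# [1] "exchange"               "BMW,engine,replacement" "brand"                  "BMW,Volkswagen,company"
```

Because this avoids running a regular expression per word, it may also be worth timing on a large table, but I have not benchmarked it.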

Upvotes: 6
