flamenco

Reputation: 2840

Check for a list of strings (words) in a text (phrase)

Is there an elegant way, other than looping, to test whether any word from a list occurs in a phrase? I'm thinking of something like a list comprehension or one of the apply functions. Example:

words <- c("word1", "word2", "word3")
text <- "This is a text made off of word1 and possibly word2 and so on."

The output should be TRUE if any of the words is found in text, and it should also show which words were found.

Upvotes: 4

Views: 6876

Answers (3)

Abraham JA

Reputation: 336

rflashtext can help if length(words) is huge and you only need exact (whole-word) matches, i.e. the equivalent of the '\\b%s\\b' pattern.

library(rflashtext)

processor <- keyword_processor$new()
processor$add_keys_words(keys = words)
words_found <- processor$find_keys(sentence = text)
length(words_found) > 0 # TRUE if any of the words is found
do.call(rbind, words_found) # Found words with span info (end = position of last character + 1)
     word    start end
[1,] "word1" 28    33 
[2,] "word2" 47    52

A little benchmark:

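library(microbenchmark)
library(stringr)

n <- 10 # number of words; the second table below uses n = 1000
words <- sprintf("word%s", seq_len(n))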
text <- "This is a text made off of word1 and possibly word2 and so on."

processor <- keyword_processor$new()
processor$add_keys_words(keys = words)

microbenchmark(sapply = sapply(words, grepl, text),
               sapply_exact = sapply(words, function(x) grepl(sprintf('\\b%s\\b', x), text)),
               stringr = str_locate_all(text, words),
               stringr_exact = str_locate_all(text, sprintf('\\b%s\\b', words)),
               rflashtext = processor$find_keys(sentence = text), unit = "relative")

# n = 10

Unit: relative
          expr      min       lq     mean   median       uq        max neval
        sapply 1.009416 1.031136 1.024220 1.029903 1.026094  1.2367347   100
  sapply_exact 1.468927 1.468864 1.670976 1.468777 1.471380 13.8183673   100
       stringr 1.000000 1.000000 1.000000 1.000000 1.000000  1.0000000   100
 stringr_exact 1.137476 1.143773 1.122424 1.131047 1.135522  0.8173469   100
    rflashtext 1.841808 1.903846 1.885012 1.878628 1.853535  3.1908163   100

# n = 1000

Unit: relative
          expr      min       lq     mean   median       uq      max neval
        sapply 44.29630 36.76002 32.37855 32.26093 29.48758 15.99403   100
  sapply_exact 67.19219 55.81263 50.64435 48.81392 45.08299 57.05292   100
       stringr 42.92593 35.09389 30.65920 30.61525 27.93996 13.76364   100
 stringr_exact 48.59459 39.88304 35.12447 34.80867 32.10688 16.74084   100
    rflashtext  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000   100

Upvotes: 1

jbaums

Reputation: 27388

grepl to the rescue.

sapply(words, grepl, text)

# word1 word2 word3 
#  TRUE  TRUE FALSE

This considers each element of words in turn and returns a logical value for each (TRUE if the word appears in text, FALSE if not).
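To get the two things the question asks for (a single TRUE/FALSE and the matching words), the per-word result can be reduced with any() and used as an index. This is a minimal sketch on the question's sample data, not part of the original answer; vapply is swapped in for sapply only to pin the return type, and the name found is purely illustrative.

```r
words <- c("word1", "word2", "word3")
text <- "This is a text made off of word1 and possibly word2 and so on."

# One grepl call per word; vapply guarantees a logical vector named by word
found <- vapply(words, grepl, logical(1), x = text)

any(found)    # TRUE if at least one word occurs in text
words[found]  # the words that were found
```

Because found keeps the words as names, names(found)[found] would give the same set of matches.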

If you want to ensure that the exact words are sought, then you can use:

sapply(words, function(x) grepl(sprintf('\\b%s\\b', x), text))

This will prevent word1 from returning TRUE when text has sword123 but lacks word1. It might make less sense though if words has multi-word elements.
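The difference between the two patterns can be checked on a small case. The sentence in text2 is a made-up example (an assumption for illustration), chosen so that word1 only occurs inside a longer token.

```r
# Hypothetical sentence: contains sword123 but no standalone word1
text2 <- "He drew sword123 and left."

# Plain pattern: matches the substring inside sword123
substring_hit <- grepl("word1", text2)

# Word-boundary pattern: needs a boundary on both sides of word1
exact_hit <- grepl(sprintf("\\b%s\\b", "word1"), text2)

substring_hit
exact_hit
```

Here the plain pattern reports a match while the word-boundary pattern does not, because word1 is surrounded by other word characters.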

Upvotes: 9

Philippe

Reputation: 204

Look at the package stringr. I think the function you need is str_detect or str_locate_all. It is easy to use these functions inside sapply, though both are vectorised over the pattern, so they can also be called directly:

library(stringr)

str_detect(text, words)

str_locate_all(text, words)
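As a rough sketch of how the str_detect result could answer both parts of the question, assuming the stringr package is installed and using the question's sample data (the name hits is illustrative only):

```r
library(stringr)

words <- c("word1", "word2", "word3")
text <- "This is a text made off of word1 and possibly word2 and so on."

# text is recycled against each pattern, giving one logical per word
hits <- str_detect(text, words)

any(hits)    # TRUE if at least one word occurs in text
words[hits]  # the words that were found
```

str_locate_all(text, words) returns a list with one start/end matrix per word, which is useful when the positions of the matches matter as well.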

Upvotes: 2
