Reputation: 2840
Is there an elegant way, other than looping, to test whether any word from a list is found in a phrase?
I'm thinking of something like a list comprehension or one of the apply functions.
Ex:
words <- c("word1", "word2", "word3")
text <- "This is a text made off of word1 and possibly word2 and so on."
The output should be TRUE if any of the words is found in text, and it should also show which word was found.
Upvotes: 4
Views: 6876
Reputation: 336
rflashtext can help if length(words) is huge and you only need to match exact words (the equivalent of the '\\b%s\\b' pattern).
library(rflashtext)
processor <- keyword_processor$new()
processor$add_keys_words(keys = words)
words_found <- processor$find_keys(sentence = text)
length(words_found) > 0 # TRUE if any of the words is found
do.call(rbind, words_found) # Found words with span info (end = end of word + 1)
word start end
[1,] "word1" 28 33
[2,] "word2" 47 52
A little benchmark:
library(microbenchmark)
library(stringr)
library(rflashtext)
n <- 10 # the results below were obtained with n = 10 and n = 1000
words <- sprintf("word%s", seq_len(n))
text <- "This is a text made off of word1 and possibly word2 and so on."
processor <- keyword_processor$new()
processor$add_keys_words(keys = words)
microbenchmark(sapply = sapply(words, grepl, text),
               sapply_exact = sapply(words, function(x) grepl(sprintf('\\b%s\\b', x), text)),
               stringr = str_locate_all(text, words),
               stringr_exact = str_locate_all(text, sprintf('\\b%s\\b', words)),
               rflashtext = processor$find_keys(sentence = text), unit = "relative")
# n = 10
Unit: relative
expr min lq mean median uq max neval
sapply 1.009416 1.031136 1.024220 1.029903 1.026094 1.2367347 100
sapply_exact 1.468927 1.468864 1.670976 1.468777 1.471380 13.8183673 100
stringr 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
stringr_exact 1.137476 1.143773 1.122424 1.131047 1.135522 0.8173469 100
rflashtext 1.841808 1.903846 1.885012 1.878628 1.853535 3.1908163 100
# n = 1000
Unit: relative
expr min lq mean median uq max neval
sapply 44.29630 36.76002 32.37855 32.26093 29.48758 15.99403 100
sapply_exact 67.19219 55.81263 50.64435 48.81392 45.08299 57.05292 100
stringr 42.92593 35.09389 30.65920 30.61525 27.93996 13.76364 100
stringr_exact 48.59459 39.88304 35.12447 34.80867 32.10688 16.74084 100
rflashtext 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
Upvotes: 1
Reputation: 27388
grepl to the rescue.
sapply(words, grepl, text)
# word1 word2 word3
# TRUE TRUE FALSE
This considers each element of words in turn and returns a logical (TRUE if the word appears in text, and FALSE if not).
If you want to ensure that the exact words are sought, then you can use:
sapply(words, function(x) grepl(sprintf('\\b%s\\b', x), text))
This will prevent word1 from returning TRUE when text has sword123 but lacks word1. It might make less sense, though, if words has multi-word elements.
Upvotes: 9
Reputation: 204
Look at the package stringr. I think the function you need is str_detect or str_locate_all. It's easy to include this function in sapply.
library(stringr)
str_detect(text, words)
str_locate_all(text, words)
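As a hedged illustration of how the result might be used: str_detect is vectorised over the pattern, so it returns one logical per element of words and can be used directly to pick out the matching words. The name hits is just an illustrative choice, and the sketch assumes the stringr package is installed.

```r
library(stringr)

words <- c("word1", "word2", "word3")
text <- "This is a text made off of word1 and possibly word2 and so on."

# One logical per element of words (text is recycled against each pattern)
hits <- str_detect(text, words)

any(hits)    # TRUE if any of the words is found
words[hits]  # the words that were found
```

Here the elements of words are interpreted as regular expressions, so words containing characters such as "." or "(" may need escaping.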
Upvotes: 2