Find words in a text divided into sentences in R

Question

Hello there. I have a text and I'd like to retrieve only the sentences that contain certain words. Here is an example.

my_text<- tolower(c("Pro is a molecule that can be found in the air. This molecule spreads glitter and allows bees to fly over the rainbow. For flying, bees need another molecule that is Sub. Sub is activated and so Sub is a substrate. After eating that molecule bees become very speed and they can fly highly. Pro activate Sub. This means that Sub is catalyzed by Pro."))


my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
         "Sab", "Seb", "Sib", "Sob", "Sub"))

sent <- unlist(strsplit(my_text, "\\."))

sent <- sent[grep(pattern = my_words, sent, ignore.case = T)]

Using this code, I get this warning message:

Warning message:
In grep(pattern = my_words, sent, ignore.case = T) :
  argument 'pattern' has length > 1 and only the first element will be used

How can I avoid this? I want to check every word in my vector. I looked at the stringr package but couldn't find a solution.

The code can be changed in any way; I've just shown what I tried!

Ronak Shah · Accepted Answer

You can create a regex pattern from my_words and use it in grep.

my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
                      "Sab", "Seb", "Sib", "Sob", "Sub"))
# Backslashes must be doubled inside R string literals
sent <- unlist(strsplit(my_text, "\\."))
grep(paste0('\\b', my_words, '\\b', collapse = '|'), sent, ignore.case = TRUE, value = TRUE)

#[1] "pro is a molecule that can be found in the air"     
#[2] " for flying, bees need another molecule that is sub"
#[3] " sub is activated and so sub is a substrate"        
#[4] " pro activate sub"                                  
#[5] " this means that sub is catalyzed by pro"

I have included word boundaries (\b, written as '\\b' inside an R string) so that only complete words match. For example, 'pre' will not match inside 'spreads'.
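To see what the word boundaries change, here is a minimal base R sketch on a shortened text. The names txt, loose, strict, loose_hits and strict_hits are made up for this illustration, and the trimws() call is an optional extra that only removes the leading space left after each split.

```r
# Shortened example text (an assumption for illustration, not the original data)
txt <- tolower("Pro is a molecule. This molecule spreads glitter. Pro activate Sub.")
my_words <- c("pro", "pre", "sub")

# Split on literal dots and trim the leading space left on each sentence
sent <- trimws(unlist(strsplit(txt, "\\.")))

# Without boundaries: "pre" also matches inside "spreads"
loose <- paste(my_words, collapse = "|")
loose_hits <- grep(loose, sent, value = TRUE)

# With boundaries: only whole words match
strict <- paste0("\\b", my_words, "\\b", collapse = "|")
strict_hits <- grep(strict, sent, value = TRUE)

print(loose_hits)
print(strict_hits)
```

Here loose_hits should keep all three sentences, while strict_hits should drop the "spreads" sentence because 'pre' only occurs there as part of a longer word.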
