Reputation: 97
Hello there, I have a text and I'd like to retrieve only the sentences that contain certain words. Here is an example.
my_text<- tolower(c("Pro is a molecule that can be found in the air. This molecule spreads glitter and allows bees to fly over the rainbow. For flying, bees need another molecule that is Sub. Sub is activated and so Sub is a substrate. After eating that molecule bees become very speed and they can fly highly. Pro activate Sub. This means that Sub is catalyzed by Pro."))
my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru",
"Sab", "Seb", "Sib", "Sob", "Sub"))
sent <- unlist(strsplit(my_text, "\\."))
sent <- sent[grep(pattern = my_words, sent, ignore.case = T)]
Using this code, I get this warning message:
Warning message:
In grep(pattern = my_words, sent, ignore.case = T) :
argument 'pattern' has length > 1 and only the first element will be used
How can I avoid this? I want to match all the words in my vector. I looked at the stringr package but I couldn't find a solution.
The code can change in any way; I just showed what I've done!
Upvotes: 0
Views: 368
Reputation: 21400
You can define the words you're looking for as an alternation pattern, with \\b wrapped around them to make sure they are matched only when occurring as whole words (and not as parts of other words, such as pro in professional), and pass that pattern to the subsetting method you used in your post.
I'd also recommend that you use trimws to, well, trim the whitespace:
sent <- trimws(unlist(strsplit(my_text, "\\.")))
pattern <- paste0("\\b", my_words, "\\b", collapse = "|")
sent[grepl(pattern, sent)]
You mention the stringr package. The solution based on str_detect would be:
library(stringr)
sent[str_detect(sent, pattern)]
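For reference, here is a minimal self-contained sketch of the base R approach above, run on a shortened version of the question's text. The helper name sentences_with_words is hypothetical and only there for illustration; it is not part of base R or stringr.

```r
# Hypothetical helper (not from any package): keep the sentences of `text`
# that contain at least one of `words` as a whole word.
sentences_with_words <- function(text, words) {
  # Split on literal periods and trim the surrounding whitespace
  sentences <- trimws(unlist(strsplit(text, "\\.")))
  # Collapse the words into one alternation pattern: \bpro\b|\bpre\b|...
  pattern <- paste0("\\b", words, "\\b", collapse = "|")
  # A single pattern string avoids the "length > 1" warning
  sentences[grepl(pattern, sentences, ignore.case = TRUE)]
}

# Shortened sample input, assumed for illustration
my_text <- "Pro is a molecule. This molecule spreads glitter. Pro activate Sub."
my_words <- c("Pro", "Pre", "Sub")

result <- sentences_with_words(my_text, my_words)
print(result)
```

The key point is that grepl receives one pattern string rather than a vector, so every word is checked in a single pass.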
Upvotes: 0
Reputation: 388982
You can create a regex pattern from my_words and use it in grep.
my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru",
"Sab", "Seb", "Sib", "Sob", "Sub"))
sent <- unlist(strsplit(my_text, "\\."))
grep(paste0('\\b', my_words, '\\b', collapse = '|'), sent, ignore.case = TRUE, value = TRUE)
#[1] "pro is a molecule that can be found in the air"
#[2] " for flying, bees need another molecule that is sub"
#[3] " sub is activated and so sub is a substrate"
#[4] " pro activate sub"
#[5] " this means that sub is catalyzed by pro"
I have included word boundaries (\\b) so that only complete words match. For example, 'pre' will not match inside 'spreads'.
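The effect of the word boundaries can be checked directly. This is a small sketch using one sentence from the question; the variable names are made up for illustration.

```r
# One sentence from the question's text
x <- "this molecule spreads glitter"

# Without boundaries, "pre" is found inside "spreads"
without_boundaries <- grepl("pre", x)

# With \\b on both sides, "pre" must stand alone as a word
with_boundaries <- grepl("\\bpre\\b", x)

# A standalone "pre" still matches with boundaries
standalone <- grepl("\\bpre\\b", "pre is a word here")

print(c(without_boundaries, with_boundaries, standalone))
```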
Upvotes: 1