Reputation: 49
I have a very long vector of brief texts in R (say, length 10 million). The first five items of the list are as follows:
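txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")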
I have a dictionary, which we will say is composed of the words "angry" and "unhappy".
What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].
I have tried solutions involving quanteda and tm, and they all fail because the resulting document-feature matrix is too large to hold in memory. Bonus points for any solution using qdap, dplyr, or termco.
Upvotes: 4
Views: 812
Reputation: 887501
We can use base R methods with gregexpr, regmatches, and Reduce
Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0
Or a faster approach would be
Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
        function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
NOTE: With large datasets and a large number of dictionary elements, this method never builds a document-feature matrix, so it avoids the memory limitations described in the question.
txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
"I am an angry, angry, tiger." ,"Beep boop.")
dict <- c("angry", "unhappy")
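If even the intermediate gregexpr match lists for all 10 million texts are too large to hold at once, the same idea can be applied block by block. This is only a minimal sketch; the count_dict helper and the chunk size are illustrative, not part of the approach above:

count_dict <- function(txt, dict, chunk = 100000L) {
  out <- numeric(length(txt))
  # split the indices into blocks so the gregexpr match data for the
  # full vector is never materialised at once
  idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk))
  for (i in idx) {
    out[i] <- Reduce(`+`, lapply(dict, function(x)
      vapply(gregexpr(x, txt[i]), function(y) sum(attr(y, "match.length") > 0), 0)))
  }
  out
}
count_dict(txt, dict, chunk = 2)
#[1] 1 1 2 2 0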
Upvotes: 6
Reputation: 51592
Using the stringi package,
library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
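If the dictionary entries should be matched as literal strings rather than as a regular expression (for example, if a word contained characters such as . or +), a sketch of the same count with stri_count_fixed, summed over the dictionary, would be:

Reduce(`+`, lapply(v2, function(x) stri_count_fixed(v1, x)))
#[1] 1 1 2 2 0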
DATA
v1 <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")
Upvotes: 8