Reputation: 49
I have a very long vector of brief texts in R (say, length 10 million). The first five items of the list are as follows:
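txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")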
I have a dictionary, which we will say is composed of the words "angry" and "unhappy".
What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].
I have tried solutions involving quanteda and tm, and they all fail because the resulting document-feature matrix is too large to hold in memory. Bonus points for any solution using qdap, dplyr, or termco.
Upvotes: 4
Views: 812
Reputation: 887501
We can use base R methods with gregexpr, regmatches, and Reduce
Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0
Or a faster approach would be
Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
        function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
NOTE: With large datasets and a large number of dictionary elements, this method never builds a document-feature matrix, so it avoids the memory limitations described in the question.
txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
"I am an angry, angry, tiger." ,"Beep boop.")
dict <- c("angry", "unhappy")
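If even the intermediate gregexpr match lists for all 10 million texts are too large to hold at once, the same idea can be applied block by block. This is only a minimal sketch; the count_dict helper and the chunk size are illustrative, not part of the approach above:

count_dict <- function(txt, dict, chunk = 100000L) {
  out <- numeric(length(txt))
  # split the indices into blocks so the gregexpr match data for the
  # full vector is never materialised at once
  idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk))
  for (i in idx) {
    out[i] <- Reduce(`+`, lapply(dict, function(x)
      vapply(gregexpr(x, txt[i]), function(y) sum(attr(y, "match.length") > 0), 0)))
  }
  out
}
count_dict(txt, dict, chunk = 2)
#[1] 1 1 2 2 0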
Upvotes: 6
Reputation: 51592
Using the stringi package,
library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
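If the dictionary entries should be matched as literal strings rather than as a regular expression (for example, if a word contained characters such as . or +), a sketch of the same count with stri_count_fixed, summed over the dictionary, would be:

Reduce(`+`, lapply(v2, function(x) stri_count_fixed(v1, x)))
#[1] 1 1 2 2 0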
DATA
v1 <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")
Upvotes: 8