mlachans

Reputation: 49

Fast count of word matches in dictionary for vector of texts in R

I have a very long vector of brief texts in R (say, length 10 million). The first five items of the vector are as follows:

  1. "I am an angry tiger."
  2. "I am unhappy clam."
  3. "I am an angry and unhappy tiger."
  4. "I am an angry, angry, tiger."
  5. "Beep boop."

I have a dictionary, which we will say is composed of the words "angry" and "unhappy".

What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].

I have tried solutions involving quanteda and tm, but they all fail because I cannot store a large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, and termco.

Upvotes: 4

Views: 812

Answers (2)

akrun

Reputation: 887501

We can use base R methods with gregexpr/regmatches and Reduce:

Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0

Or a faster approach would be:

Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
          function(y) sum(attr(y, "match.length")>0), 0)))
#[1] 1 1 2 2 0

NOTE: Because this approach processes the texts directly and never builds a document-feature matrix, it avoids the memory limitations mentioned in the question, even with large datasets and many dictionary entries.
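On a 10-million-element vector, the intermediate match data can still be reduced by processing the vector in slices. A minimal sketch of that idea (the helper name `count_dict_hits` and the chunk size are made up for illustration; `fixed = TRUE` assumes the dictionary entries are plain words, not regexes):

```r
# Hypothetical helper: count dictionary hits per text, working through
# the vector one chunk at a time so only one chunk's match data
# is held in memory at once.
count_dict_hits <- function(txt, dict, chunk_size = 100000L) {
  idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
  unlist(lapply(idx, function(i) {
    chunk <- txt[i]
    # Same counting logic as above, applied to the current chunk only
    Reduce(`+`, lapply(dict, function(w)
      lengths(regmatches(chunk, gregexpr(w, chunk, fixed = TRUE)))))
  }), use.names = FALSE)
}

txt <- c("I am an angry tiger.", "I am unhappy clam.",
         "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")

count_dict_hits(txt, dict, chunk_size = 2L)
#[1] 1 1 2 2 0
```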

data

txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
          "I am an angry, angry, tiger." ,"Beep boop.") 
dict <- c("angry", "unhappy")

Upvotes: 6

Sotos

Reputation: 51592

Using the stringi package:

library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0

DATA

dput(v1)
c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
"I am an angry, angry, tiger.", "Beep boop.")
dput(v2)
c("angry", "unhappy")
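One caveat worth noting (an observation added here, not part of the original answer): the alternation pattern counts substring hits too, so for example "angrily" would count as a match for "angry". If whole-word counts are wanted, wrapping the pattern in `\b` word boundaries is one possible fix:

```r
library(stringi)

v1 <- c("I am angrily unhappy.", "The tiger is angry.")
v2 <- c("angry", "unhappy")

# Substring matching: "angrily" also counts as a hit for "angry"
stri_count_regex(v1, paste(v2, collapse = "|"))
#[1] 2 1

# Whole-word matching via \b word boundaries
stri_count_regex(v1, paste0("\\b(", paste(v2, collapse = "|"), ")\\b"))
#[1] 1 1
```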

Upvotes: 8
