Tshabat

Reputation: 59

Count how many keywords from a list occur in a document

I recently wrote a function that uses parallel processing to count how often certain keywords occur in each document.

Now I would like to adjust the code so that, instead of counting the total number of keyword occurrences in each document, it counts how many distinct keywords appear in each document.

Reproducible example:

keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition","BMW M3 that drives like a Toyota but is a BMW")

count_strings <- function(x, words){sum(unlist(strsplit(x, ' ')) %in% words)}

library(parallel)
mcluster <- makeCluster(detectCores())
number_of_keywords <- parSapply(mcluster, documents, count_strings, keywords, USE.NAMES = FALSE)
stopCluster(mcluster)

As written, the code counts the total number of keyword occurrences in each document, which gives 4, 4.

But I would like to adjust my function so that it counts the number of distinct keywords that appear in each document. The correct answer should be 2, 3.

Upvotes: 1

Views: 75

Answers (1)

Tim Biegeleisen

Reputation: 521914

Here is a base R option using sapply and grepl:

keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition","BMW M3 that drives like a Toyota but is a BMW")
keywords <- paste0("\\b", keywords, "\\b")
res <- sapply(keywords, function(x) grepl(x, documents))
rowSums(res)

[1] 2 3
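
If you would rather keep the structure of the function from the question, a minimal sketch of one possible adaptation follows. It assumes that keywords appear as whole whitespace-separated tokens, so a keyword with punctuation attached (such as "Prius,") would not be counted. The name count_distinct_keywords is hypothetical, and plain sapply is used here only to keep the example short.

```r
keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition",
               "BMW M3 that drives like a Toyota but is a BMW")

# Hypothetical helper: counts how many distinct keywords occur as
# whitespace-separated tokens in a single document.
count_distinct_keywords <- function(x, words) {
  tokens <- unlist(strsplit(x, " ", fixed = TRUE))
  # Test each keyword once, instead of testing each token
  sum(unique(words) %in% tokens)
}

number_of_keywords <- sapply(documents, count_distinct_keywords,
                             words = keywords, USE.NAMES = FALSE)
number_of_keywords
# [1] 2 3
```

With the cluster from the question, the same helper should drop into parSapply(mcluster, documents, count_distinct_keywords, keywords, USE.NAMES = FALSE).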

Note the critical step above, where we wrap each keyword in word boundaries (\\b). This prevents false positive matches when a keyword happens to be a substring of a larger word.
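
To illustrate the effect of the word boundaries, here is a small sketch using two made-up sentences (docs is a hypothetical vector, not taken from the question):

```r
docs <- c("The Priuses are parked outside", "A Prius is parked outside")

# Without word boundaries, "Prius" also matches inside "Priuses"
without_boundary <- grepl("Prius", docs)
without_boundary
# [1] TRUE TRUE

# With word boundaries, only the standalone word matches
with_boundary <- grepl("\\bPrius\\b", docs)
with_boundary
# [1] FALSE  TRUE
```

Keep in mind that the keywords are treated as regular expressions here, so a keyword containing characters such as "." or "+" would need to be escaped first.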

Upvotes: 1
