Reputation: 59
I recently wrote a function that uses parallel processing to count the frequency of occurrences of certain keywords within a document.
Now I would like to adjust the code so that, instead of counting how many times the keywords appear in each document in total, it counts how many distinct keywords appear in each document.
Reproducible example:
keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition","BMW M3 that drives like a Toyota but is a BMW")
count_strings <- function(x, words) {
  # split the document on spaces and count the tokens that match a keyword
  sum(unlist(strsplit(x, " ")) %in% words)
}
library(parallel)
mcluster <- makeCluster(detectCores())
number_of_keywords <- parSapply(mcluster, documents, count_strings, keywords, USE.NAMES=F)
stopCluster(mcluster)
As written, the code counts the total occurrences of the keywords in each document, which gives 4, 4.
I would instead like the function to count how many distinct keywords appear in each document; the correct answer should read 2, 3.
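For reference, one minimal sketch (an assumption, not the only option) keeps the existing parallel setup and simply flips the direction of the `%in%` test, so each keyword is counted at most once per document:

```r
keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition",
               "BMW M3 that drives like a Toyota but is a BMW")

# sum(words %in% tokens) instead of sum(tokens %in% words):
# each keyword contributes at most 1 per document.
count_distinct <- function(x, words) {
  sum(words %in% unlist(strsplit(x, " ")))
}

library(parallel)
mcluster <- makeCluster(detectCores())
number_of_keywords <- parSapply(mcluster, documents, count_distinct,
                                keywords, USE.NAMES = FALSE)
stopCluster(mcluster)
number_of_keywords
# [1] 2 3
```

Note that `count_distinct` is a hypothetical name, and splitting on spaces assumes keywords are never glued to punctuation, which happens to hold for this data.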
Upvotes: 1
Views: 75
Reputation: 521914
Here is a base R option using apply and grepl:
keywords <- c("Toyota", "Prius", "BMW", "M3")
documents <- c("New Toyota Prius for sale, the Toyota Prius is in good condition","BMW M3 that drives like a Toyota but is a BMW")
keywords <- paste0("\\b", keywords, "\\b")
res <- sapply(keywords, function(x) grepl(x, documents))
rowSums(res)
[1] 2 3
Note the critical step above where we wrap each keyword in word boundaries (\\b). This prevents false-positive matches when a keyword happens to be a substring of a larger word.
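To illustrate why the boundaries matter, here is a small sketch using a hypothetical document containing "M30", a string that is not in the original data:

```r
# Without word boundaries, the keyword "M3" also matches inside "M30":
grepl("M3", "BMW M30 for sale")        # TRUE (false positive)
grepl("\\bM3\\b", "BMW M30 for sale")  # FALSE (whole-word match only)
```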
Upvotes: 1