Reputation: 59
I have quite a big set of keywords that I need to compare against an even bigger corpus of documents, counting the number of occurrences of each keyword.
Since the calculation takes hours, I decided to try parallel processing. On this forum I found the mclapply function from the parallel package, which seems helpful.
Being very new to R, I could not get the code working (see below for a short version). More specifically, I got this error:
"Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found"
rm(list = ls())

library(stringr)    # needed for str_count (moved up so the single-core run works)
library(parallel)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan",
              "skyline", "1988", "1400", "159")

# Count whole-word keyword matches in each document
countstrings <- function(x) {
  str_count(x, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
}

# Normal way with one processor
number_of_keywords <- countstrings(df)
# Result: [1] 3 2 2

# Attempt at parallel processing (this is the line that fails)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
number_of_keywords <- mclapply(cl, countstrings(df))
stopCluster(cl)

# Error in get(as.character(FUN), mode = "function", envir = envir) :
#   object 'FUN' of mode 'function' was not found
Any help is appreciated!
Upvotes: 1
Views: 3580
Reputation: 21749
This function should be faster. Here's an alternative way to do the parallel processing, using parSapply
(which returns a vector instead of a list):
# function to count exact whole-word keyword matches per document
count_strings <- function(x, words)
{
  sum(unlist(strsplit(x, ' ')) %in% words)
}

library(parallel)   # count_strings uses only base R, so stringr isn't needed here

mcluster <- makeCluster(detectCores())   # using all cores
number_of_keywords <- parSapply(mcluster, df, count_strings, keywords, USE.NAMES = FALSE)
stopCluster(mcluster)

number_of_keywords
# [1] 3 2 2
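
For reference, the error in the question comes from mixing the two interfaces in the parallel package: mclapply() is the fork-based function and takes the data and the function directly (no cluster object), while the cluster-based functions such as parLapply()/parSapply() take the cluster as their first argument. Here is a minimal sketch of both corrected calls using the original countstrings(); the clusterEvalQ()/clusterExport() lines are my assumption about what the workers need, since countstrings() relies on stringr and on the global keywords object:

library(stringr)
library(parallel)

# Fork-based: no cluster object; forked workers inherit memory (Unix/macOS only)
number_of_keywords <- mclapply(df, countstrings, mc.cores = detectCores() - 1)

# Cluster-based: the cluster is the first argument, and each worker
# needs the package and objects that countstrings() depends on
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(stringr))   # load stringr on every worker
clusterExport(cl, "keywords")        # ship the keywords vector to the workers
number_of_keywords <- parLapply(cl, df, countstrings)
stopCluster(cl)

Both return a list (use unlist() or parSapply() if you want a plain vector like the output above).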
Upvotes: 1