Reputation: 851
I have a data frame with multiple IDs and I am trying to perform feature extraction on the different ID sets. The data looks like this:
    id  x    y
1 3812 60    7
2 3812 63  105
3 3812 65 1000
4 3812 69    8
5 3812 75   88
6 3812 78   13
where id takes on about 200 different values. I am trying to extract features from the (x, y) data, and I'd like to do it in parallel, since for some datasets doing it sequentially takes about 20 minutes. Right now I am using dplyr as such:
x = d %>% group_by(id) %>% do(data.frame(getFeatures(., func_args)))
where func_args are just additional inputs to the function getFeatures. I am trying to use plyr::ldply with .parallel = TRUE to do this, but there is a problem: within getFeatures, I use another function that I've written. So when I try to run in parallel, I get an error:
Error in do.ply(i) :
  task 1 failed - "could not find function "desparsify""
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
where desparsify is a custom function written to process the (x, y) data (it effectively adds zeros at x locations that are not present in the dataset). I get a similar error when I try to use the cosine function from the lsa package.
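Roughly, desparsify does something like this (a simplified sketch, not the real implementation):

desparsify <- function(x, y, grid = seq(min(x), max(x))) {
  ## start with zeros at every x location on the grid
  out <- setNames(numeric(length(grid)), grid)
  ## fill in the observed y values at their x locations
  out[as.character(x)] <- y
  out
}

desparsify(c(60, 63, 65), c(7, 105, 1000))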
Is there a way to use parallel processing when calling external/non-base functions in R?
Upvotes: 0
Views: 720
Reputation: 6805
You don't show how you set up plyr to parallelize, but I think I can guess what you're doing. I also guess you're on Windows. Here's a teeny standalone example illustrating what's going on:
library(plyr)
## on Windows, doParallel::registerDoParallel(2) is equivalent to:
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
## Error in do.ply(i) :
## task 1 failed - "could not find function "desparsify""
If you use doFuture instead of doParallel, the underlying future framework will make sure 'desparsify' is found, e.g.:
library(plyr)
doFuture::registerDoFuture()
future::plan("multisession", workers = 2)
desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
str(y)
## List of 3
## $ : num 1
## $ : num 1.41
## $ : num 1.73
(disclaimer: I'm the author of the future framework)
PS. Note that plyr is a legacy package no longer maintained. You might want to look into future.apply, furrr, or foreach with doFuture as alternatives for parallelization.
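For example, a future.apply version of the toy example above (a sketch, reusing the same desparsify stand-in):

library(future.apply)
future::plan("multisession", workers = 2)

desparsify <- function(x) sqrt(x)

## future_lapply automatically identifies and exports globals such as 'desparsify'
y <- future_lapply(1:3, function(x) desparsify(x))
str(y)
## List of 3
##  $ : num 1
##  $ : num 1.41
##  $ : num 1.73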
Upvotes: 2
Reputation: 321
There is. Take a look at the parApply family of functions; I usually use parLapply. You'll need to create a cluster with cl <- makeCluster(number of cores) and pass it to parLapply, together with a vector of your ids (the details depend on how your function identifies the entries for each id) and your function. parLapply returns a list with the output of your function applied to each group in parallel. For example, using the d, getFeatures, and func_args from the question:
library(parallel)

cl <- makeCluster(4)  # set to your number of cores
ids <- unique(d$id)
## export the variables/functions the workers need
clusterExport(cl = cl, varlist = c("d", "getFeatures", "desparsify", "func_args"))
result <- parLapply(cl = cl, ids, function(i) getFeatures(d[d$id == i, ], func_args))
stopCluster(cl)
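If getFeatures returns a data frame (or one row) per id, the resulting list can then be combined back into a single data frame:

result_df <- do.call(rbind, result)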
Upvotes: 1