CopyOfA

Reputation: 851

Using plyr::ldply in parallel with a function within a function

I have a data frame with multiple IDs and I am trying to perform feature extraction on the different ID sets. The data looks like this:

    id  x    y 
1 3812 60    7
2 3812 63  105
3 3812 65 1000
4 3812 69    8
5 3812 75   88
6 3812 78   13

where id takes on about 200 different values. I am trying to extract features from the (x, y) data, and I'd like to do it in parallel, since for some datasets doing it sequentially can take about 20 minutes. Right now I am using dplyr like this:

x = d %>% group_by(id) %>% do(data.frame(getFeatures(., func_args)))

where func_args are just additional inputs to the function getFeatures. I am trying to use plyr::ldply with .parallel = TRUE to do this instead, but there is a problem: within getFeatures, I call another function that I've written. So, when I try to run it in parallel, I get an error:

Error in do.ply(i) : 
  task 1 failed - "could not find function "desparsify""
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’

where desparsify is a custom function written to process the (x,y) data (it effectively adds zeros to x locations that are not present in the dataset). I get a similar error when I try to use the cosine function from package lsa. Is there a way to use parallel processing when calling external/non-base functions in R?
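
For reference, here is a minimal sketch of what a function like desparsify could look like, based only on the description above (the real implementation is not shown; full_x is a hypothetical argument for the full set of x locations):

desparsify <- function(x, y, full_x = min(x):max(x)) {
  ## Hypothetical sketch only: pad y with zeros at x locations
  ## that are absent from the data
  out <- setNames(rep(0, length(full_x)), full_x)
  out[as.character(x)] <- y               # keep the observed values
  data.frame(x = full_x, y = unname(out))
}

desparsify(c(60, 63, 65), c(7, 105, 1000), full_x = 60:65)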

Upvotes: 0

Views: 720

Answers (2)

HenrikB

Reputation: 6805

You don't show how you set up plyr to parallelize, but I think I can guess what you're doing. I also guess you're on Windows. Here's a teeny standalone example illustrating what's going on:

library(plyr)

## On Windows, doParallel::registerDoParallel(2) becomes:
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
## Error in do.ply(i) : 
##  task 1 failed - "could not find function "desparsify""
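
One workaround that stays with doParallel is to export the function explicitly. This is a sketch relying on plyr's .paropts argument, which forwards options such as .export to the underlying foreach() call:

## Sketch: explicitly export 'desparsify' to the PSOCK workers
y <- plyr::llply(1:3, function(x) desparsify(x),
                 .parallel = TRUE,
                 .paropts = list(.export = "desparsify"))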

If you use doFuture instead of doParallel, the underlying future framework will make sure 'desparsify' is found, e.g.

library(plyr)

doFuture::registerDoFuture()
future::plan("multisession", workers = 2)

desparsify <- function(x) sqrt(x)
y <- plyr::llply(1:3, function(x) desparsify(x), .parallel=TRUE)
str(y)
## List of 3
##  $ : num 1
##  $ : num 1.41
##  $ : num 1.73

(disclaimer: I'm the author of the future framework)

PS. Note that plyr is a legacy package no longer maintained. You might want to look into future.apply, furrr, or foreach with doFuture as alternatives for parallelization.
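
For instance, a minimal future.apply version of the toy example above might look like this (a sketch; future_lapply() identifies globals such as desparsify and exports them to the workers automatically):

library(future.apply)
future::plan("multisession", workers = 2)

desparsify <- function(x) sqrt(x)
## No manual export needed; globals are detected automatically
y <- future_lapply(1:3, function(x) desparsify(x))
str(y)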

Upvotes: 2

G.Fernandes

Reputation: 321

There is. Take a look at the parApply family of functions; I usually use parLapply.

You'll need to create a cluster with cl <- makeCluster(<number of cores>), then pass it to parLapply() together with a vector of your ids (how you build this may depend on how your function identifies the entries for each id) and your function. It returns a list with the output of your function applied to each group in parallel:

library(parallel)

cl <- makeCluster(2)   # use your number of cores
ids <- unique(d$id)
## Export the variables/functions the workers need:
clusterExport(cl = cl, varlist = c("d", "func_args", "getFeatures", "desparsify"))

result <- parLapply(cl = cl, ids, function(i) getFeatures(d[d$id == i, ], func_args))
stopCluster(cl)
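
Since parLapply() returns a list, you can combine the per-id results back into a single data frame afterwards, e.g.:

## Combine the per-id results into one data frame
x <- do.call(rbind, result)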

Upvotes: 1
