Reputation: 5467
I'm trying to figure out how to run the dplyr::do
function in parallel. After reading some of the docs, it seems that dplyr::init_cluster() should be sufficient for telling do() to run in parallel. Unfortunately, this doesn't seem to be the case when I test it:
library(dplyr)
test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
init_cluster()
system.time({
  test %>%
    group_by(b) %>%
    do({
      Sys.sleep(3)
      data_frame(c = rep(max(.$a), times = max(.$a)))
    })
})
stop_cluster()
Gives this output:
Initialising 2 core cluster.
|==========================================================================|100% ~0 s remaining
user system elapsed
0.03 0.00 6.03
I would expect the elapsed time to be about 3 seconds if the do() call were split between the two cores. I can also confirm that it isn't parallel by adding a print() inside the do(), which prints in the main R terminal. What am I missing here?
I'm using dplyr 0.4.2 with R 3.2.1.
Upvotes: 19
Views: 7532
Reputation: 21621
As mentioned by @Maciej, you could try multidplyr:
## Install from github
devtools::install_github("hadley/multidplyr")
Use partition() to split your dataset across multiple cores:
library(dplyr)
library(multidplyr)
test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
test1 <- partition(test, a)
This initializes a 3-core cluster (one for each value of a):
# Initialising 3 core cluster.
Then simply perform your do() call:
test1 %>%
  do({
    dplyr::data_frame(c = rep(max(.$a), times = max(.$a)))
  })
Which gives:
#Source: party_df [6 x 2]
#Groups: a
#Shards: 3 [1--3 rows]
#
#      a     c
#  (int) (int)
#1     1     1
#2     2     2
#3     2     2
#4     3     3
#5     3     3
#6     3     3
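Note that the result is still sharded across the worker processes. A minimal sketch of pulling everything back into a single local data frame, assuming collect() works on a party_df the way it does on other dplyr remote sources:

```r
library(dplyr)
library(multidplyr)

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])

local_result <- test %>%
  partition(a) %>%
  do({
    dplyr::data_frame(c = rep(max(.$a), times = max(.$a)))
  }) %>%
  collect()  # copy the shards back into the master session

local_result
```

After collect() the result is an ordinary grouped data frame, so it can be passed on to any regular dplyr verbs.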
Upvotes: 26
Reputation: 676
According to https://twitter.com/cboettig/status/588068454239830017, this feature does not currently appear to be supported.
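Until built-in support lands, one manual workaround is to split the groups yourself and farm them out with base R's parallel package. A sketch under the assumption that your per-group code has no side effects (makeCluster()/parLapply() are base R; the body mirrors the question's do() block):

```r
library(parallel)
library(dplyr)

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])

cl <- makeCluster(2)
clusterEvalQ(cl, library(dplyr))  # load dplyr on each worker

# Split by the grouping variable and run each group on a worker.
groups <- split(test, test$b)
res <- parLapply(cl, groups, function(d) {
  Sys.sleep(3)  # the slow per-group step now runs concurrently
  data_frame(c = rep(max(d$a), times = max(d$a)))
})
stopCluster(cl)

bind_rows(res)
```

With two workers, the two Sys.sleep(3) calls overlap, so the elapsed time drops to roughly 3 seconds instead of 6.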
Upvotes: 5