R - how to nest data frame over a group in parallel cores

Question

Any idea how I can run the following operation in parallel across multiple cores?

libraries and sample data

libs <- c("plyr", "dplyr", "tidyr")
sapply(libs, require, character.only = TRUE)

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE), value = runif(100000))

operation to run in parallel cores:

df %>% 
  group_by(id) %>% 
  nest()

Any help would be very appreciated!

dule arnaux · Accepted Answer

Using multiple cores to simply nest a data.frame wouldn't be efficient. So I assume you want to perform some other calculation. The example below calculates the summary, which will have several values for each group id.

The multidplyr package is convenient for this kind of thing.

# replace plyr with multidplyr
libs <- c("dplyr", "tidyr",'multidplyr')
devtools::install_github("hadley/multidplyr")
sapply(libs, require, character.only = TRUE)

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE), 
                 value = runif(100000))%>%as.tbl

# first the single core solution. No need to nest, 
# since group_by%>%do() automatically nests.
x<-df%>% 
  group_by(id)%>%
  # nest()%>%
  do(stat_summary=summary(.$value)%>%as.matrix%>%t%>%data.frame%>%as.tbl)%>%
  ungroup  

# next, multiple core solution
n_cores<-2
cl<-multidplyr::create_cluster(n_cores)
# you have to load the packages into each cluster
cluster_library(cl,c('dplyr','tidyr')) 
df_mp<-df%>%multidplyr::partition(cluster = cl,id) # group by id

x_mp<-df_mp%>% 
  do(stat_summary=summary(.$value)%>%as.matrix%>%t%>%data.frame%>%as.tbl)%>%
  collect()%>%
  ungroup

The results match. You probably won't get much of a speedup unless you're doing a calculation that takes longer than loading the data into each of the separate processes.

all.equal(unnest(x_mp),unnest(x))
x_mp

TRUE
# A tibble: 10 x 2
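      id     stat_summary
   <int>           <list>
 1     3 <tibble [1 x 6]>
 2     5 <tibble [1 x 6]>
 3     6 <tibble [1 x 6]>
 4     7 <tibble [1 x 6]>
 5     1 <tibble [1 x 6]>
 6     2 <tibble [1 x 6]>
 7     4 <tibble [1 x 6]>
 8     8 <tibble [1 x 6]>
 9     9 <tibble [1 x 6]>
10    10 <tibble [1 x 6]>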
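
If installing multidplyr is not an option, the same split-by-group idea can be sketched with the base `parallel` package that ships with R. This is only an illustration under assumptions, not part of the multidplyr approach above: the names `groups`, `res_single` and `res_multi` are made up for the example, and two worker processes are assumed to be available. As with multidplyr, the cost of shipping the data to the workers will likely outweigh any gain for a calculation as cheap as `summary()`.

```r
library(parallel)

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE), value = runif(100000))

# Split the value column into one vector per id
# (a base R analogue of nesting by group).
groups <- split(df$value, df$id)

# Single-core reference result.
res_single <- lapply(groups, summary)

# Start two local worker processes (assumed count) and let each
# one summarise a share of the groups.
cl <- makeCluster(2)
res_multi <- parLapply(cl, groups, summary)
stopCluster(cl)

# Compare the single-core and multi-core results.
all.equal(res_single, res_multi)
```

For a real workload, replace `summary` with the slower per-group function; that is where the extra processes have a chance of paying off.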
