gregmacfarlane

Reputation: 2283

putting `mclapply` results back onto data.frame

I have a very large data.frame to which I want to apply a fairly complicated function, calculating a new column. I want to do it in parallel. This is similar to a question posted on the R listserv, but the first answer there is wrong and the second is unhelpful.

I've gotten everything figured out thanks to the parallel package, except how to put the output back onto the data frame. Here's an MWE that shows what I've got:

library(parallel)

# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),  
                   group = sample(letters, 200, replace = TRUE))

# Break into list
datagroup <- split(data, factor(data$group))

# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)

The result in output is a list of numeric vectors. I need them in a column that I can append to data. I've been looking along the lines of do.call(cbind, ...), but I have two lists with the same names rather than a single list that I'm joining. melt(output) (from reshape2) gets me a single vector, but its elements are not in the same order as the rows of data.
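For concreteness, here is the shape of the intermediate objects (a quick check, using nothing beyond the MWE above):

# output is a named list with one numeric vector per group;
# its names line up with datagroup, not with the rows of data
identical(names(output), names(datagroup))  # TRUE
lengths(output)                             # one entry per group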

Upvotes: 8

Views: 4567

Answers (5)

Emilio Torres Manzanera

Reputation: 5252

Compute the mean by group using a multicore process:

library(dplyr)
library(parallel)

# group_by() (in older dplyr versions) stores the grouping structure
# as "indices" and "labels" attributes on the grouped data frame
x <- group_by(iris, Species)
indices <- attr(x, "indices")   # list of 0-based row indices, one per group
labels  <- attr(x, "labels")    # one row of group labels per group

result <- mclapply(indices, function(indx) {
  data <- slice(iris, indx + 1) # indices are 0-based, slice() is 1-based
  ## Do something...
  mean(data$Petal.Length)
}, mc.cores = 2)

out <- cbind(labels, mean = unlist(result))
out
##      Species  mean
## 1     setosa 1.462
## 2 versicolor 4.260
## 3  virginica 5.552
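On newer dplyr versions (roughly 0.8 onward) those attributes are no longer set; here is a sketch of the same idea, assuming the exported group_rows() and group_keys() helpers are available:

## Sketch for newer dplyr, where the grouping metadata is exposed
## through group_rows() / group_keys() rather than attributes
library(dplyr)
library(parallel)

x <- group_by(iris, Species)
idx  <- group_rows(x)   # list of (1-based) row indices, one per group
keys <- group_keys(x)   # one row of group labels per group

result <- mclapply(idx, function(i) mean(iris$Petal.Length[i]), mc.cores = 2)
cbind(keys, mean = unlist(result))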

Upvotes: 1

sonamine

Reputation: 21

A bit dated, but this might help.

rbind will kill you in terms of performance if you have many splits.

It's much faster to use the unsplit function.

results <- mclapply(split(data, data$group), function(x) x$a * x$b)
resultscombined <- unsplit(results, data$group)
data$newcol <- resultscombined

There is a memory hit from splitting the data, so it depends on what trade-off you'd like.
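As a quick sanity check (using the MWE data from the question), unsplit() should put the per-group results back in the original row order:

# the reassembled column should match an element-wise computation on data
all.equal(data$newcol, data$a * data$b)  # TRUE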

Upvotes: 2

gregmacfarlane

Reputation: 2283

Inspired by @beginneR and our common love of dplyr, I did some more fiddling and think the best way to make this happen is

 rbind_all(mclapply(split(data, data$group), function(x) as.data.frame(x$a * x$b)))
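rbind_all() has since been deprecated in dplyr in favour of bind_rows(), and, as with the other rbind-based answers, the rows come back in group order, so this only lines up with data if data is sorted by group first. A sketch with the newer verb (newcol is just an illustrative column name):

library(dplyr)
library(parallel)

newcols <- bind_rows(
  mclapply(split(data, data$group),
           function(x) data.frame(newcol = x$a * x$b))
)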

Upvotes: 0

talat

Reputation: 70266

Converting from comment to answer:

This seems to work:

data <- 
  do.call(
    rbind, mclapply(
      split(data, data$group), 
      function(x) {
        z <- x$a * x$b
        x <- as.data.frame(cbind(x, newcol = z))
        return(x)
      }))
rownames(data) <- seq_len(nrow(data))
head(data)
#           a          b group      newcol
#1 -0.6482428  1.8136254     a -1.17566963
#2  0.4397603  1.3859759     a  0.60949714
#3 -0.6426944  1.5086339     a -0.96959055
#4 -1.2913493 -2.3984527     a  3.09724030
#5  0.2260140  0.1107935     a  0.02504087
#6  2.1555370 -0.7858066     a -1.69383520
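One caveat: do.call(rbind, split(...)) stacks the groups in factor-level order, so the result's rows are grouped rather than in data's original order (the rownames() line above only renumbers them). If the original order matters, one option (a sketch) is to carry a row id through the split and reorder at the end:

data$row_id <- seq_len(nrow(data))
data <- do.call(rbind, mclapply(split(data, data$group), function(x) {
  x$newcol <- x$a * x$b
  x
}))
data <- data[order(data$row_id), ]
rownames(data) <- NULL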

Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr or data.table for what you do? For a large data set, performance may be even better with one of these than with mclapply. The equivalent would be:

library(dplyr)
data %>%
  group_by(group) %>%
  mutate(newcol = a * b)

library(data.table) 
setDT(data)[, newcol := a*b, by=group]

Upvotes: 9

SimonG

Reputation: 4871

I'm currently unable to install the parallel package on my computer, so here is a solution that works with my usual setup, using the snow package for parallel computation.

The solution simply orders the data.frame at the beginning, then merges the output list by calling c(). See below:

library(snow)
library(rlecuyer)

# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),  
                   group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]

# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())

# Break into list
datagroup <- split(data, factor(data$group))

output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })

# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b

all(data$output==data$check)

# Stop cluster
stopCluster(localCl)
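For reference, the same order-then-concatenate idea can be written with the parallel package from the question (a sketch, reusing detectCores() as in the original MWE):

library(parallel)
data <- data[order(data$group), ]   # group-sorted, as above
output <- mclapply(split(data, data$group), function(x) x$a * x$b,
                   mc.cores = detectCores())
data$output <- do.call(c, output)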

Upvotes: 0
