Reputation: 2283
I have a very large data.frame
that I want to apply a fairly complicated function to, calculating a new column. I want to do it in parallel. This is similar to the question posted over on the r listserve, but the first answer is wrong and the second is unhelpful.
I've gotten everything figured out thanks to the parallel
package, except how to put the output back onto the data frame. Here's a MWE that shows what I've got:
library(parallel)
# Example Data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
# Break into list
datagroup <- split(data, factor(data$group))
# execute on each element in parallel
options(mc.cores = detectCores())
output <- mclapply(datagroup, function(x) x$a*x$b)
The result in output
is a list of numeric vectors. I need them in a column that I can append to data
. I've been looking along the lines of do.call(cbind, ...)
, but I have two lists with the same names, not a single list that I'm joining. melt(output)
gets me a single vector, but its rows are not in the same order as data
.
Upvotes: 8
Views: 4567
Reputation: 5252
Compute the mean by group using a multicore process:
library(dplyr)
x <- group_by(iris, Species)
indices <- attr(x,"indices")
labels <- attr(x,"labels")
require(parallel)
result <- mclapply(indices, function(indx){
data <- slice(iris, indx + 1)
## Do something...
mean(data$Petal.Length)
}, mc.cores =2)
out <- cbind(labels,mean=unlist(result))
out
## Species mean
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
Upvotes: 1
Reputation: 21
A bit dated, but this might help.
rbind will kill you in terms of performance if you have many splits.
It's much faster to use the unsplit function.
results <- mclapply( split(data, data$group), function(x) x$a*x$b)
resultscombined <- unsplit (results, data$group)
data$newcol <- resultscombined
Yeah there's a memory hit so depends on what you'd like.
Upvotes: 2
Reputation: 2283
Inspired by @beginneR and our common love of dplyr
, I did some more fiddling and think the best way to make this happen is
rbind_all( mclapply(split(data, data$group), fun(x) as.data.frame(x$a*x$b)))
Upvotes: 0
Reputation: 70266
Converting from comment to answer..
This seems to work:
data <-
do.call(
rbind, mclapply(
split(data, data$group),
function(x){
z <- x$a*x$b
x <- as.data.frame(cbind(x, newcol = z))
return(x)
}))
rownames(data) <- seq_len(nrow(data))
head(data)
# a b group newcol
#1 -0.6482428 1.8136254 a -1.17566963
#2 0.4397603 1.3859759 a 0.60949714
#3 -0.6426944 1.5086339 a -0.96959055
#4 -1.2913493 -2.3984527 a 3.09724030
#5 0.2260140 0.1107935 a 0.02504087
#6 2.1555370 -0.7858066 a -1.69383520
Since you are working with a "very large" data.frame (how large roughly?), have you considered using either dplyr
or data.table
for what you do? For a large data set, performance may be even better with one of these than with mclapply
. The equivalent would be:
library(dplyr)
data %>%
group_by(group) %>%
mutate(newcol = a * b)
library(data.table)
setDT(data)[, newcol := a*b, by=group]
Upvotes: 9
Reputation: 4871
I'm currently unable to download the parallel
package to my computer. Here I post a solution that works for my usual setup using the snow
package for computation in parallel.
The solution simply orders the data.frame at the beginning, then merges the output list calling c()
. See below:
library(snow)
library(rlecuyer)
# Example data
data <- data.frame(a = rnorm(200), b = rnorm(200),
group = sample(letters, 200, replace = TRUE))
data <- data[order(data$group),]
# Cluster setup
clNode <- list(host="localhost")
localCl <- makeSOCKcluster(rep(clNode, 2))
clusterSetupRNG(localCl, type="RNGstream", seed=sample(0:9,6,replace=TRUE))
clusterExport(localCl, list=ls())
# Break into list
datagroup <- split(data, factor(data$group))
output <- clusterApply(localCl, datagroup, function(x){ x$a*x$b })
# Put back and check
data$output <- do.call(c, output)
data$check <- data$a*data$b
all(data$output==data$check)
# Stop cluster
stopCluster(localCl)
Upvotes: 0