jf328

Reputation: 7351

R parallel shared memory object (windows)

I have a big data.table. Each parallel process reads from it, processes the data, and returns a much smaller data.table. I don't want the big DT to be copied to every process, but it seems the %dopar% construct in the foreach package has to copy it.

Is there a way to have the object shared across all processes (on Windows), perhaps using a package other than foreach?

Example code

library(doParallel)
library(data.table)

cluster = makeCluster(4)
registerDoParallel(cluster)

M = 1e4 # make this larger
dt = data.table(x = rep(LETTERS, M), y = rnorm(26*M))

# .packages loads data.table on each worker; dt itself is auto-exported
# (i.e. copied) to every worker, which is exactly the problem
res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind,
              .packages = "data.table") %dopar% {
  dt[, .(trimmean = mean(y, trim = trim)), by = x][, trim := trim]
}

stopCluster(cluster)

(I'm not interested in a better way of doing this in data.table without parallelism. This example just shows the case where subprocesses need to read all the data, but never modify it.)

Upvotes: 3

Views: 1319

Answers (1)

Steve Weston

Reputation: 19677

Since R isn't multithreaded, parallel workers are implemented as processes in the various parallel programming packages. One of the features of processes is that their memory is protected from other processes, so programs have to use special mechanisms to share memory between processes, such as memory-mapped files. Since R doesn't have direct, built-in support for any such mechanism, packages such as "bigmemory" have been written that let you create objects that can be shared between processes. Unfortunately, the "data.table" package doesn't support such a mechanism, so I don't think there is a way to do what you want.
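
For what it's worth, here is a minimal sketch of the "bigmemory" approach, assuming your data can be reduced to a purely numeric matrix. A big.matrix holds a single atomic type, so the character column is recoded as integer codes here; the recoding and the tapply aggregation are my own illustration, not your data.table code. Only the small descriptor is sent to the workers; the matrix itself lives in OS shared memory.

library(doParallel)
library(bigmemory)

cluster = makeCluster(4)
registerDoParallel(cluster)

M = 1e4
# Column 1: integer codes 1..26 standing in for LETTERS; column 2: the y values
bm = as.big.matrix(cbind(rep(1:26, M), rnorm(26*M)), type = "double")
desc = describe(bm)  # small descriptor object; cheap to send to each worker

res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind,
              .packages = "bigmemory") %dopar% {
  m = attach.big.matrix(desc)  # attaches to the shared memory, no data copy
  data.frame(x = LETTERS,
             trimmean = tapply(m[, 2], m[, 1], mean, trim = trim),
             trim = trim)
}

stopCluster(cluster)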

Note that memory can be shared read-only between a process and a forked child process on POSIX operating systems (such as Mac OS X and Linux), so you could sort of do what you want using the "doMC" backend, but that doesn't work on Windows, of course.
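
For completeness, a sketch of that fork-based variant (my own adaptation of your code; it only runs on Mac OS X/Linux):

library(doMC)
library(data.table)
registerDoMC(4)  # forked workers instead of a PSOCK cluster

M = 1e4
dt = data.table(x = rep(LETTERS, M), y = rnorm(26*M))

# Forked children inherit dt via copy-on-write, so the big table is not
# serialized to the workers as long as they only read it
res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind) %dopar% {
  dt[, .(trimmean = mean(y, trim = trim)), by = x][, trim := trim]
}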

Upvotes: 3
