Roger

Reputation: 719

Is there a way to prevent parallel::mclapply() from accessing the contents of the global environment?

Can the R function parallel::mclapply() be made to run RAM-efficiently in an interactive RStudio session when large objects reside in the global environment?

When I use mclapply() to run analyses across multiple cores, RAM consumption is consistently far higher (by tens of GB, in my case) in an interactive RStudio session than when I run the exact same code via Rscript. My hunch is that mclapply() duplicates the global environment for each worker (I often have objects tens of gigabytes in size residing in the global environment), whereas supplying only the essential objects to the Rscript call minimises this overhead.

I am using Linux AWS EC2 instances with large amounts of RAM (e.g., 64 GB to 128 GB) and a reasonably large number of CPU cores (e.g., 16 to 32). I often find that running mclapply() with mc.cores = detectCores() - 1 interactively maxes out the RAM almost instantly (it climbs by many tens of GB within seconds), whereas running the exact same code via Rscript uses barely any more RAM than was consumed before mclapply() was called. I have observed this behaviour across a wide range of unrelated analyses, which is why I am not including a reproducible example.
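Schematically, the pattern is just the following (the object name, sizes, and the per-column analysis are placeholders, not a reproducible example):

    library(parallel)

    ## Stand-in for the tens-of-GB objects that live in my global
    ## environment (name and size are placeholders).
    big_data <- matrix(rnorm(1e8), ncol = 100)   # ~800 MB

    ## The parallel call itself; interactively, each forked worker
    ## appears to claim far more RAM than under Rscript.
    results <- mclapply(
      seq_len(ncol(big_data)),
      function(i) summary(big_data[, i]),        # placeholder analysis
      mc.cores = detectCores() - 1
    )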

To run the mclapply() call via Rscript, I first save the necessary data objects to an .rda file, then use system() to invoke a script via Rscript that loads those objects, runs the mclapply() call, and saves the output to a file that can be loaded back into the interactive session.
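Roughly, that workaround looks like this (the file names, object names, and analysis_fun are illustrative):

    ## In the interactive RStudio session: save only what the analysis
    ## needs, run the script in a fresh R process, then load the result.
    save(big_data, analysis_fun, file = "mclapply_input.rda")
    system("Rscript run_mclapply.R")
    load("mclapply_output.rda")   # restores `results`

And the script itself:

    ## run_mclapply.R -- starts with a clean global environment
    library(parallel)
    load("mclapply_input.rda")
    results <- mclapply(
      seq_len(ncol(big_data)),
      function(i) analysis_fun(big_data[, i]),
      mc.cores = detectCores() - 1
    )
    save(results, file = "mclapply_output.rda")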

Is this a widely known problem? If it is caused by mclapply() copying the global environment to each worker, is there a way to ensure that the workers can only access the variables necessary for the analysis?

Upvotes: 0

Views: 298

Answers (0)
