Reputation: 489
Suppose I have 8 cores on my computer. I have loaded a 2 GB dataset into RAM and I want each of these workers to read only from that dataset. What I do:
worker.function <- function(rowstoread, dataset)
{
  # read a couple of rows from the dataset (the row indices are passed to the worker)
  # process these rows
  # return the results
}
I was wondering why this incurs a copy of the dataset in each worker, since my workers are only reading from the dataset. They are not modifying anything in it.
Is there any fix for that, or is this inherent to R? Also, would this problem be alleviated if I used a Linux machine, or would a copy of the dataset still occur in each worker?
Upvotes: 2
Views: 551
Reputation: 25454
TL;DR: This can work much better on Linux.
There are two problems here:
R is single-threaded and only supports parallelism at the process level.
Windows doesn't have a "fork" system call, unlike Linux.
If you are on Linux and use a fork-based parallelization backend (e.g., parallel::makeForkCluster()), you may be able to access the dataset in the workers without reloading or copying it.
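As a minimal sketch of the fork-based approach (the dataset here is a small stand-in for the 2 GB object, and worker.function is the hypothetical per-chunk function from the question), mclapply() forks child processes that see the parent's memory copy-on-write, so the dataset is never serialized to the workers:

```r
library(parallel)

# Illustrative stand-in for the large in-memory dataset.
dataset <- data.frame(x = runif(1000))

worker.function <- function(rows) {
  # 'dataset' is visible here without being passed as an argument:
  # the forked child shares the parent's memory copy-on-write.
  sum(dataset$x[rows])
}

# mclapply() forks one child per task (Linux/macOS; on Windows it
# silently falls back to running sequentially).
chunks <- split(seq_len(nrow(dataset)), rep_len(1:4, nrow(dataset)))
res <- mclapply(chunks, worker.function, mc.cores = 4)
total <- Reduce(`+`, res)
```

As long as the workers only read from dataset, no physical copy of its pages is made; pages are duplicated only if a worker writes to them.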
Modern operating systems support multiple threads per process, all of which have access to the same data. All threads in a process must ensure that concurrent data access always leaves the memory in a consistent state, even if multiple threads update the same location. This is usually done with locking mechanisms, but is non-trivial to implement. Some parts of R (e.g., if I remember correctly, the memory allocator) are inherently single-threaded, and so all (interpreted) R code must be, too. The only way to work in parallel with R is to spawn multiple processes.
Each new process on Windows starts "empty" and must load its code and data from external storage. On the other hand, Linux has a "fork" system call, which allows creating a second process that starts with exactly the same memory contents (code and data) as the running process.
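To see where the copy comes from on Windows-style backends, here is a sketch using a socket (PSOCK) cluster, which works on all platforms; the dataset name is again a hypothetical stand-in. Because each socket worker starts as an empty R process, the dataset must be serialized and shipped to every worker explicitly:

```r
library(parallel)

dataset <- data.frame(x = runif(1000))  # stand-in for the large object

cl <- makePSOCKcluster(2)   # socket workers: fresh, "empty" R processes
# The workers start empty, so the dataset has to be serialized and sent
# to each of them -- this is exactly the per-worker copy in question:
clusterExport(cl, "dataset")
res <- parSapply(cl, list(1:500, 501:1000),
                 function(rows) sum(dataset$x[rows]))
stopCluster(cl)
```

With makeForkCluster() on Linux, the clusterExport() step is unnecessary, since the forked workers inherit the parent's memory.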
Upvotes: 3