Reputation: 111
I'm using doMPI in R to parallelize saving netCDF climate data. The data is stored in R as a 4-dimensional array m, holding 6 variables at 20000 timepoints over a latitude/longitude grid, so m is indexed as m[lon,lat,time,variable]. Because of how netCDF lays its data out on disk, the most efficient way to write is one timeslice at a time. Consequently, I'd like to iterate over m one timeslice at a time, for each variable. Currently, my code looks like this:
ntime <- 20000
output.vars <- list("rainfall", "snowfallwateq", "snowmelt", "newsnow", "snowdepth", "swe")
for (var.index in seq_along(output.vars)) {
    ncout <- nc_open(output.files[var.index], write=TRUE)
    val <- foreach(time.index=1:ntime, .packages=c("ncdf4")) %dopar% {
        ncvar_put(ncout, output.vars[[var.index]],
                  vals=m[,,time.index,var.index],
                  start=c(1, 1, time.index),
                  count=c(nlon, nlat, 1))
    }
    nc_close(ncout)
}
This unnecessarily copies the entire array m to every worker, which takes a prohibitively large amount of memory, so I need to reduce how much data gets copied. Based on this answer, my idea was to iterate over the array one timeslice at a time, so that only the data for a single timeslice is copied to each worker per iteration. Since foreach can iterate over several objects simultaneously, I could even have the time index alongside the timeslice without a problem. Unfortunately, I don't know of any way to iterate over an array by timeslice. Is there a way to do so, such that at each iteration t of the foreach loop for variable var, I have a variable data holding the 2-dimensional matrix m[,,t,var]?
I've already tried the intuitive approach below, but it iterates over each individual element rather than over an entire timeslice at a time.
val <- foreach(time.index=1:ntime, slice=m[,,,var], ...
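In sequential terms, ignoring the parallelism, what I want each iteration to receive is just the 2-d slice for one timepoint, nothing more:
# sequential illustration of what each %dopar% iteration should receive:
# the nlon x nlat slice for one timepoint of one variable
for (time.index in 1:ntime) {
    data <- m[,,time.index,var.index]   # 2-d matrix with dim c(nlon, nlat)
    ncvar_put(ncout, output.vars[[var.index]], vals=data,
              start=c(1, 1, time.index), count=c(nlon, nlat, 1))
}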
Upvotes: 1
Views: 225
Reputation: 5059
If you can massage your data in the main R process, you could try transforming each 2-dimensional slice into a big.matrix from the bigmemory package, and use that in your parallel workers. This is only worthwhile if the time required to process each slice in the worker process is significant. See this example, and note that you can nest two foreach loops with %:%:
library(foreach)
library(bigmemory)
# assumes a parallel backend (e.g. doMPI) has already been registered

m <- as.numeric(1:16)
dim(m) <- rep(2L, 4L)

# use %do% for sequential processing, without copying the data to parallel workers
big_m <- foreach(i=1L:2L, .combine=c) %:% foreach(j=1L:2L, .combine=list) %do% {
    as.big.matrix(m[,,i,j], type="double")
}

descriptors <- lapply(big_m, describe)

# specify .noexport to avoid copying the data to each worker
foreach(m_slice_desc=descriptors, .packages=c("bigmemory"), .noexport=ls(all.names=TRUE)) %dopar% {
    # you could even modify the slices in parallel if you wanted
    m_slice <- attach.big.matrix(m_slice_desc)
    for (i in 1L:2L) {
        for (j in 1L:2L) {
            m_slice[i,j] <- m_slice[i,j] * 2
        }
    }
    # return nothing
    NULL
}

# just to show that the values were modified in place
for (bm in big_m) { print(bm[,]) }
     [,1] [,2]
[1,]    2    6
[2,]    4    8
     [,1] [,2]
[1,]   18   22
[2,]   20   24
     [,1] [,2]
[1,]   10   14
[2,]   12   16
     [,1] [,2]
[1,]   26   30
[2,]   28   32
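Adapted to your netCDF case, the same idea might look roughly like the untested sketch below. It reuses ntime, nlon, nlat, output.files, output.vars and your ncvar_put call, and only the small big.matrix descriptors (not m itself) are serialized to the workers. Two caveats: bigmemory's shared memory is only visible to workers on the same machine, so with doMPI spanning several nodes you would need file-backed big matrices instead, and creating one segment per timeslice for ntime = 20000 may hit OS limits, in which case you would allocate and release them in chunks.
library(foreach)
library(bigmemory)
library(ncdf4)

for (var.index in seq_along(output.vars)) {
    ncout <- nc_open(output.files[var.index], write=TRUE)

    # one shared-memory big.matrix per timeslice; describe() returns a small
    # descriptor that is cheap to send to each worker
    slice_descriptors <- lapply(1:ntime, function(t)
        describe(as.big.matrix(m[,,t,var.index], type="double")))

    foreach(time.index=1:ntime, slice_desc=slice_descriptors,
            .packages=c("bigmemory", "ncdf4"),
            .noexport="m") %dopar% {
        # attach the shared slice in the worker instead of receiving a copy of m
        slice <- attach.big.matrix(slice_desc)
        ncvar_put(ncout, output.vars[[var.index]],
                  vals=slice[,],
                  start=c(1, 1, time.index),
                  count=c(nlon, nlat, 1))
        NULL
    }

    nc_close(ncout)
}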
If you can't or won't use bigmemory, or if processing each 2-dimensional slice is too fast (yes, that can be a problem for multi-processing, see this answer), maybe you can extract 3-dimensional slices from your data and use .noexport so that only one is copied at a time, something like:
slices_3d <- lapply(1L:2L, function(i) { m[,,,i] })

foreach(slice_3d=slices_3d, .noexport=ls(all.names=TRUE)) %dopar% {
    for (j in 1L:2L) {
        slice_2d <- slice_3d[,,j]
        # do something
    }
    # return nothing
    NULL
}
I'm actually not 100% sure the above prevents copying the whole of slices_3d to every worker. If it doesn't, you may have to extract subsets in chunks manually in the main R process (e.g. slices_3d[1L:num_parallel_workers] and so on each time), making sure that only one chunk is exported in each call to foreach.
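For illustration, such a manual chunking loop could look like the following untested sketch; num_parallel_workers is assumed to match however many workers you registered:
num_parallel_workers <- 4L  # assumed: set to the number of registered workers
chunk_starts <- seq(1L, length(slices_3d), by=num_parallel_workers)

for (chunk_start in chunk_starts) {
    chunk_end <- min(chunk_start + num_parallel_workers - 1L, length(slices_3d))
    # only this chunk's slices are sent out; everything else stays in the master
    chunk <- slices_3d[chunk_start:chunk_end]
    foreach(slice_3d=chunk, .noexport=ls(all.names=TRUE)) %dopar% {
        for (j in 1L:2L) {
            slice_2d <- slice_3d[,,j]
            # do something
        }
        NULL
    }
}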
Upvotes: 0