Reputation: 35
I am trying to convert a for loop which I am currently using to run a process across a large matrix. The current for loop finds the maximum value within a 30 x 30 section and creates a new matrix with the maximum value.
The current code for the for loop looks like this:
mat <- as.matrix(CHM) # CHM is the original raster image
maxm <- matrix(nrow=nrow(mat)/30, ncol=ncol(mat)/30) # create new matrix with new dimensions
for(i in 1:dim(maxm)[1]) {
  for(j in 1:dim(maxm)[2]) {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    maxm[i,j] <- max(CHM[row:(row + 29), col:(col + 29)])
  }
}
I want to convert this to a foreach loop to use parallel processing. I've got as far as the following code, but it doesn't work. I'm not sure how to build the new matrix within the foreach loop:
ro<-nrow(mat)/30
co<-ncol(mat)/30
maxm <- matrix(nrow=nrow(mat)/30, ncol=ncol(mat)/30)
foreach(i=ro, .combine='cbind') %:%
foreach(j=co, .combine='c') %dopar% {
row <- 30 * (i - 1) + 1
col <- 30 * (j - 1) + 1
maxm[i,j]<-(max(CHM[row:(row + 29), col:(col + 29)]))
}
Any suggestions please!
Upvotes: 2
Views: 1787
Reputation: 8572
Before running anything in parallel, first check whether the computation can be vectorized. Once that is done, ask whether parallelization is actually worthwhile.
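As a rough illustration of what "vectorizing first" could look like for a block-wise maximum, here is a minimal base R sketch. The function name block_max and the parameter b are my own (hypothetical) names, not from the question, and the sketch assumes that trailing rows/columns which do not fill a whole block can be dropped. It reshapes the matrix into a 4-dimensional array so that each block gets its own pair of indices, then takes the maximum over the within-block dimensions.

```r
# Hypothetical helper: maximum of each non-overlapping b x b block of a matrix.
block_max <- function(mat, b = 30) {
  n_block_rows <- nrow(mat) %/% b
  n_block_cols <- ncol(mat) %/% b
  # Assumption: drop trailing rows/columns that do not fill a whole block
  mat <- mat[seq_len(n_block_rows * b), seq_len(n_block_cols * b), drop = FALSE]
  # Dimensions: (row within block, block row, column within block, block column)
  arr <- array(mat, dim = c(b, n_block_rows, b, n_block_cols))
  # Maximum over the two within-block dimensions
  apply(arr, c(2, 4), max)
}

set.seed(1)
dat <- matrix(runif(90 * 60), nrow = 90)
maxm <- block_max(dat, b = 30)
dim(maxm)
```

apply still loops internally, so this is not guaranteed to beat a well-written for loop, but it is worth timing before reaching for a cluster.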
In this specific example, parallelization is unlikely to be as fast as you expect, because each iteration saves its output into a common object. R does not generally support this in parallel code. Until you have a better understanding of how parallel code works, stick to so-called 'embarrassingly parallel' problems, where each task returns its own independent result. In short: don't modify shared data in parallel in R unless you know what you're doing. It is unlikely to be faster.
That said, your case is actually quite tricky. You seem to be computing a 'rolling-max window' (with non-overlapping 30 x 30 windows), and the output has to end up in a single combined matrix. An alternative to saving the data directly into the matrix is to return a matrix with 3 columns x, i, j, where the latter two are indices that indicate which row/column the value of x should be placed in.
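To make the (x, i, j) idea concrete, here is a small base R sketch with made-up values (the numbers are purely illustrative). It shows that rows of such a triplet matrix can arrive in any order and still be placed correctly, using base R matrix indexing rather than the Matrix package.

```r
# Illustrative triplets, deliberately out of order
triplets <- rbind(
  c(x = 0.7, i = 2, j = 1),
  c(x = 0.9, i = 1, j = 2),
  c(x = 0.4, i = 1, j = 1),
  c(x = 0.8, i = 2, j = 2)
)

# Allocate the result and fill it using a two-column (row, column) index matrix
maxm <- matrix(NA_real_,
               nrow = max(triplets[, "i"]),
               ncol = max(triplets[, "j"]))
maxm[triplets[, c("i", "j")]] <- triplets[, "x"]
maxm
```

Because every value carries its own position, the order in which workers return their results no longer matters.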
For this to work, as Dmitriy noted in his answer, the data needs to be exported to each worker in the cluster (each parallel session) so that it can be used there. The following example shows how to perform the parallelization.
First: Create a cluster and export the dataset
set.seed(1)
#Generate test example
n <- 3000
dat <- matrix(runif(n^2), ncol = n)
library(foreach)
library(doParallel)
#Create cluster
cl <- parallel::makeCluster(parallel::detectCores())
#Register it for the foreach loop
doParallel::registerDoParallel(cl)
#Export the dataset (could be done directly in the foreach, but this is more explicit)
parallel::clusterExport(cl, "dat")
Next we come to the foreach loop. Note that according to the documentation, nested foreach loops should be separated using the %:% operator, as shown in my example below:
output <- foreach(i = 1:(nrow(dat)/30), .combine = rbind, .inorder = FALSE) %:%
  foreach(j = 1:(ncol(dat)/30), .combine = rbind, .inorder = FALSE) %dopar% {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    c(x = max(dat[row:(row + 29), col:(col + 29)]), i = i, j = j)
  }
Note the .inorder = FALSE. As I return the indices, I don't care about the order of the results, only about speed.
Last but not least, we need to create the matrix. The Matrix package function Matrix::sparseMatrix allows for specifying values and indices.
output <- Matrix::sparseMatrix(output[,"i"], output[,"j"], x = output[,"x"])
This is still rather slow. For n = 3000 it took roughly 6 seconds to perform the calculations, plus a not-insignificant overhead from exporting the data. But it is likely faster than the same method using sequential loops.
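One thing that may reduce the overhead is to give each worker a larger chunk of work, for example a whole row of blocks per task instead of a single block. The following is only a sketch under that assumption, not something I have benchmarked: the names b, n_block_rows and n_block_cols are mine, the matrix is kept small so it runs quickly, and the cluster uses 2 workers. It also stops the cluster at the end, which the code above leaves out.

```r
library(foreach)
library(doParallel)

set.seed(1)
n <- 300   # small test size; assumed to be a multiple of b
b <- 30    # block size
dat <- matrix(runif(n^2), ncol = n)

cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

n_block_rows <- nrow(dat) %/% b
n_block_cols <- ncol(dat) %/% b

# One task per row of blocks; each task returns a full row of the result
maxm <- foreach(i = seq_len(n_block_rows), .combine = rbind) %dopar% {
  rows <- (b * (i - 1) + 1):(b * i)
  vapply(seq_len(n_block_cols), function(j) {
    cols <- (b * (j - 1) + 1):(b * j)
    max(dat[rows, cols])
  }, numeric(1))
}

# Release the worker processes when done
parallel::stopCluster(cl)

dim(maxm)
```

Since each task returns a complete row and the default .inorder = TRUE keeps the rows in sequence, no index columns are needed here and rbind produces the final matrix directly.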
Upvotes: 3
Reputation: 939
Let me try to answer this.
As far as I know, R uses a cluster for parallel computation, and each node works in its own environment. So foreach with %dopar% first copies the data the loop body needs to each cluster node and then runs the loop body there. Nothing is copied back after the code has run; you only get the returned result, as in result = foreach(...) { }. So the code maxm[i,j]<-(max(CHM[row:(row + 29), col:(col + 29)])) on each node changes only that node's local copy of your matrix, nothing more.
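A tiny experiment makes this visible. This is just a sketch with made-up names (x, returned) and a 2-worker cluster: the loop body assigns into x, yet x on the master is untouched afterwards, and only the values returned by the body come back.

```r
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

x <- numeric(3)
returned <- foreach(i = 1:3, .combine = c) %dopar% {
  x[i] <- i * 10  # changes only the worker's copy of x
  x[i]            # the last expression is what gets sent back
}

parallel::stopCluster(cl)

x         # still 0 0 0 on the master
returned  # 10 20 30
```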
So the "correct" code would probably look like this:
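mat <- as.matrix(CHM)
ro <- nrow(mat) / 30
co <- ncol(mat) / 30
# %:% must be followed directly by the inner foreach, not by a { } block.
# Each i yields one row of the result, so the outer loop combines with rbind.
maxm <- foreach(i = 1:ro, .combine = 'rbind') %:%
  foreach(j = 1:co, .combine = 'c') %dopar% {
    row <- 30 * (i - 1) + 1
    col <- 30 * (j - 1) + 1
    # The value of the last expression is returned to the master
    max(mat[row:(row + 29), col:(col + 29)])
  }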
You may also need to call as.matrix on maxm.
Upvotes: 0