Derek Corcoran
Derek Corcoran

Reputation: 4092

Error with parallelization in R for S4 objects

I am trying to optimize a function that I am going to do with a several rasters with millions of cells, so I want to parallelize this function.

The initial Raster

So this is the initial raster:

library(raster)
SPA <- raster(nrows=3, ncols=3, xmn = -10, xmx = -4, ymn = 4, ymx = 10)

values(SPA) <- c(0.1, 0.4, 0.6, 0, 0.2, 0.4, 0, 0.1, 0.2)

plot(SPA)

enter image description here

The objective of the function is to get a dataframe with the distance between all of the cells present in the raster with a column from, a column to, and a column distance.

Transition layer

in order to do that I create a transition layer using the gdistance package:

library(gdistance)
h16  <- transition(SPA, transitionFunction=function(x){1},16,symm=FALSE) 
h16   <- geoCorrection(h16, scl=FALSE)

and the origin points for every cell:

B <- xyFromCell(SPA, cell = 1:ncell(SPA))
head(B)

      x y
[1,] -9 9
[2,] -7 9
[3,] -5 9
[4,] -9 7
[5,] -7 7
[6,] -5 7

Distance function

With some help from some stackoverflow answers I made this function which is faster than the accCost one in gdistance

accCost2 <- function(x, fromCoords) {

  fromCells <- cellFromXY(x, fromCoords)
  tr <- transitionMatrix(x)
  tr <- rBind(tr, rep(0, nrow(tr)))
  tr <- cBind(tr, rep(0, nrow(tr)))
  startNode <- nrow(tr)
  adjP <- cbind(rep(startNode, times = length(fromCells)), fromCells)
  tr[adjP] <- Inf
  adjacencyGraph <- graph.adjacency(tr, mode = "directed", weighted = TRUE)
  E(adjacencyGraph)$weight <- 1/E(adjacencyGraph)$weight
  return(shortest.paths(adjacencyGraph, v = startNode, mode = "out")[-startNode])
}

What I want to parallelize

And using apply I get my desired data.frame

connections <- data.frame(from = rep(1:nrow(B), each = nrow(B)),to = rep(1:nrow(B), nrow(B)), dist =as.vector(apply(B,1, accCost2, x = h16)))

head(connections)

  from to     dist
1    1  1      0.0
2    1  2 219915.7
3    1  3 439831.3
4    1  4 221191.8
5    1  5 312305.7
6    1  6 493316.1

This is what I tried with parApply

library("parallel")
cl = makeCluster(3)
clusterExport(cl, c("B", "h16", "accCost2"))
clusterEvalQ(cl, library(gdistance), library(raster))

connections <- data.frame(from = rep(1:nrow(B), each = nrow(B)),to = rep(1:nrow(B), nrow(B)), dist =as.vector(parRapply(cl, B,1, accCost2, x = h16)))

stopCluster(cl)

But I get the following error:

Error in x[i, , drop = FALSE] : object of type 'S4' is not subsettable

I am fairly new in parallelization, and I am not sure what I am doing wrong

Upvotes: 1

Views: 285

Answers (1)

JacobVanEtten
JacobVanEtten

Reputation: 203

There are several syntax issues in your code.

This code works for me.

library("parallel") 

accCost_wrap <- function(x){accCost2(h16,x)}
#Instead of including h16 in the parRapply function, 
#just get it in the node environment    

cl = makeCluster(3)  

clusterExport(cl, c("h16", "accCost2")) 
#B will be "sent" to the nodes through the parRapply function.

clusterEvalQ(cl, {library(gdistance)}) 
#raster is a dependency of gdistance, so no need to include raster here.    

pp <- parRapply(cl, x=B, FUN=accCost_wrap) 

stopCluster(cl)

connections <- data.frame(from = rep(1:nrow(B), each = nrow(B)),  
to = rep(1:nrow(B), nrow(B)),  
dist = as.vector(pp))

Your version of accCost is indeed faster than the version in gdistance. Your version omits the checks to see if your points are within the extent of your transition layer. Proceed with caution.

(You could make your function even faster by taking the cell numbers as input. Also, sending so much data back from each node does not seem very efficient.)

Upvotes: 3

Related Questions