Reputation: 139
I'm trying to parallelize a process that runs over the rows of a matrix. For each element (row), which is a species, it should extract and write a raster file corresponding to that species' distribution over its habitats.
The habitats layer is a raster file and each species' distribution is a polygon (or group of polygons) from a shapefile. I first convert the species polygon(s) to a raster, then extract the habitats for the species (stored in a table where the species' habitat codes are matched with the habitats raster values), and finally intersect (multiply) the distribution and the habitats.
In addition, I would like to produce a richness (number of species) map as a raster file. To do this, I add (sum) each final species distribution to an empty raster (with values of zero). I wrote the following function:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #Raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of the species' polygons
  rows=which(as.character(spp_polygons@data$binomial)==
               as.character(spp_data$binomial[k]))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitat codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs,split=sep))#habitat codes are separated by a ";"
  #Get the habitat raster values for this species' habitat codes
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distribution with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  return(richness_cur)
}
I tried to parallelize the process using the clusterApply and foreach functions. However, with neither of them could I return a raster object from the function (something that is straightforward in a regular loop) in order to accumulate the species richness. So here is my first question. 1. Does anyone know how to return an object other than a list, matrix, or vector from a parallelized process?
I solved this problem by writing the richness file in each "iteration". Nevertheless, this makes the process slower, so it is not ideal for me. The function was then rewritten as follows:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #Raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of the species' polygons
  rows=which(as.character(spp_polygons@data$binomial)==
               as.character(spp_data$binomial[k]))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitat codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs,split=sep))#habitat codes are separated by a ";"
  #Get the habitat raster values for this species' habitat codes
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distribution with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  writeRaster(richness_cur,"richness_current.tif",overwrite=TRUE)#the file already exists, so it must be overwritten
}
The complete code to run the parallelization was:
#Libraries
library(parallel)
library(doParallel)
library(foreach)
library(xlsx)
library(rgdal)
library(raster)
#Number of cores
no_cores=detectCores()-1
#Initiate cluster
cl=makeCluster(no_cores,type="PSOCK")
registerDoParallel(cl)
#Table with name and habitat information (columns) for each species (rows)
spp_data=read.xlsx2("species_file.xls",sheetIndex=1)
#Shapefile with species distributions as polygons
spp_polygons=readOGR("distributions.shp")
#Separation symbol for the species habitats stored in spp_data
sep=";"
#Table joining species habitat codes with the habitats raster values
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)
#Habitats raster
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif")
#Parallelization
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k,
                                                     spp_data=spp_data,
                                                     spp_polygons=spp_polygons,
                                                     sep=sep,
                                                     habitat_codes=habitat_codes,
                                                     cover=cover)
stopImplicitCluster()
stopCluster(cl)
The parallelization process runs; however, it does not work the way I expected, since it is not using all the cores (image of processors working). It starts 39 processes, one per core (image of processes opened), but it does not write the files one by one, as a regular loop would. Instead, it writes blocks of 39 files at once (something I can understand), and it takes a lot of time (it seems to work on only a few cores), even more than a regular loop: in the regular loop a file is written every two or three minutes, while a block of 39 files is written roughly every hour.
So here is my second group of questions. 2. What am I doing wrong? 3. Why is it not using all 39 processors, or if it is using them, why does it not use them at full capacity? 4. Why does it not begin another task as soon as it finishes one (I assume this is the case because it always writes the files in blocks of 39)?
Thanks in advance for your help.
Cheers,
Jaime
Upvotes: 1
Views: 909
Reputation: 9705
Does anyone know how to return an object other than a list, matrix, or vector from a parallelized process?
To your first question: the limitation you describe doesn't really exist. What kind of object do you want to return? A list can contain any R object.
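For instance, foreach collects whatever each iteration returns into a list by default, and you can pass a .combine function to reduce the results as they arrive. A rough sketch, assuming your extract_habitats() is changed to return only the per-species raster (cov_mask) instead of reading and writing the running richness file:

library(raster)
library(doParallel)

cl=makeCluster(detectCores()-1)
registerDoParallel(cl)

#Each iteration returns a RasterLayer; .combine="+" sums them as they come
#back, so the loop itself yields the richness raster (the returned rasters
#have to fit in memory; otherwise return file names instead)
richness_cur=foreach(k=1:nrow(spp_data),.combine="+",
                     .packages=c("raster","rgdal")) %dopar%
  extract_habitats(k,spp_data,spp_polygons,sep,habitat_codes,cover)

writeRaster(richness_cur,"richness_current.tif",overwrite=TRUE)
stopCluster(cl)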
Why is it not using all 39 processors, or if it is using them, why does it not use them at full capacity?
There are many potential reasons. From looking at your code, one is that the process may be limited by disk IO, since you are writing lots of images to disk. Another potential reason is a memory size limitation.
What am I doing wrong?
If you are using Linux (or anything non-Windows), you should be using the mclapply function from the base R parallel package instead.
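With mclapply the worker processes are forks of the current session, so the data you have loaded is visible to them without exporting anything. A sketch, assuming the same arguments as in your code and an extract_habitats() that returns the per-species raster:

library(parallel)
library(raster)

#Extra named arguments are passed through to extract_habitats()
res=mclapply(1:nrow(spp_data),extract_habitats,
             spp_data=spp_data,spp_polygons=spp_polygons,sep=sep,
             habitat_codes=habitat_codes,cover=cover,
             mc.cores=detectCores()-1)

#Sum the per-species rasters into the richness raster afterwards
richness_cur=Reduce("+",res)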
Upvotes: 1