Reputation: 139
I'm trying to parallelize a process that runs over the rows of a matrix. For each element (row), which is a species, it should extract and write a raster file corresponding to that species' distribution over its habitats.
The habitats layer is a raster file and each species' distribution is a polygon (or group of polygons) from a shapefile. I first convert the species polygon(s) to a raster, then extract the habitats for the species (stored in a table where the species' habitat codes are matched with the habitats raster values), and finally intersect (multiply) the distribution and the habitats.
In addition, I would like to produce a richness (number of species) map as a raster file. To do this, I add (sum) each final species distribution to an empty raster (with values of zero). I wrote the following function:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #Raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of the species' polygons
  rows=which(as.character(spp_polygons@data$binomial)==
               as.character(spp_data$binomial[k]))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitat codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs,split=sep))#habitat codes are separated by a ";"
  #Get the habitat raster values for this species' habitat codes
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distribution with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  return(richness_cur)
}
I tried to parallelize the process using the clusterApply and foreach functions. However, with neither of them could I return a raster object from the function (something that is straightforward in a regular loop) in order to accumulate the species richness. So here is my first question. 1. Does anyone know how to return an object other than a list, matrix, or vector from a parallelized process?
I solved this problem by writing the richness file in each "iteration". Nevertheless, this makes the process slower, so it is not ideal for me. The function was then rewritten as follows:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover)
{
  #Libraries
  library(rgdal)
  library(raster)
  #Raster file with zeros
  richness_cur=raster("richness_current.tif")
  #Selection of the species' polygons
  rows=which(as.character(spp_polygons@data$binomial)==
               as.character(spp_data$binomial[k]))
  spp_poly=spp_polygons[rows,]
  #Convert polygon(s) to raster
  spp_poly=rasterize(spp_poly,cover,1,background=0)
  #Match species habitat codes with habitats raster values
  habs=as.character(spp_data$hab_code[k])
  habs=unlist(strsplit(habs,split=sep))#habitat codes are separated by a ";"
  #Get the habitat raster values for this species' habitat codes
  cov_classes=as.numeric(as.character(
    habitat_codes[which(as.character(habitat_codes[,1])%in%habs),2]))
  #Intersect species distribution with habitats
  cov_mask=spp_poly*cover
  #Extract species habitats
  cov_mask=Which(cov_mask%in%cov_classes)
  writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
  #Sum species richness
  richness_cur=richness_cur+cov_mask
  writeRaster(richness_cur,"richness_current.tif",overwrite=TRUE)#the file already exists, so it must be overwritten
}
The complete code to run the parallelization was:
#Libraries
library(parallel)
library(doParallel)
library(foreach)
library(xlsx)
library(rgdal)
library(raster)
#Number of cores
no_cores=detectCores()-1
#Initiate cluster
cl=makeCluster(no_cores,type="PSOCK")
registerDoParallel(cl)
#Table with name and habitat information (columns) for each species (rows)
spp_data=read.xlsx2("species_file.xls",sheetIndex=1)
#Shapefile with species distributions as polygons
spp_polygons=readOGR("distributions.shp")
#Separation symbol for the species habitats stored in spp_data
sep=";"
#Table joining species habitat codes with the habitats raster values
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)
#Habitats raster
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif")
#Parallelization
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k,
                                                     spp_data=spp_data,
                                                     spp_polygons=spp_polygons,
                                                     sep=sep,
                                                     habitat_codes=habitat_codes,
                                                     cover=cover)
stopImplicitCluster()
stopCluster(cl)
The parallelization process runs; however, it does not work the way I expected, since it is not using all the cores (image of processors working). It starts 39 processes, one per core (image of processes opened), but it does not write the files one by one, as a regular loop would. Instead, it writes blocks of 39 files at once (something I can understand), and it takes a lot of time (it seems to work on only a few cores), even more than a regular loop: in the regular loop a file is written every two or three minutes, while a block of 39 files is written roughly every hour.
So here is my second group of questions. 2. What am I doing wrong? 3. Why is it not using all 39 processors, or if it is using them, why does it not use them at full capacity? 4. Why does it not begin another task as soon as it finishes one (I assume this is the case because it always writes the files in blocks of 39)?
Thanks in advance for your help.
Cheers,
Jaime
Upvotes: 1
Views: 909
Reputation: 9705
Does anyone know how to return an object other than a list, matrix, or vector from a parallelized process?
To your first question: the limitation you describe doesn't really exist. What kind of object do you want to return? A list can contain any R object.
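For instance, foreach collects whatever each iteration returns into a list by default, and you can pass a .combine function to reduce the results as they arrive. A rough sketch, assuming your extract_habitats() is changed to return only the per-species raster (cov_mask) instead of reading and writing the running richness file:

library(raster)
library(doParallel)

cl=makeCluster(detectCores()-1)
registerDoParallel(cl)

#Each iteration returns a RasterLayer; .combine="+" sums them as they come
#back, so the loop itself yields the richness raster (the returned rasters
#have to fit in memory; otherwise return file names instead)
richness_cur=foreach(k=1:nrow(spp_data),.combine="+",
                     .packages=c("raster","rgdal")) %dopar%
  extract_habitats(k,spp_data,spp_polygons,sep,habitat_codes,cover)

writeRaster(richness_cur,"richness_current.tif",overwrite=TRUE)
stopCluster(cl)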
Why is it not using all 39 processors, or if it is using them, why does it not use them at full capacity?
There are many potential reasons. From looking at your code, one is that the process may be limited by disk IO, since you are writing lots of images to disk. Another potential reason is a memory size limitation.
What am I doing wrong?
If you are using Linux (or anything non-Windows), you should be using the mclapply function from the base R parallel package instead.
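With mclapply the worker processes are forks of the current session, so the data you have loaded is visible to them without exporting anything. A sketch, assuming the same arguments as in your code and an extract_habitats() that returns the per-species raster:

library(parallel)
library(raster)

#Extra named arguments are passed through to extract_habitats()
res=mclapply(1:nrow(spp_data),extract_habitats,
             spp_data=spp_data,spp_polygons=spp_polygons,sep=sep,
             habitat_codes=habitat_codes,cover=cover,
             mc.cores=detectCores()-1)

#Sum the per-species rasters into the richness raster afterwards
richness_cur=Reduce("+",res)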
Upvotes: 1