Reputation: 3805
I have a folder containing 5000 CSV files, each belonging to one location and holding daily rainfall from 1980 to 2015. The sample structure of a file is as follows:
sample.file <- data.frame(location.id = rep(1001, times = 365 * 36),
                          year = rep(1980:2015, each = 365),
                          day = rep(1:365, times = 36),
                          rainfall = sample(1:100, size = 365 * 36, replace = TRUE))
I want to read each file, calculate the total rainfall for each year, and write the output back out. There are multiple ways I can do this:
library(data.table)
library(dplyr)

# names.vec holds the file names without the .csv extension
for (i in seq_along(names.vec)) {
  name <- names.vec[i]
  dat <- fread(paste0(name, ".csv"))
  dat <- dat %>% dplyr::group_by(year) %>% dplyr::summarise(tot.rainfall = sum(rainfall))
  fwrite(dat, paste0(name, ".summary.csv"), row.names = FALSE)
}
my.files <- list.files(pattern = "\\.csv$")
dat <- lapply(my.files, fread)
dat <- rbindlist(dat)
dat.summary <- dat %>% dplyr::group_by(location.id, year) %>%
  dplyr::summarise(tot.rainfall = sum(rainfall))
I want to achieve this using foreach. How can I parallelise the above task using doParallel and foreach?
Upvotes: 0
Views: 257
Reputation: 166
Below is a skeleton for your foreach request.
require(foreach)
require(doSNOW)
require(iterators)

cl <- makeCluster(10,           # number of cores; don't use all the cores your computer has
                  type = "SOCK")  # SOCK works on Windows; FORK only on Linux/macOS
registerDoSNOW(cl)

clusterExport(cl, c("toto", "truc"), envir = environment())  # R objects needed by each core ("toto" and "truc" are placeholders)
clusterEvalQ(cl, library(tcltk))                             # packages needed on each core

my.files <- list.files(pattern = "\\.csv$")

foreach(i = icount(length(my.files)), .combine = rbind, .inorder = FALSE) %dopar% {
  # read csv file
  # estimate total rain
  # write output
}

stopCluster(cl)
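A sketch of how the loop body could be filled in, reusing the fread / group_by / fwrite calls from the question (untested; the .summary.csv output naming simply mirrors the question), could be:

foreach(file = my.files, .combine = rbind, .inorder = FALSE,
        .packages = c("data.table", "dplyr")) %dopar% {
  dat <- fread(file)                                  # read csv file
  out <- dat %>%
    group_by(location.id, year) %>%
    summarise(tot.rainfall = sum(rainfall))           # estimate total rain per year
  fwrite(out, sub("\\.csv$", ".summary.csv", file))   # write output
  out                                                 # returned rows are combined with rbind
}

Grouping by location.id as well keeps each location identifiable in the combined result, and the same loop runs unchanged if the cluster is registered with doParallel::registerDoParallel(cl), which is the backend the question mentions.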
But parallelisation really pays off only when the computation time (CPU) per independent iteration is higher than the cost of the remaining operations. In your case the improvement may be small, because each core needs drive access for both reading and writing, and since writing is a physical operation it can be better to do it sequentially (safer for the hardware, and possibly more efficient, because each file then gets its own independent location on the drive rather than several files sharing one location and needing extra indexing by the OS to tell them apart -- this last point would need confirmation, it is just a thought).
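If the writes do turn out to be the limiting factor, one variant (a sketch under the same assumptions as above) is to let the workers only read and summarise, and to do all the writing sequentially in the master session:

# workers read and summarise; with the default .inorder = TRUE the results
# come back as a list in the same order as my.files
summaries <- foreach(file = my.files,
                     .packages = c("data.table", "dplyr")) %dopar% {
  fread(file) %>%
    group_by(location.id, year) %>%
    summarise(tot.rainfall = sum(rainfall))
}

# the master process writes the outputs one after the other
for (i in seq_along(summaries)) {
  fwrite(summaries[[i]], sub("\\.csv$", ".summary.csv", my.files[i]))
}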
HTH
Bastien
Upvotes: 2
Reputation: 711
The pbapply package is the easiest parallelisation approach:
library(pbapply)
library(parallel)    # for makeCluster()
library(data.table)  # for fread()
mycl <- makeCluster(4)
mylist <- pblapply(my.files, fread, cl = mycl)  # read all files in parallel, with a progress bar
stopCluster(mycl)
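Reading is only half of the task, though. A sketch (untested; the helper name summarise.one and the output naming are my own) of running the full read / summarise / write step per file with pblapply could look like this:

library(pbapply)
library(parallel)
library(data.table)
library(dplyr)

my.files <- list.files(pattern = "\\.csv$")

# hypothetical helper: summarise one file and write the result next to it
summarise.one <- function(file) {
  dat <- fread(file)
  out <- dat %>% group_by(year) %>% summarise(tot.rainfall = sum(rainfall))
  fwrite(out, sub("\\.csv$", ".summary.csv", file))
  out
}

mycl <- makeCluster(4)
clusterEvalQ(mycl, { library(data.table); library(dplyr) })  # load packages on every worker
res <- pblapply(my.files, summarise.one, cl = mycl)          # parallel, with a progress bar
stopCluster(mycl)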
Upvotes: 0