Reputation: 3805
I have a folder containing 5000 CSV files, each belonging to one location and holding daily rainfall from 1980 to 2015. The sample structure of a file is as follows:
sample.file <- data.frame(location.id = rep(1001, times = 365 * 36),
                          year = rep(1980:2015, each = 365),
                          day = rep(1:365, times = 36),
                          rainfall = sample(1:100, size = 365 * 36, replace = TRUE))
I want to read each file, calculate the total rainfall for each year, and write the output back out. There are multiple ways I can do this:
library(data.table)
library(dplyr)

# names.vec holds the file names without the .csv extension
for (i in seq_along(names.vec)) {
  name <- names.vec[i]
  dat <- fread(paste0(name, ".csv"))
  dat <- dat %>% dplyr::group_by(year) %>% dplyr::summarise(tot.rainfall = sum(rainfall))
  fwrite(dat, paste0(name, ".summary.csv"), row.names = FALSE)
}
my.files <- list.files(pattern = "\\.csv$")
dat <- lapply(my.files, fread)
dat <- rbindlist(dat)
dat.summary <- dat %>% dplyr::group_by(location.id, year) %>%
  dplyr::summarise(tot.rainfall = sum(rainfall))
I want to achieve this using foreach. How can I parallelise the above task using doParallel and foreach?
Upvotes: 0
Views: 257
Reputation: 166
Below is a skeleton for your foreach request.
require(foreach)
require(doSNOW)
require(iterators)

cl <- makeCluster(10,           # number of cores; don't use all the cores your computer has
                  type = "SOCK")  # SOCK works on Windows; FORK only on Linux/macOS
registerDoSNOW(cl)

clusterExport(cl, c("toto", "truc"), envir = environment())  # R objects needed by each core ("toto" and "truc" are placeholders)
clusterEvalQ(cl, library(tcltk))                             # packages needed on each core

my.files <- list.files(pattern = "\\.csv$")

foreach(i = icount(length(my.files)), .combine = rbind, .inorder = FALSE) %dopar% {
  # read csv file
  # estimate total rain
  # write output
}

stopCluster(cl)
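A sketch of how the loop body could be filled in, reusing the fread / group_by / fwrite calls from the question (untested; the .summary.csv output naming simply mirrors the question), could be:

foreach(file = my.files, .combine = rbind, .inorder = FALSE,
        .packages = c("data.table", "dplyr")) %dopar% {
  dat <- fread(file)                                  # read csv file
  out <- dat %>%
    group_by(location.id, year) %>%
    summarise(tot.rainfall = sum(rainfall))           # estimate total rain per year
  fwrite(out, sub("\\.csv$", ".summary.csv", file))   # write output
  out                                                 # returned rows are combined with rbind
}

Grouping by location.id as well keeps each location identifiable in the combined result, and the same loop runs unchanged if the cluster is registered with doParallel::registerDoParallel(cl), which is the backend the question mentions.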
But parallelisation really pays off only when the computation time (CPU) per independent iteration is higher than the cost of the remaining operations. In your case the improvement may be small, because each core needs drive access for both reading and writing, and since writing is a physical operation it can be better to do it sequentially (safer for the hardware, and possibly more efficient, because each file then gets its own independent location on the drive rather than several files sharing one location and needing extra indexing by the OS to tell them apart -- this last point would need confirmation, it is just a thought).
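If the writes do turn out to be the limiting factor, one variant (a sketch under the same assumptions as above) is to let the workers only read and summarise, and to do all the writing sequentially in the master session:

# workers read and summarise; with the default .inorder = TRUE the results
# come back as a list in the same order as my.files
summaries <- foreach(file = my.files,
                     .packages = c("data.table", "dplyr")) %dopar% {
  fread(file) %>%
    group_by(location.id, year) %>%
    summarise(tot.rainfall = sum(rainfall))
}

# the master process writes the outputs one after the other
for (i in seq_along(summaries)) {
  fwrite(summaries[[i]], sub("\\.csv$", ".summary.csv", my.files[i]))
}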
HTH
Bastien
Upvotes: 2
Reputation: 711
The pbapply package is the easiest parallelisation approach:
library(pbapply)
library(parallel)    # for makeCluster()
library(data.table)  # for fread()
mycl <- makeCluster(4)
mylist <- pblapply(my.files, fread, cl = mycl)  # read all files in parallel, with a progress bar
stopCluster(mycl)
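Reading is only half of the task, though. A sketch (untested; the helper name summarise.one and the output naming are my own) of running the full read / summarise / write step per file with pblapply could look like this:

library(pbapply)
library(parallel)
library(data.table)
library(dplyr)

my.files <- list.files(pattern = "\\.csv$")

# hypothetical helper: summarise one file and write the result next to it
summarise.one <- function(file) {
  dat <- fread(file)
  out <- dat %>% group_by(year) %>% summarise(tot.rainfall = sum(rainfall))
  fwrite(out, sub("\\.csv$", ".summary.csv", file))
  out
}

mycl <- makeCluster(4)
clusterEvalQ(mycl, { library(data.table); library(dplyr) })  # load packages on every worker
res <- pblapply(my.files, summarise.one, cl = mycl)          # parallel, with a progress bar
stopCluster(mycl)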
Upvotes: 0