user1042267

Reputation: 303

Writing to a single common file in R using foreach

I have a parallel process in R that should save the results from each thread to one common file, but doing so causes data to overlap in the output. I could aggregate everything in a data frame and write it all at once at the end, but since the data is huge, I want to make sure nothing is lost if the system runs out of memory or something else goes wrong. How do I write to one file and ensure it is locked, or that the data is written asynchronously? I am running my code on Windows, in case that matters, and I am using doSNOW for parallelization.

Here is the main code (the worker function must be defined before the cluster runs it):

HedgeMain <- function(X, InitPnlRecon)
{
    OptimizedPort <- .............some computation
    # Every worker appends to the same file -- this is where the overlap occurs
    write.table(OptimizedPort, file = "C:/OptimizedAll.opt",
                quote = FALSE, append = TRUE, sep = ";",
                col.names = FALSE, row.names = FALSE)
    OptimizedPort
}

library(doSNOW)   # also loads foreach

cl <- makeCluster(6)
registerDoSNOW(cl)
no <- length(X)
HedgedPortfolio <- foreach(i = 1:no, .combine = 'rbind') %dopar%
{
    HedgeMain(as.Date(X[i]), InitPnlRecon)
}
stopCluster(cl)

Upvotes: 3

Views: 1742

Answers (2)

armen

Reputation: 443

Late answer, but while looking for a similar facility for my own code I came across the comm.write function from the pbdMPI package. According to the documentation, rank 0 creates the file and writes to it first; the remaining ranks then append their data to the file in rank order.

Upvotes: 0

IRTFM

Reputation: 263411

I do not think that R's write.* or cat functions provide the file-locking facilities needed to share a single destination file. You either need to use a database that supports such facilities or write to multiple files. Given your added requirement of resiliency in the event of node terminations, it sounds to me that you don't really want to run this as a tightly coupled process, but rather as distributed batch processes. There is a "Resource managers and batch schedulers" section in the High Performance Computing Task View where several of the packages sound applicable to this task: batch and BatchJobs in particular.
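The multiple-files approach needs no locking at all: each worker writes its results to its own part file, and the parts are combined once the parallel step finishes. Below is a minimal base-R sketch of that pattern; the file-name scheme, the toy "computation", and the use of lapply in place of the %dopar% loop are all illustrative assumptions, not code from the question.

```r
# Sketch: each task writes its own part file, so no two workers
# ever touch the same file, and completed tasks survive a crash.
out_dir <- tempdir()

run_task <- function(i) {
  result <- data.frame(task = i, value = i * 10)  # stand-in for the real computation
  part_file <- file.path(out_dir, sprintf("part_%04d.csv", i))
  write.table(result, file = part_file, quote = FALSE, sep = ";",
              col.names = FALSE, row.names = FALSE)
  result
}

invisible(lapply(1:4, run_task))  # in the real code this is the %dopar% loop

# After all tasks finish, combine the part files into one result.
parts <- list.files(out_dir, pattern = "^part_.*\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(parts, read.table, sep = ";"))
```

A failed or restarted job can also check which part files already exist and skip those tasks, which gives you cheap resumability on top of crash safety.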

There has been a recent discussion on R-Help and HPC-SIG that may be relevant. The thread starts here:

https://stat.ethz.ch/pipermail/r-help/2012-September/324748.html

Some of the posts in that thread describe a method for accessing particular offsets in the middle of a disk file from separate CPU workers. You would still need your own code to ensure that you were not overwriting "good data".
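That fixed-offset idea can be illustrated with base R's seek(): if every record has a known, fixed byte width, worker i can write its record at offset (i - 1) * width without touching its neighbours. The 32-byte record width and file layout below are illustrative assumptions, and note that R's own documentation cautions that file positioning via seek() is unreliable on Windows, so treat this as a sketch rather than a production recipe.

```r
# Sketch: workers write fixed-width records at known byte offsets.
record_width <- 32L                       # illustrative record size
outfile <- tempfile(fileext = ".dat")

# Pre-size the file so every offset exists before any worker writes.
writeBin(raw(record_width * 4L), outfile)

write_record <- function(i, text) {
  rec <- sprintf("%-*s", record_width, text)  # pad to exactly record_width bytes
  con <- file(outfile, open = "r+b")
  on.exit(close(con))
  seek(con, where = (i - 1L) * record_width, rw = "write")
  writeChar(rec, con, nchars = record_width, eos = NULL)
}

# Simulate four workers writing out of order; records cannot clobber each other.
for (i in c(3L, 1L, 4L, 2L)) write_record(i, sprintf("result-%d", i))

# Read the records back by the same offsets.
con <- file(outfile, open = "rb")
records <- sapply(1:4, function(i) {
  seek(con, (i - 1L) * record_width)
  trimws(readChar(con, record_width))
})
close(con)
```

You would still need your own bookkeeping (e.g. a sentinel value per slot) to detect partially written records after a crash, which is exactly the "not overwriting good data" caveat above.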

Upvotes: 1
