crysis405

Reputation: 1131

Work on 1 line at a time in parallel

I am reading in a large file one line at a time. To speed things up, I would like to process multiple lines in parallel, but the way I am doing it right now is not working. I have never tried this before, so I am not sure how it works.

testdata.csv

library(foreach)
library(doParallel) # or with doMC

registerDoParallel(cores = 2) # %dopar% needs a registered backend

# Read one block of `lines` rows from a file (or open connection),
# re-attaching the header row to blocks after the first.
read.block <- function(ifile, lines, block, readFunc = read.csv,
                       skip = (lines * (block - 1)) +
                         ifelse(header && block > 1 && !inherits(ifile, "connection"), 1, 0),
                       nrows = lines, header = TRUE, sep = "\t", ...) {
  if (block > 1) {
    colnams <- NULL
    if (header) {
      colnams <- unlist(readFunc(ifile, nrows = 1, header = FALSE,
                                 sep = sep, stringsAsFactors = FALSE))
    }
    p <- readFunc(ifile, skip = skip, nrows = nrows, header = FALSE, sep = sep, ...)
    if (!is.null(colnams)) {
      colnames(p) <- colnams
    }
  } else {
    p <- readFunc(ifile, skip = skip, nrows = nrows, header = header, sep = sep, ...)
  }
  return(p)
}

mendl.error <- matrix(NA, nrow = 15, ncol = 9) # one row of results per line read

foreach(i = 1:15) %dopar% {
  ifile.c <- file("testdata.csv", open = "r") # open file connection to read
  ifile.valid <- read.block(ifile.c, lines = 1, block = i) # read 1 line
  close(ifile.c)
  # do some other operations on the line, which will be saved into a matrix
  mendl.error[i, ] <- ifile.valid
}

Upvotes: 0

Views: 296

Answers (1)

Konrad Rudolph

Reputation: 545865

You haven’t specified what “doesn’t work” means, but I’m going out on a limb and guessing that it doesn’t speed things up as expected (although I’m also not quite clear on the semantics of what you’re attempting to do).

The reason for this is that your code is not compute bound, it’s IO bound: it has to wait for data from secondary storage. The bus for that data isn’t parallel, so all your data read requests get serialised. You cannot speed this up significantly by using parallelism in the fashion you’re attempting.
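You can check this claim on your own machine. The sketch below (assuming the same 15-line, tab-separated testdata.csv from the question) times 15 parallel one-line reads against a single sequential read; if the reads really serialise on the storage bus, the parallel version won’t win:

    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 2)

    # 15 parallel reads of one line each (skip the header plus i-1 data lines)
    t.parallel <- system.time(
      foreach(i = 1:15) %dopar% {
        read.csv("testdata.csv", header = FALSE, sep = "\t",
                 skip = i, nrows = 1)
      }
    )

    # one sequential read of all 15 lines
    t.single <- system.time(
      read.csv("testdata.csv", header = TRUE, sep = "\t", nrows = 15)
    )

    t.parallel["elapsed"]
    t.single["elapsed"]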

In fact, your code will probably be faster if you do the reading in one go and rely on R to do the right thing. If you really need raw reading performance here, you probably need to resort to memory-mapped files. A quick Google search turns up the R package bigmemory, which implements this.
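Here is a minimal sketch of the read-once approach: read the whole file sequentially, then parallelise only the per-line computation and let foreach collect the rows. The check.row function is a placeholder for whatever per-line check produces the 9 values per row; it is not from the original post:

    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 2)

    # one sequential pass over the file -- the fast path for IO
    dat <- read.csv("testdata.csv", header = TRUE, sep = "\t")

    # placeholder for the per-line computation
    check.row <- function(row) as.numeric(row)

    # parallelise the computation only, collecting results via .combine;
    # assigning into a shared matrix from inside %dopar% would not work,
    # since each worker writes to its own copy
    mendl.error <- foreach(i = seq_len(nrow(dat)), .combine = rbind) %dopar% {
      check.row(dat[i, ])
    }

And if the file is too large to hold in RAM, a memory-mapped bigmemory sketch (this assumes the data is all numeric, since a big.matrix holds a single type):

    library(bigmemory)
    x <- read.big.matrix("testdata.csv", header = TRUE, sep = "\t",
                         type = "double")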

Upvotes: 2
