Reputation: 1131
I am reading in a large file one line at a time. What I would like to do to speed everything up is to work on multiple lines in parallel. But the way I am doing it right now is not working. I have never tried this so I am not really sure how it works.
library(foreach)
library(doParallel) #or with doMC
read.block <- function(ifile, lines, block, readFunc=read.csv,
                       skip=(lines*(block-1))+ifelse((header) & (block>1) & (!inherits(ifile, "connection")),1,0),
                       nrows=lines, header=TRUE, sep="\t", ...){
  if(block > 1){
    colnams <- NULL
    if(header)
    {
      # read the header row on its own so later blocks can reuse the column names
      colnams <- unlist(readFunc(ifile, nrows=1, header=FALSE, sep=sep, stringsAsFactors=FALSE))
      #print(colnams)
    }
    p = readFunc(ifile, skip = skip, nrows = nrows, header=FALSE, sep=sep,...)
    if(! is.null(colnams))
    {
      colnames(p) = colnams
    }
  } else {
    p = readFunc(ifile, skip = skip, nrows = nrows, header=header, sep=sep)
  }
  return(p)
}
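registerDoParallel(2) #register a parallel backend; the worker count is only an example
#workers cannot write into a matrix that lives in the parent session,
#so the rows are returned from the loop body and collected with .combine
mendl.error <- foreach(i=1:15, .combine=rbind) %dopar% {
  ifile.c <- file("testdata.csv", open = "r") #open file connection to read
  ifile.valid <- read.block(ifile.c, lines=1, block=i) #read 1 line
  close(ifile.c)
  #do some other operations on the line which will be saved into a matrix
  unlist(ifile.valid) #the last value becomes row i of mendl.error
}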
Upvotes: 0
Views: 296
Reputation: 545865
You haven’t specified what “doesn’t work” means, but I’m going out on a limb and saying that it doesn’t speed up as expected (although I’m also not quite clear about the semantics of what you’re attempting to do).
The reason for this is that your code is not compute-bound, it’s I/O-bound: it spends its time waiting for data from secondary storage. The bus for that data isn’t parallel, so all your read requests get serialised. You cannot speed this up significantly by using parallelism in the way you’re attempting.
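If you want to see where the time goes on your own machine, you can time the reads separately. The following is only a rough sketch: `demo.file` and its random contents are made up for illustration, and the per-line loop mimics the open/skip/read/close pattern from the question without any parallelism. It shows how to measure, not what result to expect.

```r
# Throwaway tab-separated file so the sketch is self-contained (hypothetical data).
demo.file <- tempfile(fileext = ".csv")
demo.data <- as.data.frame(matrix(runif(2000 * 9), nrow = 2000, ncol = 9))
write.table(demo.data, demo.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Pattern from the question: re-open the file and skip ahead for every single line.
time.per.line <- system.time(
  for (i in 1:200) {
    con <- file(demo.file, open = "r")
    # skip = i jumps over the header plus the first i - 1 data lines
    one.line <- read.csv(con, skip = i, nrows = 1, header = FALSE, sep = "\t")
    close(con)
  }
)["elapsed"]

# Alternative: read the whole file once.
time.one.go <- system.time(
  all.rows <- read.csv(demo.file, sep = "\t", header = TRUE)
)["elapsed"]

print(c(per.line = unname(time.per.line), one.go = unname(time.one.go)))
```

The absolute numbers depend heavily on the disk, the file size and the operating system's cache, so treat them as a hint only.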
In fact, your code would probably be faster if you did the reading in one go and relied on R to do the right thing. If you really need raw reading performance here, you probably need to resort to memory-mapped files. A quick Google search turned up the R package bigmemory, which implements this.
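To illustrate the "read in one go" suggestion, here is a minimal sketch that reads the file once, sequentially, and only parallelises the per-line work. It assumes the data fit in memory. `process.row`, the demo file and the worker count of 2 are all placeholders I made up, since the actual per-line operations aren't shown in the question.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for "some other operations on the line".
process.row <- function(row) {
  as.numeric(unlist(row)) * 2
}

# Small tab-separated demo file so the sketch is self-contained (made-up data).
demo.file <- tempfile(fileext = ".csv")
demo.data <- as.data.frame(matrix(seq_len(15 * 9), nrow = 15, ncol = 9))
write.table(demo.data, demo.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read everything once; this part stays sequential.
all.rows <- read.csv(demo.file, sep = "\t", header = TRUE)

# Start a small cluster; the worker count is an arbitrary example.
cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration returns one row; .combine = rbind stacks them into a matrix.
mendl.error <- foreach(i = seq_len(nrow(all.rows)), .combine = rbind) %dopar% {
  process.row(all.rows[i, ])
}

stopCluster(cl)
```

Whether this pays off depends on how expensive the per-line work is. For cheap operations, the overhead of shipping rows to the workers can easily outweigh the gain, and a plain vectorised operation on `all.rows` will likely be faster.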
Upvotes: 2