temor

Reputation: 1029

Indexing takes a long time with a for loop?

I am running this for loop without any problems, but it takes a long time. I guess it could be faster with the apply family, but I'm not sure how. Any hints?

set.seed(1)
nrows <- 1200
ncols <- 1000
outmat <- matrix(NA, nrows, ncols)
dat <- matrix(5, nrows, ncols)
for (nc in 1:ncols) {
  for (nr in 1:nrows) {
    val <- dat[nr, nc]
    if (!is.na(val)) {
      # my real data, where dir2 is a list of files:
      # dir2 <- list.files("/data/dir2", "*.dat", full.names = TRUE)
      file <- readBin(dir2[val], numeric(), size = 4, n = 1200 * 1000)
      file <- matrix(file, nrow = 1200, ncol = 1000)
      outmat[nr, nc] <- file[nr, nc]
    }
  }
}

Upvotes: 0

Views: 176

Answers (1)

nicola

Reputation: 24490

Two solutions.

The first uses more memory but is more efficient, and should be feasible with the 24 files you mentioned: read all the files at once, then subset according to dat. Something like:

allContents <- do.call(cbind, lapply(dir2, readBin, what = "numeric", n = nrows * ncols, size = 4))
res <- matrix(allContents[cbind(1:length(dat), dat + 1)], nrows, ncols)

The second can handle a somewhat larger number of files (say 50-100). It reads each file in chunks and subsets accordingly. You have to open as many connections as you have files. For instance:

outmat <- matrix(NA, nrows, ncols)
connections <- lapply(dir2, file, open = "rb")
for (i in 1:ncols) {
  # read nrows values from each file; gives an nrows x length(dir2) matrix
  values <- vapply(connections, readBin, numeric(nrows), what = "numeric", n = nrows, size = 4)
  outmat[, i] <- values[cbind(seq_len(nrows), dat[, i] + 1)]
}
invisible(lapply(connections, close))

The +1 after dat is needed because, as you stated in the comments, the values in dat range from 0 to 23, and R indexing is 1-based.
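For reference, here is a minimal sketch (with toy values, not your data) of the matrix-indexing idiom both snippets rely on: passing a two-column matrix inside `[` picks one element per (row, column) pair.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)   # column-major: m[1, 2] is 4

# A two-column index matrix selects one element per row of the index:
idx <- cbind(c(1, 3), c(2, 4))          # (row 1, col 2) and (row 3, col 4)
m[idx]                                  # returns c(4L, 12L)

# With 0-based column indices (like the values in dat), add 1 first:
fileIdx <- c(1, 3)                      # 0-based
m[cbind(c(1, 3), fileIdx + 1)]          # same result as above
```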

Upvotes: 3
