Bruno Caram Muller

Reputation: 85

How to get better performance in R: one big file or several smaller files?

I had about 200 different files, each of them a big matrix (465x1080; that is huge for me). I then used cbind2 to combine them all into one bigger matrix (465x200000).

I did that because I needed to create one separate file for each row (465 files), and I thought it would be easier for R to load the data from one file into memory only ONCE and then read it row by row, creating a separate file for each row, instead of opening and closing 200 different files for every row.
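For context, the combining step was roughly this (the paths here are placeholders, and do.call(cbind, ...) stands in for the repeated cbind2 calls):

files <- list.files("matrices", pattern = "\\.txt$", full.names = TRUE)  # ~200 files
mats <- lapply(files, function(f) as.matrix(read.table(f)))  # each 465x1080
OriginalData <- do.call(cbind, mats)  # one wide matrix with 465 rows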

Is this really the fastest way? (I am wondering because it is now taking quite a long time.) When I check the Windows Task Manager, it shows the RAM used by R oscillating between 700MB and 1GB all the time (twice every second). It seems the main file wasn't loaded just once, but is being loaded into and erased from memory on every iteration (which could be the reason why it is a bit slow?).

I am a beginner, so all of this might not make much sense.

Here is my code (the +1 and -1 offsets are there because the original data has one extra column that I don't need in the new files):

extractStationData <- function(OriginalData, OutputName = "BCN-St") {

  for (i in 1:nrow(OriginalData)) {

    # one output matrix per row (station):
    OutputData <- matrix(NA, nrow = ncol(OriginalData) - 1, ncol = 3)
    colnames(OutputData) <- c("Time", "Bikes", "Slots")

    for (j in 1:(ncol(OriginalData) - 1)) {
      OutputData[j, 1] <- colnames(OriginalData)[j + 1]
      OutputData[j, 2] <- OriginalData[i, j + 1]
    }

    write.table(OutputData, file = paste(OutputName, i, ".txt", sep = ""))
    print(i)  # progress indicator

  }

}

Any thoughts? Maybe I should just create an object (holding the huge file) before the first for loop, so that it is loaded just once?
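In other words, something like this (the file name is a placeholder):

OriginalData <- as.matrix(read.table("combined.txt"))  # load the huge file once
extractStationData(OriginalData)  # all 465 writes then read from memory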

Thanks in advance.

Upvotes: 0

Views: 110

Answers (1)

minem

Reputation: 3650

Let's assume you have already created the 465x200000 matrix and that only the extractStationData function is in question. Then we can modify it, for example, like this:

require(data.table)
extractStationData <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1]  # remove the column you do not need
  # create the template matrix once, outside the loop:
  emptyMat <- matrix(NA, nrow = ncol(d2), ncol = 3)
  colnames(emptyMat) <- c("Time", "Bikes", "Slots")
  emptyMat[, 1] <- colnames(d2)  # the Time column is identical for every row
  for (i in 1:nrow(d2)) {
    OutputData <- emptyMat
    OutputData[, 2] <- d2[i, ]
    # fwrite is much faster than write.table; convert the matrix to a
    # data.table first, as fwrite expects a list-like input:
    fwrite(as.data.table(OutputData), file = paste(OutputName, i, ".txt", sep = ""))
  }
}
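To illustrate the call, a toy run with shrunk dimensions (the names below are made up):

# toy stand-in for the real 465x200000 matrix:
set.seed(42)
m <- matrix(rnorm(5 * 10), nrow = 5,
            dimnames = list(NULL, paste0("t", 1:10)))
d <- cbind(id = 1:5, m)  # the first column is the one the function drops
extractStationData(d)  # writes BCN-St1.txt ... BCN-St5.txt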

V2:

If your OriginalData is in matrix format, this approach for creating the list of new data.tables looks quite fast:

extractStationData2 <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1]  # remove the column you do not need
  ds <- split(d2, 1:nrow(d2))  # one element per row of the matrix
  r <- lapply(ds, function(x) {
    k <- data.table(colnames(d2), x, NA)
    setnames(k, c("Time", "Bikes", "Slots"))
    k
  })
  r
}
dl <- extractStationData2(d)  # list of new data objects
# write each one to its own file:
OutputName <- "BCN-St"
for (i in seq_along(dl)) {
  fwrite(dl[[i]], file = paste(OutputName, i, ".txt", sep = ""))
}

It should also work for a data.frame with a minor change: k <- data.table(colnames(d2), t(x), NA)
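A minimal sketch of that data.frame variant (the function name extractStationData2df is just for illustration):

require(data.table)
extractStationData2df <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1]  # drop the unneeded first column
  ds <- split(d2, 1:nrow(d2))  # one one-row data.frame per station
  lapply(ds, function(x) {
    k <- data.table(colnames(d2), t(x), NA)  # t() turns the one-row piece into a column
    setnames(k, c("Time", "Bikes", "Slots"))
    k
  })
}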

Upvotes: 1
