Bangyou

Reputation: 9816

Read large number of small files efficiently in R

I have about 700K small files (Condor log files, each less than 10 KB). There are no naming rules for the files. I am using list.files to obtain all the filenames, then reading each file with readLines and merging the results into a list.

Currently it takes several hours to read all the files. This is my code for reading the log files.

rm(list = ls())

base <- 'logs-025'
exts <- c('log', 'out', 'err')

for (i in seq_along(exts))
{
    # Match files like apsim_*.log, apsim_*.out, apsim_*.err
    all_files <- list.files(base, paste0('apsim_.*\\.', exts[i], '$'), full.names = TRUE)
    res <- vector('list', length(all_files))
    for (j in seq_along(all_files))
    {
        res[[j]] <- readLines(all_files[j])
    }
    save(res, file = paste0(Sys.info()['nodename'], '-', exts[i], '.RData'))
}

Is there an efficient way to read a large number of small files in R?

Thanks for any advice.

Cheers, Bangyou

Upvotes: 2

Views: 332

Answers (2)

l.dev

Reputation: 21

You could consider ldply from the plyr package. There are certainly newer, more efficient options, but this one significantly speeds up loading lots of small files compared with a for loop:

library("plyr")
a_table <- ldply(file_path_list, function(x){
   path <- x
   line <- readLines(path)
   return(line)
})

This assumes you have a list (file_path_list) containing the paths to all your individual files.
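For reference, a minimal sketch of building file_path_list for one extension, reusing the directory and naming pattern from the question (this part is an assumption, not from the answer):

# Sketch: collect all matching .log paths from the question's directory
file_path_list <- list.files('logs-025', pattern = 'apsim_.*\\.log$', full.names = TRUE)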

Upvotes: 0

JoelKuiper

Reputation: 4720

Depending on the total size of the data set (i.e. whether it will fit in memory), you might want to memory-map the files (for example with the ff package).

But in general the performance of R's I/O functions is poor, and I would recommend writing those loops in C.
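As a pure-R middle ground (not part of this answer), reading each file in a single call with readChar() instead of line by line often reduces per-file overhead; a sketch under the question's setup, where all_files holds the paths for one extension:

# Sketch: slurp each file in one call rather than line by line
read_whole_file <- function(path) {
    readChar(path, file.info(path)$size, useBytes = TRUE)
}
res <- lapply(all_files, read_whole_file)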

Upvotes: 1
