Reputation: 9816
I have about 700K small files (condor log files, each less than 10 KB). There are no rules for the filenames. I use list.files to obtain all the filenames, then read each file with readLines and merge the results into a list.
Currently it takes several hours to read all the files. This is my code to read the log files:
rm(list = ls())
base <- 'logs-025'
exts <- c('log', 'out', 'err')
for (i in seq_along(exts))
{
    # Escape the dot so it matches a literal '.' before the extension
    all_files <- list.files(base, paste0('apsim_.*\\.', exts[i], '$'),
                            full.names = TRUE)
    # Preallocate the result list instead of growing it
    res <- vector('list', length(all_files))
    for (j in seq_along(all_files))
    {
        res[[j]] <- readLines(all_files[j])
    }
    save(res, file = paste0(Sys.info()['nodename'], '-', exts[i], '.RData'))
}
Is there an efficient way to read a large number of small files in R?
Thanks for any advice.
Cheers, Bangyou
Upvotes: 2
Views: 332
Reputation: 21
You can consider ldply from the plyr package. There are certainly newer, more efficient options, but this one significantly speeds up loading a large number of small files compared with a for loop:
library("plyr")
a_table <- ldply(file_path_list, function(path) {
    readLines(path)
})
This assumes you have a list (file_path_list) containing the paths to all of your individual files.
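For completeness, file_path_list can be built with list.files(), reusing the directory name and filename pattern from the question (both carried over as assumptions). A minimal sketch; having the callback return a data frame keeps ldply's row-binding well-defined even when files contain different numbers of lines:

```r
library(plyr)

# Build the vector of paths; directory and pattern follow the question.
file_path_list <- list.files('logs-025', pattern = 'apsim_.*\\.log$',
                             full.names = TRUE)

# Return a data frame (one row per line, tagged with its source file)
# so ldply can rbind the pieces regardless of file length.
a_table <- ldply(file_path_list, function(path) {
    data.frame(file = path, line = readLines(path),
               stringsAsFactors = FALSE)
})
```

The file column makes it easy to trace any line back to the log it came from.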
Upvotes: 0
Reputation: 4720
Depending on the total size of the data set (i.e. whether it will fit in memory), you might want to memory-map the files (for example with the ff package).
In general, though, the performance of R's I/O functions is poor, and I would recommend writing those loops in C.
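Short of dropping to C, one base-R tweak that often helps with very many small files is to read each file in a single readChar() call (sized via file.info()) and split on newlines in memory, rather than letting readLines() parse line by line. A sketch under that assumption (read_fast is a hypothetical helper name; directory and pattern are taken from the question):

```r
# Slurp the whole file in one I/O call, then split into lines in memory.
# Per-call overhead tends to dominate with hundreds of thousands of
# small files, so fewer calls per file usually means less total time.
read_fast <- function(path) {
    size <- file.info(path)$size
    txt <- readChar(path, nchars = size, useBytes = TRUE)
    strsplit(txt, "\n", fixed = TRUE)[[1]]
}

all_files <- list.files('logs-025', pattern = 'apsim_.*\\.log$',
                        full.names = TRUE)
res <- lapply(all_files, read_fast)
```

Whether this beats readLines() in practice is worth benchmarking on a sample of the real logs before committing to it.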
Upvotes: 1