baptiste

Reputation: 77106

load new files in a directory

I have an R script that loads multiple text files from a directory and saves the data as a compressed .rda file. It looks like this:

#!/usr/bin/Rscript --vanilla

args <- commandArgs(TRUE)
## args[1] is the folder name

outname <- paste(args[1], ".rda", sep="")

files <- list.files(path=args[1], pattern="\\.txt$", full.names=TRUE)

tmp <- list()
if(file.exists(outname)){
  message("found ", outname)
  load(outname)
  tmp <- get(args[1]) # previously read stuff
  files <- setdiff(files, names(tmp))

}

if(length(files) == 0){
  message("no new files")
} else {

  ## read the new files into a list of data frames
  results <- plyr::llply(files, read.table, .progress="text")
  names(results) <- files

  ## append to the previously read data, stored under the folder name
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list=args[1], file=outname)
}
message("all done!")

The files are quite large (typically 50 of them, about 15 MB each), so running this script can take a few minutes, a substantial part of which is spent writing the compressed .rda file.

I often add new data files to the directory, so I would like to append them to the previously saved and compressed data. That is what the script above does by checking whether an output file with that name already exists. Even so, the last step, saving the .rda file, is still pretty slow.

Is there a smarter way to go about this in some package, keeping track of which files have already been read and saving the result faster?
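
For illustration only, one possible direction (a sketch, not part of the script above; the cache/ subdirectory is an assumption, and args[1] and files refer to the variables defined in the script) is to cache each parsed file as an individual .rds, so that only new files ever have to be written:

## sketch: cache each parsed file separately in an assumed cache/ subdirectory
cache_dir <- file.path(args[1], "cache")
dir.create(cache_dir, showWarnings = FALSE)

for (f in files) {
  cached <- file.path(cache_dir, paste0(basename(f), ".rds"))
  if (!file.exists(cached))        # only new files get parsed and written
    saveRDS(read.table(f), cached)
}

## rebuild the complete list from the per-file caches when needed
results <- lapply(list.files(cache_dir, full.names = TRUE), readRDS)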

I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but this function is not documented, so I'm not sure whether it makes sense to use it here.

Upvotes: 6

Views: 1019

Answers (1)

cbeleites

Reputation: 14093

For intermediate files that I need to read (or write) often, I use

save(..., compress = FALSE)

which speeds things up considerably.
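
Applied to the final step of the script in the question, that would be, for example:

save(list=args[1], file=outname, compress = FALSE)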

Upvotes: 6
