Reputation: 13
I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB), I was wondering if it might be more efficient to open two instances of R, reading half the CSVs in one instance and half in the other. Then I would write each half to a large CSV, read both back in, and combine them into a master CSV. Does anyone know if running two instances of R will improve the time it takes to read in all the CSVs?
I'm using a MacBook Pro running OS X 10.6 with 4 GB of RAM.
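Roughly, the single-instance version would look something like the sketch below (the file pattern, the combining step, and the output name are just placeholders for what I have in mind):

csv_files <- list.files(pattern = "\\.csv$")        # placeholder: collect the ~50 input files
tables    <- lapply(csv_files, read.csv)            # read each file into a data frame
master    <- do.call(rbind, tables)                 # assumes all files share the same columns
write.csv(master, "master.csv", row.names = FALSE)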
Upvotes: 1
Views: 178
Reputation: 4414
read.table() and related functions can be quite slow. The best way to tell whether you can benefit from parallelization is to time your R script and, separately, the basic reading of your files. For instance, in a terminal:
time cat *.csv > /dev/null
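For the R side of the comparison, one option is to time just the reads with system.time(); the csv_files vector below is an assumed way of gathering the file paths:

csv_files <- list.files(pattern = "\\.csv$")        # assumed: gather the ~50 file paths
system.time(tables <- lapply(csv_files, read.csv))  # how long R itself spends parsing them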
If the "cat" time is significantly lower, your problem is not I/O bound and you may parallelize. In which case you should probably use the parallel package, e.g
library(parallel)
csv_files <- c(.....)
my_tables <- mclapply(csv_files, read.csv)
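From there the per-file tables can be stacked and written out once. The mc.cores argument, the rbind step, and the way the file paths are gathered below are additions for illustration, not requirements:

library(parallel)

csv_files <- list.files(pattern = "\\.csv$")          # assumed: gather the input paths
my_tables <- mclapply(csv_files, read.csv,
                      mc.cores = detectCores())       # fork one worker per available core
master    <- do.call(rbind, my_tables)                # assumes identical column layouts
write.csv(master, "master.csv", row.names = FALSE)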
Upvotes: 1
Reputation: 176698
If the majority of your code's execution time is spent reading the files, then running two instances will likely be slower because the two R processes will be competing for disk I/O. But it would be faster if the majority of the time is spent "running a number of operations".
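In that second case, a sketch of how it could look without a second R instance, parallelizing the operations rather than the reads (process_table() is a hypothetical stand-in for your operations):

library(parallel)

csv_files     <- list.files(pattern = "\\.csv$")   # assumed input paths
process_table <- function(df) df                   # hypothetical placeholder for your operations

tables  <- lapply(csv_files, read.csv)                    # serial reads: the I/O-bound part
results <- mclapply(tables, process_table, mc.cores = 2)  # operations spread over two cores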
Upvotes: 2