Jacob Rosenberg-Wohl

Reputation: 13

Running two instances of R in order to improve large data reading performance

I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB each), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. I would then write each half to a large CSV, read both back in, and combine them into a master CSV. Does anyone know whether running two instances of R will improve the time it takes to read in all the CSVs?
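For reference, a minimal single-process version of what I have in mind would look something like the following (the data/ directory and file names are just placeholders):

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # ~50 files
tables <- lapply(csv_files, read.csv)               # read each ~80 MB file in turn
master <- do.call(rbind, tables)                    # combine (operations omitted)
write.csv(master, "master.csv", row.names = FALSE)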

I'm using a MacBook Pro running OS X 10.6 with 4 GB of RAM.

Upvotes: 1

Views: 178

Answers (2)

Karl Forner

Reputation: 4414

read.table() and related functions can be quite slow. The best way to tell whether you can benefit from parallelization is to time your R script and the raw reading of your files. For instance, in a terminal:

time cat *.csv > /dev/null
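To time the R side of the comparison, base R's system.time() works; the file name here is just a placeholder:

system.time(read.csv("one_of_the_files.csv"))  # R parsing time for a single file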

If the "cat" time is significantly lower, your problem is not I/O bound and you may parallelize. In which case you should probably use the parallel package, e.g

library(parallel)
csv_files <- c(.....)                       # vector of CSV file paths
my_tables <- mclapply(csv_files, read.csv)  # parse the files in parallel via forked workers
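A usage note: mclapply() defaults to two workers; its mc.cores argument controls the number of forked processes (forking works on OS X, but not on Windows). Combining the results afterwards might look like:

my_tables <- mclapply(csv_files, read.csv, mc.cores = 4)  # four workers
combined <- do.call(rbind, my_tables)                     # stack into one data frame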

Upvotes: 1

Joshua Ulrich

Reputation: 176698

If the majority of your code's execution time is spent reading the files, then two R processes will likely be slower, because they will be competing for disk I/O. But if the majority of the time is spent "running a number of operations", the second process could make things faster.
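One way to settle this empirically is to time both approaches on your actual files; a rough sketch, assuming the CSVs sit in a data/ directory:

library(parallel)
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical location
system.time(lapply(csv_files, read.csv))                  # serial baseline
system.time(mclapply(csv_files, read.csv, mc.cores = 2))  # two forked workers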

Upvotes: 2
