Jacob Rosenberg-Wohl

Reputation: 13

Running two instances of R in order to improve large data reading performance

I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB each), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. I would then write each half to a large CSV, read both back in, and combine them into a master CSV. Does anyone know whether running two instances of R will improve the time it takes to read in all the CSVs?
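For reference, a minimal single-process version of what I have in mind would look something like the following (the data/ directory and file names are just placeholders):

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # ~50 files
tables <- lapply(csv_files, read.csv)               # read each ~80 MB file in turn
master <- do.call(rbind, tables)                    # combine (operations omitted)
write.csv(master, "master.csv", row.names = FALSE)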

I'm using a MacBook Pro running OS X 10.6 with 4 GB of RAM.

Upvotes: 1

Views: 178

Answers (2)

Karl Forner

Reputation: 4414

read.table() and related functions can be quite slow. The best way to tell whether you can benefit from parallelization is to time your R script and the raw reading of your files. For instance, in a terminal:

time cat *.csv > /dev/null
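To time the R side of the comparison, base R's system.time() works; the file name here is just a placeholder:

system.time(read.csv("one_of_the_files.csv"))  # R parsing time for a single file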

If the "cat" time is significantly lower, your problem is not I/O bound and you may parallelize. In which case you should probably use the parallel package, e.g

library(parallel)
csv_files <- c(.....)                       # vector of CSV file paths
my_tables <- mclapply(csv_files, read.csv)  # parse the files in parallel via forked workers
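A usage note: mclapply() defaults to two workers; its mc.cores argument controls the number of forked processes (forking works on OS X, but not on Windows). Combining the results afterwards might look like:

my_tables <- mclapply(csv_files, read.csv, mc.cores = 4)  # four workers
combined <- do.call(rbind, my_tables)                     # stack into one data frame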

Upvotes: 1

Joshua Ulrich

Reputation: 176698

If the majority of your code's execution time is spent reading the files, then two R processes will likely be slower, because they will be competing for disk I/O. But if the majority of the time is spent "running a number of operations", the second process could make things faster.
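One way to settle this empirically is to time both approaches on your actual files; a rough sketch, assuming the CSVs sit in a data/ directory:

library(parallel)
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical location
system.time(lapply(csv_files, read.csv))                  # serial baseline
system.time(mclapply(csv_files, read.csv, mc.cores = 2))  # two forked workers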

Upvotes: 2
