Hudon
Hudon

Reputation: 1674

What's the fastest way to get CSV output into a data frame?

I have a program that outputs lines of CSV data that I want to load into a data frame. I currently load the data like so:

tmpFilename <- "tmp_file"
system(paste(procName, ">", tmpFilename), wait=TRUE)
myData <- read.csv(tmpFilename) # (I also pass in colClasses and nrows for efficiency)

However, I thought redirecting the output to a file just to read from it was inefficient (the program spits out about 30MB, so I want to handle it with optimal performance). I thought textConnection would solve this, so I tried:

con <- textConnection(system(procName, intern=TRUE))
myData <- read.csv(con)

This runs a lot slower, though, and whereas the first solution degrades linearly with input size, the textConnection solution's performance degrades exponentially it seems. The slowest part is creating the textConnection. read.csv here actually completes quicker than in the first solution since it's reading from memory.

My question is then, is creating a file just to run read.csv on it my best option with respect to speed? Is there a way to speed up the creation of a textConnection? bonus: why is creating a textConnection so slow?

Upvotes: 6

Views: 455

Answers (1)

David F
David F

Reputation: 1265

The "fastest way" will probably involve using something other than read.csv. However, sticking with read.csv, using pipe may be the way to go:

myData <- read.csv(pipe(procName))

It avoids reading the full text output into an intermediate buffer (at least before read.csv gets ahold of it).

Some timing comparisons:

> write.csv(data.frame(x=rnorm(1e5)), row.names=FALSE, file="norm.csv")
> system.time(d <- read.csv("norm.csv"))
   user  system elapsed 
  0.398   0.004   0.402 
> system.time(d <- read.csv(textConnection(system("cat norm.csv", intern=TRUE))))
   user  system elapsed 
 56.159   0.106  56.095 
> system.time(d <- read.csv(pipe("cat norm.csv")))
   user  system elapsed 
  0.475   0.012   0.531 

Upvotes: 3

Related Questions