Reputation: 3919
I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time it takes standard methods to load the column into RAM by a factor of 10.
What is my motivation? I have two files: one called Main.csv, which is 300000 rows and 500 columns, and one called Second.csv, which is 300000 rows and 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Now, if I use either of the two methods below to read the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600-megabyte file -- clearly unacceptable.
Method 1
colClasses <- rep('NULL', 500)  # 'NULL' tells read.csv to skip that column entirely
colClasses[1] <- NA             # NA lets read.csv guess the class of column 1
system.time(
  read.csv("Main.csv", colClasses = colClasses)
) # 40+ seconds, unacceptable
Method 2
read.table(pipe("cut -f1 Main.csv")) #40+ seconds, unacceptable
How can I reduce this time? I am hoping for an R solution.
Upvotes: 15
Views: 15077
Reputation: 308
There is a speed comparison of methods to read large CSV files in this blog. fread is the fastest by an order of magnitude.
As mentioned in the comments above, you can use the select parameter to choose which columns to read, so
fread("Main.csv", sep = ",", select = c("f1"))
will work.
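For completeness, a minimal sketch of timing this (assuming the data.table package is installed; "f1" above is only a placeholder column name, so selecting by position avoids depending on the header):
library(data.table)  # provides fread()
system.time(
  col1 <- fread("Main.csv", sep = ",", select = 1L)  # read only the first column, by position
)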
Upvotes: 11
Reputation: 226936
I would suggest
scan(pipe("cut -f1 -d, Main.csv"))
This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in a couple of ways:
- cut assumes tab-separation by default; you need to specify -d, to get comma-separation
- scan() is much faster than read.table for simple/unstructured data reads.
According to the comments by the OP, this takes about 4 rather than 40+ seconds.
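If the file has a header row, scan()'s defaults need a small adjustment; a minimal sketch, assuming Main.csv starts with a header line and the first column is numeric (both assumptions):
x <- scan(pipe("cut -d, -f1 Main.csv"), skip = 1)  # skip the header; default what = double()
# for a character column, ask for strings instead:
# x <- scan(pipe("cut -d, -f1 Main.csv"), what = "", skip = 1)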
Upvotes: 14