Reputation: 3919
I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time it takes standard methods to load the column into RAM by a factor of 10.
What is my motivation? I have two files: one called Main.csv, which is 300000 rows and 500 columns, and one called Second.csv, which is 300000 rows and 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Now, if I use either of the two methods below to read the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600-megabyte file -- clearly unacceptable.
Method 1
colClasses <- rep('NULL', 500)  # 'NULL' tells read.csv to skip that column entirely
colClasses[1] <- NA             # NA lets read.csv guess the class of column 1
system.time(
  read.csv("Main.csv", colClasses = colClasses)
) # 40+ seconds, unacceptable
Method 2
read.table(pipe("cut -f1 Main.csv")) #40+ seconds, unacceptable
How can I reduce this time? I am hoping for an R solution.
Upvotes: 15
Views: 15077
Reputation: 308
There is a speed comparison of methods to read large CSV files in this blog. fread is the fastest by an order of magnitude.
As mentioned in the comments above, you can use the select parameter to choose which columns to read, so
fread("Main.csv", sep = ",", select = c("f1"))
will work.
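For completeness, a minimal sketch of timing this (assuming the data.table package is installed; "f1" above is only a placeholder column name, so selecting by position avoids depending on the header):
library(data.table)  # provides fread()
system.time(
  col1 <- fread("Main.csv", sep = ",", select = 1L)  # read only the first column, by position
)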
Upvotes: 11
Reputation: 226936
I would suggest
scan(pipe("cut -f1 -d, Main.csv"))
This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in a couple of ways:
- cut assumes tab-separation by default; you need to specify -d, to get comma-separation
- scan() is much faster than read.table for simple/unstructured data reads.
According to the comments by the OP, this takes about 4 rather than 40+ seconds.
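If the file has a header row, scan()'s defaults need a small adjustment; a minimal sketch, assuming Main.csv starts with a header line and the first column is numeric (both assumptions):
x <- scan(pipe("cut -d, -f1 Main.csv"), skip = 1)  # skip the header; default what = double()
# for a character column, ask for strings instead:
# x <- scan(pipe("cut -d, -f1 Main.csv"), what = "", skip = 1)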
Upvotes: 14