Pierre Lapointe

Reputation: 16277

Extract number of rows from fread without reading the whole file

I have a large text file (475,000,000 lines). I would like to quickly get the number of rows in the file without reading it.

fread from data.table actually reports the number of rows quite quickly (~10 seconds) before it proceeds to read the whole file:

fread('D:/text_file.txt', select = 1, colClasses = "character")
Read 7.1% of 472933221 rows #number of rows appears after 10 seconds

Is there a way to extract this row count without then reading the whole file? For the record, reading the whole file takes 36 seconds.

I have tried countLines from R.utils but it takes 53 seconds. The difference might be that fread has an option to select only one column and countLines reads everything.

R.utils::countLines("D:/text_file.txt") #53 seconds

I have also tried other command-line methods on Windows, such as:

find /v /c "" "D:\text_file.txt" #takes 1 minute 50 seconds
grep "^" D:\text_file.txt | wc -l #takes 2 minutes

These work, but they're not as fast as fread. I'm on Windows.

Upvotes: 8

Views: 981

Answers (1)

Pierre Lapointe

Reputation: 16277

@d.b asked me to provide a detailed answer to my own question. As @G. Grothendieck suggested, the answer is to use wc, which is part of Rtools, a collection of resources for building packages for R under Microsoft Windows.

Once Rtools is installed, make sure C:\Rtools\bin is on your PATH in the Windows environment variables.

Then, wc becomes available to R using system or shell:

shell('wc -l "D:/text_file.txt"', intern = TRUE)
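The call above returns a character string (the count followed by the file name), not a number. If you need the count as a number in R, something like the following might work. This is only a sketch: parse_wc_count and count_rows_wc are hypothetical helper names, not part of any package, and it assumes wc is reachable on your PATH as described above.

```r
# Hypothetical helper: pull the leading count out of `wc -l` output,
# which looks like "472933221 D:/text_file.txt".
parse_wc_count <- function(wc_output) {
  # Trim padding, split on whitespace, keep the first field
  first_field <- strsplit(trimws(wc_output[1]), "[[:space:]]+")[[1]][1]
  # as.numeric rather than as.integer, so very large counts do not overflow
  as.numeric(first_field)
}

# Hypothetical wrapper: run wc (assumed to be on the PATH) and return the count
count_rows_wc <- function(path) {
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  parse_wc_count(out)
}

# Usage (path taken from the question):
# n_rows <- count_rows_wc("D:/text_file.txt")
```

Keep in mind that wc -l counts newline characters, so a header line is included in the total, and a file whose last line has no trailing newline will come out one short.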

Upvotes: 6
