Reputation: 16277
I have a large text file (475,000,000 lines). I would like to quickly get the number of rows in the file without reading it.
fread from data.table actually reports the row count quite rapidly (~10 seconds) before it proceeds to read the whole file:
fread('D:/text_file.txt', select = 1, colClasses = "character")
Read 7.1% of 472933221 rows #number of rows appears after 10 seconds
Is there a way to extract this row number without reading the whole file afterwards? For the record, reading the whole file takes 36 seconds.
I have tried countLines from R.utils, but it takes 53 seconds. The difference might be that fread has an option to select only one column and countLines reads everything.
R.utils::countLines("D:/text_file.txt") #53 seconds
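For comparison, counting lines without parsing any columns can be sketched in base R by reading the file as raw bytes in chunks and counting newline bytes. This is only an illustration under stated assumptions: count_newlines and chunk_size are hypothetical names, lines are assumed to end in "\n" (so "\r\n" files count correctly too), and a last line with no trailing newline is not counted. Whether it beats countLines on a given disk is untested here.

```r
# Hypothetical helper: count newline bytes by streaming the file in raw chunks.
# Nothing is parsed into columns; only one chunk is held in memory at a time.
count_newlines <- function(path, chunk_size = 64 * 1024^2) {
  con <- file(path, open = "rb")      # binary mode: no encoding or EOL translation
  on.exit(close(con))
  newline <- as.raw(10L)              # the byte 0x0A, i.e. "\n"
  total <- 0                          # double, so very large counts do not overflow
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break    # end of file reached
    total <- total + sum(chunk == newline)
  }
  total
}

# Usage (path taken from the question):
# count_newlines("D:/text_file.txt")
```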
I have also tried other Windows methods such as:
find /v /c "" "D:\text_file.txt" #takes 1 minute 50 seconds
grep "^" D:\text_file.txt | wc -l #takes 2 minutes
These work, but they're not as fast as fread. I'm on Windows.
Upvotes: 8
Views: 981
Reputation: 16277
@d.b asked me to provide a detailed answer to my own question. As @G. Grothendieck suggested, the answer is to use wc, which is part of Rtools, a collection of resources for building packages for R under Microsoft Windows.
Once installed, make sure C:\Rtools\bin is in your PATH environment variable in Windows.
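One way to confirm the PATH change from inside R is sketched below. It relies only on base R (Sys.which, Sys.getenv); the variable names are illustrative, and R usually needs to be restarted after editing PATH before the change is visible.

```r
# Ask R where (or whether) it can find the wc executable on PATH.
wc_path <- Sys.which("wc")
if (nzchar(wc_path)) {
  message("wc found at: ", wc_path)
} else {
  message("wc not found; check that the Rtools bin directory is on PATH")
}

# Show the PATH entries that mention Rtools, if any.
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
rtools_entries <- grep("rtools", path_entries, ignore.case = TRUE, value = TRUE)
print(rtools_entries)
```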
Then, wc becomes available to R using system or shell:
shell('wc -l "D:/text_file.txt"', intern = TRUE)
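The call above returns a character string containing the count followed by the file name, so the number still has to be extracted. A minimal sketch follows, assuming wc is on PATH; parse_wc_count and count_lines_wc are hypothetical helper names, and system2 is used in place of shell only so that the same code also runs outside Windows.

```r
# Hypothetical helper: pull the leading integer out of a line of `wc -l` output.
parse_wc_count <- function(wc_output) {
  # as.numeric rather than as.integer, so counts above 2^31 - 1 still fit
  as.numeric(sub("^[[:space:]]*([0-9]+).*$", "\\1", wc_output[1]))
}

# Hypothetical wrapper: run `wc -l` on a file and return the count as a number.
count_lines_wc <- function(path) {
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  parse_wc_count(out)
}

# Usage (path taken from the question):
# n_rows <- count_lines_wc("D:/text_file.txt")
```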
Upvotes: 6