Reputation: 971
I'm trying to count the number of lines in a gzip archive. Each .gz contains a single JSON-format text file. But when I open the archive and count the lines, the count is far from what I'd expect: the file contains 522 lines, but my code returns 668480.
import gzip
f = gzip.open(myfile, 'rb')
file_content = f.read()
for i, l in enumerate(file_content):
    pass
i += 1
print("File {1} contain {0} lines".format(i, myfile))
Upvotes: 0
Views: 7135
Reputation: 7994
For a performant way to count the lines in a gzip file you can use the pragzip
package:
import pragzip
result = 0
with pragzip.open(myfile) as file:
    while chunk := file.read(1024 * 1024):
        result += chunk.count(b'\n')
print(f"Number of lines: {result}")
Comparing the timing of the above with @DmitryKovriga's answer:
Number of lines: 33468793
Elapsed time is 22.373915 seconds.
File datasets/binance-futures_incremental_book_L2_2020-07-01_BTCUSDT.csv.gz contain 33468793 lines
Elapsed time is 31.278056 seconds.
A speedup closer to 10x should be possible with a suitable setup. See https://unix.stackexchange.com/a/713093/163459 for more info.
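If installing an extra package is not an option, the same chunked counting can be sketched with only the standard library's gzip module. This is a minimal sketch, not a benchmarked replacement: the function name count_newlines and its chunk_size parameter are made up for illustration, and it needs Python 3.8+ for the := operator.

```python
import gzip

def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a gzip file, reading decompressed data in chunks."""
    total = 0
    with gzip.open(path, "rb") as f:
        # f.read(n) returns at most n decompressed bytes, and b"" at end of file
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Note that counting b'\n' misses a final line that does not end with a newline, so the result can be one lower than a line-by-line count for such files.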
Upvotes: 1
Reputation: 358
You are iterating over every byte (character) of the content returned by f.read(), not over the lines. You can iterate over the lines like this:
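import gzip
count = 0  # stays 0 if the file is empty
with gzip.open(myfile, 'rb') as f:
    # Iterating over the file object yields one line at a time
    for count, _ in enumerate(f, start=1):
        pass
print("File {1} contain {0} lines".format(count, myfile))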
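To see the difference on a small case, here is a sketch that writes three short JSON lines to a temporary .gz file and counts both ways. The helper name count_bytes_and_lines and the sample data are made up for illustration.

```python
import gzip
import os
import tempfile

def count_bytes_and_lines(path):
    """Return (items from iterating f.read(), items from iterating f)."""
    with gzip.open(path, "rb") as f:
        # Iterating over a bytes object yields one int per byte
        n_bytes = sum(1 for _ in f.read())
    with gzip.open(path, "rb") as f:
        # Iterating over the file object yields one line at a time
        n_lines = sum(1 for _ in f)
    return n_bytes, n_lines

# Hypothetical sample data: three 9-byte lines (8 characters plus a newline)
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample, "wb") as f:
        f.write(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
    n_bytes, n_lines = count_bytes_and_lines(sample)
    print(n_bytes, n_lines)  # prints: 27 3
```

The first number is the decompressed size in bytes, which is what the code in the question was effectively reporting.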
Upvotes: 5