user3578723

Reputation: 13

Read a big gzip file line by line

I need to count how many times a number appears in a gzip file with 2912232966 lines. I have the following:

import gzip

count = 0
# `file` is the path to the .gz file
with gzip.open(file, 'rb') as f:
    for line in f:
        lin = line.decode('utf-8')
        # the number is the first tab-separated field
        number = lin[:lin.index('\t')]
        if number == '2719708':
            count += 1

but I get this error: 'CRC check failed 0xabc8df68 != 0xba1760acL'

It only works up to 400000000 lines. Help please.

Upvotes: 1

Views: 2525

Answers (1)

Amazingred

Reputation: 1007

link to zlib

Quote from jiffyclub's answer here:

The issue with the gzip module is not that it can't decompress the partial file, the error occurs only at the end when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file so the verification will never, ever work with a partial file.)

The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far, simple monkey patching will do. I wrote this context manager to temporarily replace gzip.GzipFile._read_eof while I decompress the partial file:

This looks to be exactly what you need.
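To see why the check in the quote can only run at the very end: a gzip member finishes with an 8-byte trailer holding the CRC-32 of the uncompressed data, followed by ISIZE, the uncompressed size modulo 2^32. Here is a minimal sketch that reads that trailer. The function name `read_gzip_trailer` is made up for illustration, and it assumes a complete, single-member file with nothing appended after it.

```python
import struct


def read_gzip_trailer(path):
    """Return (crc32, isize) from the last 8 bytes of a gzip file.

    Hypothetical helper: assumes a complete, single-member gzip file.
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)  # 2 == os.SEEK_END: position 8 bytes before the end
        # Two little-endian unsigned 32-bit integers: CRC-32, then ISIZE.
        crc32, isize = struct.unpack("<II", f.read(8))
    return crc32, isize
```

With a file of this size the uncompressed data is almost certainly larger than 4 GiB, so ISIZE will have wrapped around and cannot be used as the real length. On a truncated file the last 8 bytes are just compressed data, so the values are meaningless.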

Go to that link and read the entire response.
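If you would rather not patch gzip at all, here is a rough sketch of a different approach (this is not the context manager from the quote): feed the file to `zlib.decompressobj` in chunks and count as you go. Passing `zlib.MAX_WBITS | 16` makes zlib expect the gzip header and trailer. The names `count_first_field` and `chunk_size` are made up for illustration. It assumes a single gzip member and stops at the first corrupt block, so treat the result as "count up to the point of damage", not a full count.

```python
import zlib


def count_first_field(path, target, chunk_size=1 << 20):
    """Count lines whose first tab-separated field equals `target`.

    Hypothetical helper: reads one gzip member and stops quietly on
    truncated or corrupt input, keeping the count reached so far.
    """
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip framing
    prefix = target.encode("utf-8") + b"\t"
    count = 0
    tail = b""  # incomplete last line carried over to the next chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file (possibly a truncated one)
            try:
                data = decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt data or checksum mismatch
            lines = (tail + data).split(b"\n")
            tail = lines.pop()  # text after the last newline
            count += sum(1 for line in lines if line.startswith(prefix))
            if decomp.eof:
                break  # end of the first gzip member
    if tail.startswith(prefix):
        count += 1  # final line without a trailing newline
    return count
```

Usage would be something like `count_first_field(file, '2719708')`. Comparing bytes against `b'2719708\t'` avoids decoding every one of the ~2.9 billion lines. On a truncated file zlib never reaches the trailer, so no checksum error is raised. A genuinely bad checksum or corrupt block still raises `zlib.error`, which this sketch catches, losing the output of that last chunk.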


I found it by searching Google for a Stack Exchange link to "python gzip crc check failed"; it was the first result.

Upvotes: 1
