julienfr112

Reputation: 2137

How to decompress multiple file .gz chunk by chunk with python

I'm trying to decompress a very large .gz file (a Common Crawl web extract) during download, but zlib stops after the first gzip member (the file appears to be many concatenated gzip files).

import requests,json,zlib
fn="crawl-data/CC-MAIN-2017-04/segments/1484560279933.49/warc/CC-MAIN-20170116095119-00381-ip-10-171-10-70.ec2.internal.warc.gz"
fn="https://commoncrawl.s3.amazonaws.com/"+fn
r = requests.get(fn, stream=True)
d = zlib.decompressobj(zlib.MAX_WBITS | 16)
for chunk in r.iter_content(chunk_size=2048):
    if chunk:
        outstr = d.decompress(chunk)
        print(len(chunk),chunk[:10].hex(),len(outstr),len(d.unused_data))

Only the first member is decompressed; everything after it ends up in "unused_data" and is never decompressed.

It works great when piping to zcat:

curl https://commoncrawl.s3... | zcat | ....

Upvotes: 2

Views: 774

Answers (1)

Mark Adler

Reputation: 112597

You pretty much gave the answer to your own question. You are dealing with a concatenation of gzip streams (which is itself a valid gzip stream), so when you get eof from the decompression object, you need to fire up a new decompressobj for each, using the unused_data you noted from the last one to start the next one.
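A minimal sketch of that approach, reusing the URL and chunk size from the question (the handling of the decompressed output is left as a placeholder):

    import requests, zlib

    fn = "crawl-data/CC-MAIN-2017-04/segments/1484560279933.49/warc/CC-MAIN-20170116095119-00381-ip-10-171-10-70.ec2.internal.warc.gz"
    url = "https://commoncrawl.s3.amazonaws.com/" + fn

    r = requests.get(url, stream=True)
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)   # 16 + MAX_WBITS -> expect a gzip wrapper

    for chunk in r.iter_content(chunk_size=2048):
        if not chunk:
            continue
        data = chunk
        while data:
            out = d.decompress(data)
            # ... process `out` here ...
            if d.eof:
                # End of one gzip member: any bytes left over belong to the
                # next member, so start a fresh decompressor on them.
                data = d.unused_data
                d = zlib.decompressobj(zlib.MAX_WBITS | 16)
            else:
                data = b""

The inner while loop handles the case where a single network chunk spans the end of one gzip member and the start of the next (or even several small members).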

Upvotes: 3
