erikreppel

Reputation: 81

Read multiple files from concatenated gzip in Python

If I have a gzipped file and I concatenate it with another gzipped file, is it possible to read the two files separately in Python?

Ex:

cat f1.csv.gz f2.csv.gz > f3.csv.gzip

I know this is possible in Go, but is there a way to do this in Python?

Upvotes: 2

Views: 1377

Answers (2)

RomainGroux

Reputation: 21

@MarkAdler Thank you very much for this answer. It actually helped me quite a bit!

Now I just want to add a small detail that can save you a lot of time. The current answer does not detect truncated files the way gzip/zcat would.

zcat file.gz 
gzip: file.gz: unexpected end of file

To correct this, check z.eof once the input is exhausted. If it is False, the gzip file is truncated. If you skip this check, you'll never see an error.

Here is the modified code:

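#!/usr/bin/env python3
import sys
import zlib

z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip format, 32 KiB window
count = 0
while True:
    if z.unused_data == b"":
        # Read raw bytes; sys.stdin.buffer is the binary stream in Python 3
        buf = sys.stdin.buffer.read(8192)
        if buf == b"":
            # Input exhausted: if the current stream never ended, the file is truncated
            if not z.eof:
                raise RuntimeError("unexpected end of file")
            break
    else:
        # Leftover bytes are the start of the next gzip stream
        print(count)
        count = 0
        buf = z.unused_data
        z = zlib.decompressobj(31)
    got = z.decompress(buf)
    count += len(got)
print(count)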
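
To see the check in isolation, here is a minimal sketch that tests an in-memory bytes object rather than stdin. The helper name is_complete_gzip is made up for this illustration, and it assumes the data is either valid gzip or valid gzip that was cut short (corrupt input would raise zlib.error instead).

```python
import gzip
import zlib


def is_complete_gzip(data: bytes) -> bool:
    """Return True if `data` ends exactly at the end of a gzip stream.

    Hypothetical helper for illustration only; it also walks through
    concatenated streams by following z.unused_data.
    """
    z = zlib.decompressobj(31)  # 31 = 16 + 15: expect gzip header and trailer
    z.decompress(data)
    while z.eof and z.unused_data:
        # Another stream follows; restart on the leftover bytes.
        rest = z.unused_data
        z = zlib.decompressobj(31)
        z.decompress(rest)
    # eof is only set once the trailer (CRC and length) has been consumed.
    return z.eof


whole = gzip.compress(b"a,b\n1,2\n")
print(is_complete_gzip(whole))       # complete stream
print(is_complete_gzip(whole[:-4]))  # last 4 bytes of the trailer cut off
```

Note that z.decompress() itself does not raise on the truncated input; only the eof attribute reveals the problem.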

Upvotes: 0

Mark Adler

Reputation: 112597

Yes. Use z = zlib.decompressobj(31) (the value 31 = 16 + 15 selects the gzip format with the maximum window size), and feed input to z until z.unused_data is not empty or you have processed all of the input. If z.unused_data is not empty, it contains the start of the next gzip stream. Create a new zlib.decompressobj(31) object and start decompressing with the contents of z.unused_data, continuing with more data from the file.

This prints the uncompressed size of each concatenated gzip component:

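#!/usr/bin/env python3
import sys
import zlib

z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip format, 32 KiB window
count = 0
while True:
    if z.unused_data == b"":
        # Read raw bytes; sys.stdin.buffer is the binary stream in Python 3
        buf = sys.stdin.buffer.read(8192)
        if buf == b"":
            break
    else:
        # Leftover bytes are the start of the next gzip stream
        print(count)
        count = 0
        buf = z.unused_data
        z = zlib.decompressobj(31)
    got = z.decompress(buf)
    count += len(got)
print(count)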
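
If you want the contents of each component rather than just its size, the same loop can be wrapped in a generator. This is only a sketch: the name iter_gzip_members and the chunk_size parameter are invented for the example, each component is buffered fully in memory, and it relies on z.eof (available in Python 3.3+) to tell where a stream ends.

```python
import gzip
import io
import zlib


def iter_gzip_members(fileobj, chunk_size=8192):
    """Yield the decompressed bytes of each gzip stream in `fileobj`.

    Hypothetical helper; `fileobj` must be opened in binary mode.
    """
    z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip format
    parts = []
    pending = b""      # bytes left over from the previous stream
    in_member = False  # True once a stream has received some input
    while True:
        buf = pending or fileobj.read(chunk_size)
        pending = b""
        if not buf:
            if in_member:
                raise EOFError("unexpected end of file inside a gzip stream")
            return
        in_member = True
        parts.append(z.decompress(buf))
        if z.eof:
            # Stream finished: emit it and restart on any leftover bytes.
            yield b"".join(parts)
            parts = []
            pending = z.unused_data
            z = zlib.decompressobj(31)
            in_member = False


# Equivalent of: cat f1.csv.gz f2.csv.gz > f3.csv.gzip
data = gzip.compress(b"first\n") + gzip.compress(b"second\n")
for i, content in enumerate(iter_gzip_members(io.BytesIO(data)), start=1):
    print(i, content)
```

With a real file you would pass open("f3.csv.gzip", "rb") instead of the io.BytesIO object. For very large components it would be better to yield chunks as they are decompressed instead of joining them.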

Upvotes: 2
