Reputation: 81
If I have a gzipped file and concatenate it with another gzipped file, is it possible to read the two files separately in Python?
Ex:
cat f1.csv.gz f2.csv.gz > f3.csv.gzip
I know this is possible in Go, but is there a way to do this in Python?
Upvotes: 2
Views: 1377
Reputation: 21
@MarkAdler Thank you very much for this answer. It actually helped me quite a bit!
I just want to add a small detail that can save you a lot of time. The current answer will not detect truncated files the way gzip/zcat would:
zcat file.gz
gzip: file.gz: unexpected end of file
To correct this, check z.eof (available in Python 3.3 and later) once the input runs out. If it is False, the gzip file is truncated. If you don't do this, you'll never see an error.
Here is the modified code:
#!/usr/bin/env python3
import sys
import zlib

z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip header and trailer, 32 KiB window
count = 0
while True:
    if z.unused_data == b"":
        buf = sys.stdin.buffer.read(8192)  # read raw bytes, not text
        if buf == b"":
            # check truncated file
            if not z.eof:
                raise RuntimeError("unexpected end of file")
            break
    else:
        # unused_data holds the start of the next gzip stream
        print(count)
        count = 0
        buf = z.unused_data
        z = zlib.decompressobj(31)
    got = z.decompress(buf)
    count += len(got)
print(count)
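For anyone who wants the same check outside of a stdin script, here is a minimal sketch of the loop wrapped in a function. The name member_sizes, the chunk_size parameter, and the choice of EOFError instead of RuntimeError are my own assumptions for illustration, not part of the original answer.

```python
import gzip
import io
import zlib


def member_sizes(stream, chunk_size=8192):
    """Return the uncompressed size of each gzip member in a binary stream.

    Raises EOFError if the input ends before the last member is complete.
    """
    sizes = []
    z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip format, 32 KiB window
    count = 0
    while True:
        if z.unused_data == b"":
            buf = stream.read(chunk_size)
            if buf == b"":
                # Input exhausted: the current member must have reached its end.
                if not z.eof:
                    raise EOFError("unexpected end of file")
                break
        else:
            # Leftover bytes are the start of the next member.
            sizes.append(count)
            count = 0
            buf = z.unused_data
            z = zlib.decompressobj(31)
        count += len(z.decompress(buf))
    sizes.append(count)
    return sizes


if __name__ == "__main__":
    # Build two concatenated members in memory instead of reading real files.
    data = gzip.compress(b"a,b\n1,2\n") + gzip.compress(b"c,d\n3,4\n")
    print(member_sizes(io.BytesIO(data)))
    try:
        member_sizes(io.BytesIO(data[:-4]))  # cut into the last trailer
    except EOFError as exc:
        print("truncated:", exc)
```

Passing any binary file object (for example open("f3.csv.gzip", "rb")) should work the same way as the in-memory stream used here.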
Upvotes: 0
Reputation: 112597
Yes. Use z = zlib.decompressobj(31), and then use z to decompress until z.unused_data is not empty or you have processed all of the input. If z.unused_data is not empty, it contains the start of the next gzip stream. Create a new zlib.decompressobj(31) object and start decompressing with the contents of z.unused_data, continuing with more data from the file.
This prints the uncompressed size of each concatenated gzip component:
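#!/usr/bin/env python3
import sys
import zlib

z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip header and trailer, 32 KiB window
count = 0
while True:
    if z.unused_data == b"":
        buf = sys.stdin.buffer.read(8192)  # read raw bytes, not text
        if buf == b"":
            break
    else:
        # unused_data holds the start of the next gzip stream
        print(count)
        count = 0
        buf = z.unused_data
        z = zlib.decompressobj(31)
    got = z.decompress(buf)
    count += len(got)
print(count)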
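If you want the contents of each file rather than just its size, the same idea can be sketched as a generator that yields one decompressed member at a time. This is only an illustration under my own assumptions: the name iter_gzip_members and the chunk_size parameter are hypothetical, each member is held fully in memory, and a truncated final member is reported with EOFError.

```python
import gzip
import io
import zlib


def iter_gzip_members(stream, chunk_size=8192):
    """Yield the decompressed bytes of each gzip member in a binary stream."""
    buf = stream.read(chunk_size)
    while buf:
        z = zlib.decompressobj(31)  # 31 = 16 + 15: gzip format, 32 KiB window
        parts = [z.decompress(buf)]
        while not z.eof:
            buf = stream.read(chunk_size)
            if not buf:
                # Ran out of input in the middle of a member.
                raise EOFError("unexpected end of file")
            parts.append(z.decompress(buf))
        yield b"".join(parts)
        # Bytes past the end of this member begin the next one; if there
        # are none, read more from the stream.
        buf = z.unused_data or stream.read(chunk_size)


if __name__ == "__main__":
    # Equivalent of: cat f1.csv.gz f2.csv.gz > f3.csv.gzip, built in memory.
    f1 = b"a,b\n1,2\n"
    f2 = b"c,d\n3,4\n"
    combined = gzip.compress(f1) + gzip.compress(f2)
    for i, member in enumerate(iter_gzip_members(io.BytesIO(combined)), 1):
        print(i, member)
```

For large files you would probably write each chunk out as it is decompressed instead of joining the parts in memory.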
Upvotes: 2