Mark J Seger

Reputation: 367

Reading an improperly closed gz file with Python

When I try to read a gz file with Python's gzip library, it raises an error, much as gunzip does when run on the same file. However, it IS possible to read the file with Perl, I believe because the library it uses does not make the additional check for a clean EOF on the file being read.

My question is: are there any options or alternative libraries for reading such a file in Python, or do I just need to do this in Perl?

Upvotes: 1

Views: 1513

Answers (2)

Asclepius

Reputation: 63272

To decompress incomplete gzipped bytes that are in memory, the answer by Yann Vernier is useful, but it omits the wbits argument, which I found to be necessary:

import zlib
incomplete_decompressed_content = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(incomplete_gzipped_content)

Note that zlib.MAX_WBITS | 16 is 15 | 16 which is 31. For some background about wbits, see zlib.decompress.
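As a minimal self-contained sketch of the one-liner above, assuming Python 3: the helper name decompress_truncated_gzip and the sample data are made up for illustration, and the truncation is simulated by slicing a complete gzip stream in half.

```python
import gzip
import zlib


def decompress_truncated_gzip(data: bytes) -> bytes:
    """Return whatever can be recovered from possibly truncated gzip bytes.

    Hypothetical helper name, used only for this illustration.
    """
    # MAX_WBITS | 16 (i.e. 31) tells zlib to expect a gzip header and trailer
    # instead of the default zlib wrapper.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # decompress() returns the output available so far and does not complain
    # if the stream ends early.
    return d.decompress(data)


# Sample data, chosen arbitrarily for the demonstration.
original = str(list(range(2000))).encode("ascii")
complete = gzip.compress(original)

# Simulate an improperly closed file by dropping the second half.
truncated = complete[: len(complete) // 2]

recovered = decompress_truncated_gzip(truncated)
print(len(original), len(recovered))
```

The recovered bytes should be a prefix of the original data; how much comes back depends on where the stream was cut.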


Credit: answer by dnozay which notes the lower bounds of different values of wbits needed for different encodings.

Upvotes: 1

Yann Vernier

Reputation: 15877

The standard Python library can be used for this, albeit more clumsily than for intact files. (The sessions below are from Python 2; in Python 3, zlib works on bytes rather than str.)

>>> import zlib
>>> compressed=zlib.compress(str(range(200)))
>>> len(compressed)
375
>>> trunc=compressed[:50]
>>> zlib.decompress(trunc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
>>> d=zlib.decompressobj()
>>> d.decompress(trunc)
'[0, 1, 2, 3, 4, 5, 6, 7, 8, 9'
>>> d.flush()
''

Note that decompressobj.flush() returns whatever output is still pending and finishes the stream, so only call it after your input stream has ended (or on a copy - there is a decompressobj.copy() method). You can feed compressed data in with as many decompressobj.decompress() calls as you like.

>>> d=zlib.decompressobj()
>>> for i in range(0,140,10):
...   print repr(d.decompress(compressed[i:i+10]))
...
''
''
''
'[0, 1, 2, 3, 4'
', 5, 6, 7, 8, 9'
', 10, 11, 12, 13, 14, 15, 16, '
'17, 18, 19, 20, 21, 22, 23, '
'24, 25, 26, 27, 28, 29, 3'
'0, 31, 32, 33, 34, 35, 36, '
'37, 38, 39, 40, 41, 42, 4'
'3, 44, 45, 46, 47, 48, 49, '
'50, 51, 52, 53, 54, 55, 5'
'6, 57, 58, 59, 60, 61, 62, 6'
'3, 64, 65, 66, 67, 68, 6'
>>> d.flush()
''

(I haven't seen flush() actually return anything, but that's probably because this is such a simple data sample.)
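For Python 3, where zlib takes and returns bytes, the chunked feeding shown above might look roughly like this. It is a sketch only: the function name decompress_in_chunks and the chunk size are arbitrary choices, not part of zlib.

```python
import zlib


def decompress_in_chunks(data: bytes, chunk_size: int = 10) -> bytes:
    """Feed zlib-compressed data to a decompressobj a few bytes at a time.

    Hypothetical helper for illustration; works on truncated input too.
    """
    d = zlib.decompressobj()
    parts = []
    for i in range(0, len(data), chunk_size):
        # Each call returns whatever output has become available so far,
        # which may be empty while zlib is still reading headers.
        parts.append(d.decompress(data[i:i + chunk_size]))
    # Only flush once all input has been supplied.
    parts.append(d.flush())
    return b"".join(parts)


original = str(list(range(200))).encode("ascii")
compressed = zlib.compress(original)

# Keep only the first 140 bytes to imitate a truncated stream.
partial = decompress_in_chunks(compressed[:140])
print(partial[:40])
```

On a complete stream the same function returns the full data; on a truncated one it returns the leading part that could be decoded.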

Edit: I missed one point. Gzip files have a header which the gzip module normally handles, so zlib with its default settings will not read gzip files directly. It may be easier to use GzipFile and read in smaller chunks.
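A rough sketch of the GzipFile approach, assuming Python 3, where reading past the end of a truncated gzip stream raises EOFError. The function name read_truncated_gzip and the chunk size are illustrative choices. Some bytes decoded just before the error may be lost, which is one reason to keep the chunks small.

```python
import gzip
import io


def read_truncated_gzip(fileobj, chunk_size: int = 64) -> bytes:
    """Read as much as possible from a gzip stream that may end early.

    Hypothetical helper for illustration; fileobj is any binary file object.
    """
    parts = []
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while True:
            try:
                chunk = gz.read(chunk_size)
            except EOFError:
                # The compressed data ended without the end-of-stream marker.
                break
            if not chunk:
                # Clean end of file.
                break
            parts.append(chunk)
    return b"".join(parts)


original = str(list(range(2000))).encode("ascii")
complete = gzip.compress(original)

# Simulate an improperly closed file by dropping the second half.
truncated = complete[: len(complete) // 2]

salvaged = read_truncated_gzip(io.BytesIO(truncated))
print(len(salvaged))
```

To use it on a file on disk, pass a file opened in binary mode, for example open(path, "rb"), in place of the BytesIO object.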

Upvotes: 3
