user1043502

How to read a gzip string with no header or mimetype using Python?

I have a gzipped string that was created and stored by another application. Now that I have the string (no mimetype or headers attached), I need to uncompress it.

Is there a way to do this in Python?

[EDIT] To test, I literally copied and pasted the string into Notepad and renamed the file to .gz. I've also tested by pasting the string itself into IDLE.

Other examples I've seen assume a filetype and mimetype are available and all I have is a big string.

Using zlib.decompress(mystring) gives the error: Error -3 while decompressing data: incorrect header check

Upvotes: 3

Views: 4746

Answers (1)

John Machin

Reputation: 82924

Confirming the comments by @reclosedev, and adding some more:

The bytes after the ] need to be base64-decoded.

In the result of that, there are 4 bytes constituting the length of the decompressed data as a 32-bit little-endian binary number. The remainder is an RFC-1952-compliant gzip stream, recognisable by starting with 1F 8B 08. The decompression results look like binary data, not strings of ASCII 1s and 0s.
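
The layout described above can be checked before any decompression is attempted. The following is a minimal Python 3 sketch (the code in this answer is Python 2); the helper name split_payload and the synthetic sample bytes are assumptions made for illustration and are not taken from the csv file.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=8 (deflate), per RFC 1952


def split_payload(raw):
    """Split base64-decoded bytes into (declared_length, gzip_stream).

    Hypothetical helper: assumes a 4-byte little-endian length prefix
    followed directly by a gzip stream.
    """
    if len(raw) < 4 + len(GZIP_MAGIC):
        raise ValueError("payload too short")
    (declared_length,) = struct.unpack("<I", raw[:4])
    stream = raw[4:]
    if not stream.startswith(GZIP_MAGIC):
        raise ValueError("no gzip magic after the 4-byte prefix")
    return declared_length, stream


# Synthetic payload built only for illustration.
sample_data = b"\x00" * 50 + b"\xff" * 50
sample_raw = struct.pack("<I", len(sample_data)) + gzip.compress(sample_data)

declared_length, stream = split_payload(sample_raw)
print(declared_length, stream[:3].hex())
```

If the magic bytes are missing, the data is probably framed differently (or not base64-decoded yet), and passing it to zlib is unlikely to work.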

Code:

# Python 2
import struct
import zlib

lines = [
    # extracted from the linked csv file
    "[133,120,696,286]MmEAAB+LCAAAAAAABADtvQdg [BIG snip] a0bokyYQAA",
    "[73,65,564,263]bkgAAB+LCAAAAAAABADtvQdgHE [BIG snip] kgAAA==",
    ]
for line in lines:
    print
    # The base64 text is everything after the closing bracket.
    b64 = line.split(']')[1]
    raw = b64.decode('base64')
    # First 4 bytes: decompressed length as a 32-bit little-endian integer.
    print "unknown:", repr(raw[:4])
    print "unknown as 32-bit LE int:", struct.unpack("<I", raw[:4])[0]
    # wbits=31 (16 + 15) makes zlib expect a gzip header and trailer.
    ungz = zlib.decompress(raw[4:], 31)
    print len(ungz), "bytes in decompressed data"
    print "first 100:", repr(ungz[:100])

Output:

unknown: '2a\x00\x00'
unknown as 32-bit LE int: 24882
24882 bytes in decompressed data
first 100: '\xff\xe0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xf0\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00'

unknown: 'nH\x00\x00'
unknown as 32-bit LE int: 18542
18542 bytes in decompressed data
first 100: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\xff\xff\xff\xff
\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x07\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x80
\x00\x00\x00'
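
For readers on Python 3, where str has no .decode('base64') and print is a function, the same steps might look like the sketch below. The function name decode_line and the synthetic test line are assumptions for illustration; the bracketed numbers in it are made up.

```python
import base64
import gzip
import struct
import zlib


def decode_line(line):
    """Decode one '[a,b,c,d]<base64>' line into the decompressed bytes."""
    # The base64 text is everything after the closing bracket.
    b64 = line.split("]", 1)[1]
    raw = base64.b64decode(b64)
    # First 4 bytes: decompressed length, 32-bit little-endian.
    (declared_length,) = struct.unpack("<I", raw[:4])
    # wbits = 16 + MAX_WBITS (31) tells zlib to expect a gzip wrapper.
    data = zlib.decompress(raw[4:], 16 + zlib.MAX_WBITS)
    if len(data) != declared_length:
        raise ValueError("length prefix does not match decompressed size")
    return data


# Synthetic line for illustration only.
payload = b"\xff\xe0" + b"\x00" * 98
raw = struct.pack("<I", len(payload)) + gzip.compress(payload)
line = "[1,2,3,4]" + base64.b64encode(raw).decode("ascii")

print(len(decode_line(line)), "bytes in decompressed data")
```

Calling zlib.decompress on the same bytes without the wbits argument makes zlib look for a zlib header instead of a gzip one, which is what produces the "incorrect header check" error quoted in the question.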

Update in response to comment

To get the 1s and 0s I needed, I just added this to the above:
cleaned = bin(int(binascii.hexlify(ungz), 16))

"Just"? You would need to strip off '0b' from the front, and then pad the front with as many leading zeroes as necessary to make the length a multiple of 8. Example, with a better method:

>>> import binascii
>>> ungz = '\x01\x80'
>>> bin(int(binascii.hexlify(ungz), 16))
'0b110000000'
>>> ''.join('{0:08b}'.format(ord(x)) for x in ungz)
'0000000110000000'

Have you checked carefully to ensure that you really want '0000000110000000' and not '1000000000000001'?
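
The two candidate strings differ only in the bit order within each byte. A small Python 3 sketch of both conventions follows; the function names are hypothetical, and which order is correct depends on how the other application packed the bits.

```python
def bits_msb_first(data):
    """Render each byte with its most significant bit first."""
    return "".join(format(byte, "08b") for byte in data)


def bits_lsb_first(data):
    """Render each byte with its least significant bit first."""
    return "".join(format(byte, "08b")[::-1] for byte in data)


sample = b"\x01\x80"
print(bits_msb_first(sample))  # 0000000110000000
print(bits_lsb_first(sample))  # 1000000000000001
```

Both functions always emit exactly 8 characters per byte, so no manual stripping of '0b' or zero-padding is needed.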

Upvotes: 2
