I have a gzipped string that was created and stored by another application. Now that I have the string (no mimetype or headers attached), I need to uncompress it.
Is there a way to do this in Python?
[EDIT] To test, I literally copied and pasted the string into Notepad and then renamed the file to .gz.
I've also tested by pasting the string itself into IDLE.
Other examples I've seen assume a filetype and mimetype are available, and all I have is a big string.
Using zlib.decompress(mystring) gives the error: Error -3 while decompressing data: incorrect header check
Upvotes: 3
Views: 4746
Reputation: 82924
Confirming the comments by @reclosedev, and adding some more:
The bytes after the ] need to be base64-decoded.
In the result of that, the first 4 bytes hold the length of the decompressed data as a 32-bit little-endian binary number. The remainder is an RFC-1952-compliant gzip stream, recognisable by starting with 1F 8B 08. The decompression results look like binary data, not strings of ASCII 1s and 0s.
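The header bytes also explain the "incorrect header check" error from the question. The following is a small Python 3 sketch, not taken from the original data: the payload is a made-up sample, and it only illustrates that zlib's default settings expect a zlib (RFC 1950) header, whereas wbits=31 makes it accept a gzip header and trailer.

```python
import gzip
import zlib

# Hypothetical sample payload; any gzip stream should behave the same way.
payload = b"\x00\xff" * 50
gz = gzip.compress(payload)

# An RFC 1952 stream starts with the magic bytes 1F 8B, then 08 (deflate).
print(gz[:3].hex())

# Default wbits expects a zlib header, so a gzip stream is rejected.
try:
    zlib.decompress(gz)
except zlib.error as exc:
    print("default wbits failed:", exc)

# wbits=31 (16 + zlib.MAX_WBITS) selects gzip framing.
restored = zlib.decompress(gz, 31)
print(restored == payload)
```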
Code:
# Python 2 syntax (print statements, str.decode('base64'))
import struct
import zlib

lines = [
    # extracted from the linked csv file
    "[133,120,696,286]MmEAAB+LCAAAAAAABADtvQdg [BIG snip] a0bokyYQAA",
    "[73,65,564,263]bkgAAB+LCAAAAAAABADtvQdgHE [BIG snip] kgAAA==",
]

for line in lines:
    print
    # everything after the "]" is the base64 payload
    b64 = line.split(']')[1]
    raw = b64.decode('base64')
    # the first 4 bytes hold the decompressed length (32-bit little-endian)
    print "unknown:", repr(raw[:4])
    print "unknown as 32-bit LE int:", struct.unpack("<I", raw[:4])[0]
    # wbits=31 makes zlib expect a gzip header and trailer
    ungz = zlib.decompress(raw[4:], 31)
    print len(ungz), "bytes in decompressed data"
    print "first 100:", repr(ungz[:100])
Output:
unknown: '2a\x00\x00'
unknown as 32-bit LE int: 24882
24882 bytes in decompressed data
first 100: '\xff\xe0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xf0\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00'
unknown: 'nH\x00\x00'
unknown as 32-bit LE int: 18542
18542 bytes in decompressed data
first 100: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\xff\xff\xff\xff
\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x07\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x80
\x00\x00\x00'
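For readers on Python 3, the same steps might look roughly like the sketch below. It is an adaptation, not the code that produced the output above. Because the real lines are snipped, it builds its own test line in the same layout; decode_record and make_record are hypothetical names introduced only for this illustration.

```python
import base64
import gzip
import struct
import zlib


def decode_record(line):
    """Return (declared_length, decompressed_bytes) for one '[...]<base64>' line."""
    # Everything after the first "]" is the base64 payload.
    b64 = line.split("]", 1)[1]
    raw = base64.b64decode(b64)
    # First 4 bytes: decompressed length as a 32-bit little-endian integer.
    (declared,) = struct.unpack("<I", raw[:4])
    # Remainder: gzip stream; wbits=31 makes zlib accept gzip framing.
    data = zlib.decompress(raw[4:], 31)
    return declared, data


def make_record(prefix, payload):
    """Hypothetical helper: build a line in the same layout, for testing only."""
    blob = struct.pack("<I", len(payload)) + gzip.compress(payload)
    return prefix + base64.b64encode(blob).decode("ascii")


sample = make_record("[1,2,3,4]", b"\xff\xe0" + b"\x00" * 98)
declared, data = decode_record(sample)
print(declared, len(data))
print(repr(data[:8]))
```

Comparing the declared length with len(data) is a cheap sanity check that the 4-byte prefix really is the decompressed size.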
Update in response to comment
The comment said: "To get the 1s and 0s I needed I just added this to the above"
cleaned = bin(int(binascii.hexlify(ungz), 16))
"Just"? You would need to strip off the '0b' from the front, and then pad the front with leading zeroes up to 8 * len(ungz) bits, because bin() drops every leading zero bit, including whole leading zero bytes. Example, with a better method:
>>> import binascii
>>> ungz = '\x01\x80'
>>> bin(int(binascii.hexlify(ungz), 16))
'0b110000000'
>>> ''.join('{0:08b}'.format(ord(x)) for x in ungz)
'0000000110000000'
Have you checked carefully to ensure that you really want '0000000110000000' and not '1000000000000001'?
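A Python 3 sketch of the same conversion, covering both bit orders so they can be compared, might look like this. The function name to_bits and its lsb_first flag are hypothetical; which order is correct depends on how the producing application packed the bits.

```python
def to_bits(data, lsb_first=False):
    """Render bytes as a string of '0'/'1', eight characters per byte."""
    # Iterating over a bytes object in Python 3 yields integers.
    bits = ("{:08b}".format(byte) for byte in data)
    if lsb_first:
        # Reverse the bit order within each byte.
        bits = (b[::-1] for b in bits)
    return "".join(bits)


ungz = b"\x01\x80"
print(to_bits(ungz))                  # 0000000110000000
print(to_bits(ungz, lsb_first=True))  # 1000000000000001
```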
Upvotes: 2