Reputation: 69
I have several gz files that do not seem to decompress properly. I picked one of them; it is in a local folder and came from an external source, so I don't know how it was compressed.
I wrote a Python script to illustrate the situation. To have a reference file for testing, I decompressed the gz file and compressed it again (using gzip on Ubuntu) to produce an equivalent gz file on my computer. The two files behave differently:
import gzip
import zlib
import hashlib


def md5(content):
    m = hashlib.md5()
    m.update(content)
    return m.hexdigest()


def decompress_gzip_size(file_name):
    # Read through the gzip module's file interface
    with gzip.open(file_name, 'rb') as f_out:
        f_content = f_out.read()
    print(len(f_content), file_name)
    print(md5(f_content), file_name)


def decompress_open_gzip_size(file_name):
    # Read the raw bytes, then decompress them with gzip.decompress
    with open(file_name, 'rb') as f_out:
        f_content = f_out.read()
    unzip_content = gzip.decompress(f_content)
    print(len(unzip_content), file_name)
    print(md5(unzip_content), file_name)


def decompress_zlib_size(file_name):
    # Read the raw bytes, then decompress them in one shot with zlib
    with open(file_name, 'rb') as f_out:
        f_content = f_out.read()
    unzip_content = zlib.decompress(f_content, 32)
    print(len(unzip_content), file_name)
    print(md5(unzip_content), file_name)


def decompress_zlib_obj(file_name):
    # Same as above, but with a decompression object so unused_data is visible
    decompress_obj = zlib.decompressobj(32)
    with open(file_name, 'rb') as f_out:
        f_content = f_out.read()
    unzip_content = decompress_obj.decompress(f_content)
    print(len(unzip_content), file_name)
    print(len(decompress_obj.unused_data), 'Unused data')
    print(md5(unzip_content), file_name)


external_file = 'external_source_compress.gz'
my_file = 'my-compress-file.gz'

print("decompress_gzip_size")
decompress_gzip_size(my_file)
decompress_gzip_size(external_file)
print("*" * 60)

print("decompress_open_gzip_size")
decompress_open_gzip_size(my_file)
decompress_open_gzip_size(external_file)
print("*" * 60)

print("decompress_zlib_size")
decompress_zlib_size(my_file)
decompress_zlib_size(external_file)
print("*" * 60)

print("decompress_zlib_obj")
decompress_zlib_obj(my_file)
decompress_zlib_obj(external_file)
print("*" * 60)
Execution output is:
decompress_gzip_size
167019534 my-compress-file.gz
a4dd17dd28b89f0b2c300b607cd1a8ba my-compress-file.gz
167019534 external_source_compress.gz
a4dd17dd28b89f0b2c300b607cd1a8ba external_source_compress.gz
************************************************************
decompress_open_gzip_size
167019534 my-compress-file.gz
a4dd17dd28b89f0b2c300b607cd1a8ba my-compress-file.gz
167019534 external_source_compress.gz
a4dd17dd28b89f0b2c300b607cd1a8ba external_source_compress.gz
************************************************************
decompress_zlib_size
167019534 my-compress-file.gz
a4dd17dd28b89f0b2c300b607cd1a8ba my-compress-file.gz
33408639 external_source_compress.gz
4f51ccc64a7baab5ee5e2ce31e816409 external_source_compress.gz
### SIZES AND MD5 DO NOT MATCH ###
************************************************************
decompress_zlib_obj
167019534 my-compress-file.gz
0 Unused data
a4dd17dd28b89f0b2c300b607cd1a8ba my-compress-file.gz
33408639 external_source_compress.gz
46765202 Unused data
4f51ccc64a7baab5ee5e2ce31e816409 external_source_compress.gz
### THERE IS SOME UNUSED DATA IN THE ORIGINAL FILE ###
************************************************************
NOTE: zlib decompression fails if the window size parameter is anything other than 32.
Without setting the window size, this happened:
Traceback (most recent call last):
File "decompress_python.py", line 53, in <module>
decompress_zlib_size(my_file)
File "decompress_python.py", line 26, in decompress_zlib_size
unzip_content = zlib.decompress(f_content)
zlib.error: Error -3 while decompressing data: incorrect header check
As can be seen, without a window size of 32 the script crashes; with a window size of 32 it finishes, but the number of bytes read is only a portion of the real amount of data.
The file has only one element inside:
gzip -l external_source_compress.gz
compressed  uncompressed  ratio   uncompressed_name
  58609586      33410520  -75.4%  external_source_compress
Could anyone help me to understand what is happening here? I'm truly lost. Thanks in advance.
Upvotes: 1
Views: 421
Reputation: 112404
Florian's answer almost certainly describes what is happening: the gzip file you are only partially decompressing must have multiple members. To decompress the next member, simply take the unused_data from the decompressobj object and start another decompression. Repeat until all of the input is used.
This answer shows an example of processing a gzip file with multiple members in python.
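As a rough illustration of that loop, here is a minimal sketch. The helper name decompress_all_members is made up for this example, and it assumes the whole file fits in memory and that nothing other than gzip members follows the first one (trailing padding or garbage would raise zlib.error).

```python
import gzip
import zlib


def decompress_all_members(data: bytes) -> bytes:
    """Decompress every gzip member in `data` and join the results.

    Hypothetical helper for illustration only.
    """
    chunks = []
    while data:
        # 31 = 16 + 15: expect a gzip wrapper with the maximum window size
        d = zlib.decompressobj(31)
        chunks.append(d.decompress(data))
        chunks.append(d.flush())
        if not d.eof:
            # Input ended before this member's end-of-stream marker
            raise ValueError("truncated gzip member")
        # Whatever follows the finished member is the start of the next one
        data = d.unused_data
    return b"".join(chunks)


# Two independently compressed members, simply concatenated
blob = gzip.compress(b"first part, ") + gzip.compress(b"second part")
result = decompress_all_members(blob)
print(result)
```

For files too large to hold in memory, the same idea works chunk by chunk, feeding each new decompressobj the previous one's unused_data before reading more input.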
It sounds like the value you should be using for the wbits parameter is 31, not 32. 31 expects a gzip wrapper as opposed to a zlib wrapper. Not specifying wbits results in the decompressor expecting a zlib wrapper, which is why it doesn't work on your gzip streams.
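To see the effect of wbits on small data, something like the following sketch can be used. The helper try_decompress and the sample payload are invented for this example; the value 47 (32 + 15) is the documented setting for automatic zlib/gzip header detection and is included only for comparison.

```python
import gzip
import zlib

payload = b"hello gzip " * 100
gz_data = gzip.compress(payload)   # gzip wrapper
zl_data = zlib.compress(payload)   # zlib wrapper


def try_decompress(data, wbits):
    """Return the decompressed bytes, or None if zlib rejects the input.

    Hypothetical helper for illustration only.
    """
    try:
        return zlib.decompress(data, wbits)
    except zlib.error as exc:
        print(f"wbits={wbits}: {exc}")
        return None


gz_with_31 = try_decompress(gz_data, 31)   # gzip wrapper expected: works
gz_with_15 = try_decompress(gz_data, 15)   # zlib wrapper expected (default): fails
zl_with_31 = try_decompress(zl_data, 31)   # gzip wrapper expected: fails on zlib data
gz_with_47 = try_decompress(gz_data, 47)   # 32 + 15: header auto-detection
zl_with_47 = try_decompress(zl_data, 47)
```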
Upvotes: 2
Reputation: 33719
gzip streams can consist of multiple items compressed independently, which are then simply concatenated. The expectation is that a gzip-style decompressor transparently reads through all independently compressed parts, producing a single output stream (basically ignoring the internal end-of-stream indicators).
The documentation of zlib.decompressobj does not say so, but it will stop at the first end-of-stream condition that zlib detects. The remaining data can then be processed independently. In some applications (but not yours), it may not even be zlib-compressed data.
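The symptom from the question can be reproduced on a small scale. This is only a sketch with made-up sample data: two gzip members are concatenated, the gzip module reads through both, and a single zlib decompression stops after the first and leaves the second in unused_data.

```python
import gzip
import zlib

first = b"A" * 1000
second = b"B" * 500
member1 = gzip.compress(first)
member2 = gzip.compress(second)
blob = member1 + member2  # a multi-member gzip stream

# The gzip module transparently reads through every member
whole = gzip.decompress(blob)

# A single zlib decompression object stops at the first end of stream
d = zlib.decompressobj(31)
partial = d.decompress(blob)
leftover = d.unused_data

# The one-shot function also returns only the first member's data
one_shot = zlib.decompress(blob, 31)

print(len(whole), len(partial), len(leftover))
```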
Upvotes: 1