Delgan

Reputation: 19627

Python "gzip" module acting weirdly if compressed extension is not ".gz"

I need to compress a file using the gzip module, but the output file extension may not be .gz.

Look at this simple code:

import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

It works fine. But if I replace ".gz" with ".gzip" for example, then I am not able to open the compressed file correctly:

[Screenshot: uncompressing not working]

I tried with both 7-Zip and WinRAR and the result is the same. The problem persists even if I rename the file afterwards.

Does anyone know where the problem comes from, please?

I tried with compression bz2 and lzma, they seem to work properly no matter what the extension is.

Upvotes: 4

Views: 2523

Answers (2)

Roland Pihlakas

Reputation: 4573

The filename stored inside the archive can be controlled by using the gzip.GzipFile constructor instead of gzip.open. GzipFile then needs a separate built-in open call for the output file before it.

with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out: 
        ...
        f_out.flush()

Note also the added f_out.flush(): in my experience, without this line GzipFile may in some cases fail to flush the data before the file is closed, resulting in a corrupt archive.

Or as a complete example:

import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

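# Open the archive ourselves so that its name is independent of the name stored inside it.
with open(output_path, 'wb') as f_out_gz:
    # filename= sets the name recorded in the gzip header.
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
            f_out.flush()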
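
As a rough check of this approach, the sketch below writes to an output path ending in .gzip and then inspects the raw bytes of the result. It assumes that Python's gzip writer emits no optional extra field, so the stored name starts right after the 10-byte fixed header. The helper name gzip_with_inner_name and the temporary directory are made up for illustration.

```python
import gzip
import os
import shutil
import tempfile

def gzip_with_inner_name(input_path, output_path):
    """Compress input_path into output_path, storing input_path's base name in the header."""
    inner_name = os.path.basename(input_path)
    with open(input_path, "rb") as f_in, open(output_path, "wb") as raw_out:
        # filename= only affects the name written into the gzip header.
        with gzip.GzipFile(fileobj=raw_out, filename=inner_name, mode="wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "test.txt")
    output_path = input_path + ".gzip"  # deliberately not ".gz"
    with open(input_path, "w") as f:
        f.write("abc" * 10)
    gzip_with_inner_name(input_path, output_path)
    with open(output_path, "rb") as f:
        raw = f.read()
    with gzip.open(output_path, "rb") as f:
        restored = f.read()

# Assumption: no FEXTRA field, so the zero-terminated name begins at byte 10.
header_name = raw[10:10 + len(b"test.txt") + 1]
print(header_name)
print(restored)
```

If the assumption holds, the stored name should be test.txt even though the archive itself is called test.txt.gzip.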

Upvotes: 3

zvone

Reputation: 19352

You actually end up with two different files this way:

First, .gz file:

with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

Second, .gzip file:

with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)

Both create a gzip archive containing the contents of your test.txt. The only difference is that in the second case, the file name stored inside the archive is test.txt.gzip instead of test.txt.
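
To illustrate that the compressed data itself is fine in both cases, here is a minimal sketch that writes the same bytes under both extensions and reads them back with gzip.open. The temporary directory and variable names are only for the example.

```python
import gzip
import os
import tempfile

payload = b"abc" * 10
restored = {}

with tempfile.TemporaryDirectory() as tmp:
    for extension in (".gz", ".gzip"):
        archive_path = os.path.join(tmp, "test.txt" + extension)
        # Compress the payload under this extension...
        with gzip.open(archive_path, "wb") as f_out:
            f_out.write(payload)
        # ...and decompress it again straight away.
        with gzip.open(archive_path, "rb") as f_in:
            restored[extension] = f_in.read()

print(restored[".gz"] == restored[".gzip"] == payload)
```

Python reads both archives back unchanged, which suggests the difference seen in 7-Zip or WinRAR comes from the stored name rather than from the data.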


The problem is that the argument to gzip.open actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).

So, if you do gzip.open("abcd", 'wb') and write to it, it will create a gzip archive named abcd with a file named abcd inside.

But then, there comes magic: if the filename ends with .gz, it behaves differently, e.g. gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.

So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.
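
One way to see this for yourself is to read the name field back out of the gzip header. The sketch below follows the header layout from RFC 1952 (10 fixed bytes, an optional extra field, then an optional zero-terminated name). The helpers stored_name and compress_as are hypothetical names used only here.

```python
import gzip
import os
import tempfile

FEXTRA, FNAME = 0x04, 0x08  # flag bits from the gzip header (RFC 1952)

def stored_name(archive_path):
    """Return the original file name recorded in a gzip header, or None if absent."""
    with open(archive_path, "rb") as f:
        header = f.read(10)
        if header[:2] != b"\x1f\x8b":
            raise ValueError("not a gzip file")
        flags = header[3]
        if flags & FEXTRA:
            # Skip the optional extra field: 2-byte little-endian length, then data.
            extra_len = int.from_bytes(f.read(2), "little")
            f.read(extra_len)
        if not flags & FNAME:
            return None
        # The name is a zero-terminated Latin-1 string.
        name = bytearray()
        while (byte := f.read(1)) not in (b"", b"\x00"):
            name += byte
        return name.decode("latin-1")

def compress_as(directory, archive_name, payload=b"abc" * 10):
    """Write payload with gzip.open under the given archive name and return its path."""
    path = os.path.join(directory, archive_name)
    with gzip.open(path, "wb") as f_out:
        f_out.write(payload)
    return path

with tempfile.TemporaryDirectory() as tmp:
    name_gz = stored_name(compress_as(tmp, "test.txt.gz"))
    name_gzip = stored_name(compress_as(tmp, "test.txt.gzip"))

print(name_gz)
print(name_gzip)
```

With CPython's gzip module, the first name comes back with the .gz suffix stripped and the second keeps its full .gzip name, which matches the behaviour described above.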

Upvotes: 4
