An Ignorant Wanderer

Reputation: 1612

gzip in bash vs python

In Bash, when you gzip a file, the original is not retained. In Python, you can use the gzip library like this (as shown in the "Examples of Usage" section of the gzip documentation):

import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

By default, this retains the original file. I couldn't find a way to not retain it while gzipping. Do I have to wait till gzip is done to delete the file?

Upvotes: 0

Views: 550

Answers (3)

alani

Reputation: 13049

The code below (partially based on tdelaney's answer) will do the following:

  • read the file, compressing on the fly, and storing all the compressed data in memory
  • delete the input file
  • then write the compressed data

This is for the use case where the filesystem is full, so there is no room to write the compressed data while the uncompressed file still exists on disk. Getting around this requires holding all the data in memory (unless you have access to external storage). To keep that memory cost as low as possible, only the compressed data is held in memory in full, while the uncompressed data is read in chunks.

There is of course a risk of data loss if the program is interrupted between deleting the input file and completing writing the compressed data to disk.

There is also the possibility of failure if there is insufficient memory, but the input file would not be deleted in that case because the MemoryError would be raised before the os.unlink is reached.

Note that this does not do what the question literally asks for, namely deleting the input file while still reading from it. That is possible on Unix-like OSes, but it has no practical advantage over the regular command-line gzip behaviour: the disk space is not freed until the file is closed, so you sacrifice recoverability in the event of failure without gaining any extra space to juggle data. (There would still need to be enough disk space for the uncompressed and compressed data to coexist.)

import gzip
import shutil
import os
from io import BytesIO

filename = 'deleteme'

buf = BytesIO()

# compress into memory - don't store all the uncompressed data in memory
# but do store all the compressed data in memory
with open(filename, 'rb') as fin:
    with gzip.open(buf, 'wb') as zbuf:
        shutil.copyfileobj(fin, zbuf)

# sanity check for already compressed data
length = buf.tell()
if length > os.path.getsize(filename):
    raise RuntimeError("data *grew* in size - refusing to delete input")

# delete input file and then write out the compressed data
buf.seek(0)
os.unlink(filename)
with open(filename + '.gz', 'wb') as fout:
    shutil.copyfileobj(buf, fout)
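The point above about unlinking a file that is still open can be checked with a small sketch. It assumes a Unix-like system (it is not expected to work on Windows), and the helper name read_after_unlink is made up for illustration:

```python
import os
import tempfile


def read_after_unlink(data):
    """Write data to a temp file, unlink it while it is open, then read it.

    Hypothetical helper for illustration; assumes a Unix-like OS.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(data)

    with open(path, 'rb') as f:
        os.unlink(path)                       # the name leaves the directory
        name_gone = not os.path.exists(path)  # nothing to find by path now
        content = f.read()                    # the open handle still works
    # Only here, after the handle is closed, can the space be reclaimed.
    return name_gone, content
```

The file's contents stay readable through the open handle even though the path no longer exists, which is why unlinking early does not free any space for the compressed output.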

Upvotes: 1

tdelaney

Reputation: 77337

If you are on a unix-like system, you can unlink the file after opening so that it is no longer found in the file system. But it will still take disk space until you close the now-anonymous file.

import gzip
import shutil
import os
with open('deleteme', 'rb') as f_in:
    with gzip.open('deleteme.gz', 'wb') as f_out:
        os.unlink('deleteme') # *after* we knew the gzip open worked!
        shutil.copyfileobj(f_in, f_out)

As far as I know, this doesn't work on Windows. There, you need to do the remove after the gzip process completes. You could rename the file to something like "thefile.temporary" or even move it to a different directory (fast if the directory is on the same file system, but a full copy if it's on a different one).
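A minimal sketch of the rename-first idea, which should also behave on Windows because the file is only removed after it has been closed. The function name gzip_via_rename and the '.temporary' suffix are assumptions for illustration:

```python
import gzip
import os
import shutil


def gzip_via_rename(path, temp_suffix='.temporary'):
    """Hide the original under a temporary name, compress it, then delete it.

    Hypothetical helper: the name and the suffix are illustrative only.
    """
    temp_path = path + temp_suffix
    gz_path = path + '.gz'
    os.replace(path, temp_path)  # the original name disappears immediately
    try:
        with open(temp_path, 'rb') as f_in:
            with gzip.open(gz_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    except BaseException:
        # Compression failed: give the input its original name back.
        os.replace(temp_path, path)
        raise
    os.remove(temp_path)  # both files are closed by now
    return gz_path
```

The original name vanishes as soon as the function starts, but the data stays on disk under the temporary name until compression has finished, so a failure can be undone.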

Upvotes: 2

fyngyrz

Reputation: 2658

Consider what happens when gzip runs (in Bash, or anywhere else for that matter):

  • gzip needs the original data in order to compress it
  • gzip is designed to handle data of basically arbitrary size
  • Therefore: gzip is unlikely to be holding a temporary copy of the whole file in memory; it is almost certainly deleting the original only after the compression is done.

With those points in mind, an identical strategy for your code is to do the gzip, then delete the file.

Certainly deleting the file isn't onerous (there are several ways to do it), and of course you could package the whole thing in a procedure so as to never have to concern yourself with it again.
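One way such a procedure might look, as a minimal sketch. The name gzip_and_remove and the '.gz' suffix convention are assumptions, not something defined in the question:

```python
import gzip
import os
import shutil


def gzip_and_remove(path):
    """Compress path to path + '.gz', then delete the original.

    Hypothetical helper: the name and suffix convention are illustrative.
    """
    gz_path = path + '.gz'
    with open(path, 'rb') as f_in:
        with gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    # Reached only if compression finished without raising, so the
    # original survives any failure above.
    os.remove(path)
    return gz_path
```

For example, gzip_and_remove('/home/joe/file.txt') would leave only /home/joe/file.txt.gz behind. If compression raises, the original is untouched, though a partial .gz file may remain.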

Upvotes: 2
