godzilla

Reputation: 3125

gzip a file quicker using Python?

I am attempting to gzip files more quickly with Python; my files range from as small as 30 MB to as large as 4 GB.

Is there a more efficient way of creating a gzip file than the following? Also, is there a way to optimize it so that, if a file is small enough to fit in memory, it simply reads the whole file at once rather than going line by line?

import gzip

with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        f_out.writelines(f_in)

Upvotes: 3

Views: 6143

Answers (3)

tdelaney

Reputation: 77337

Copy the file in bigger chunks using the shutil.copyfileobj() function. In this example I'm using 16 MiB blocks, which is a reasonable size.

import gzip
import shutil

MEG = 2**20
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16*MEG)

You may find that calling out to gzip is faster for large files, especially if you plan to zip multiple files in parallel.
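As a minimal sketch of that idea, the following shells out to the external gzip command for a batch of files in parallel. It assumes the gzip binary is on PATH, and the helper name is hypothetical; note that gzip replaces each input file with its .gz counterpart by default.

```python
import subprocess

def gzip_files_in_parallel(paths):
    """Launch one external gzip process per file, then wait for all of them."""
    procs = [subprocess.Popen(["gzip", "-f", p]) for p in paths]
    for proc in procs:
        proc.wait()
```

Each compression then runs in its own OS process, so multiple files can be compressed concurrently without Python's GIL being a factor.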

Upvotes: 5

tryptofame

Reputation: 392

Below are two almost identical methods for reading gzip files:

  • A.) Load everything into memory --> can be a bad choice for very big files (several GB), because you can run out of memory
  • B.) Don't load everything into memory; read line by line --> good for BIG files

Adapted from https://codebright.wordpress.com/2011/03/25/139/, https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/ and http://pastebin.com/dcEJRs1i

import subprocess
import sys

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

A.)

def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line

B.)

def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line
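As a self-contained usage sketch of method B: gzcat is not available everywhere, so this variant calls "gzip -cd", which behaves the same (it assumes the external gzip binary is on PATH). Lines are yielded as bytes.

```python
import gzip
import os
import subprocess
import tempfile

def yield_line_gz_file(fn):
    """Yield lines from a gzip file decompressed by an external process."""
    ph = subprocess.Popen(["gzip", "-cd", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line
    ph.wait()

# Write a tiny gzip file, then stream it back line by line.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wb") as f:
    f.write(b"first line\nsecond line\n")

print(list(yield_line_gz_file(path)))  # [b'first line\n', b'second line\n']
```

Because the generator never buffers the full decompressed output, memory use stays flat even for multi-GB files.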

Upvotes: 0

Amit

Reputation: 20456

Instead of reading the file line by line, you can read it all at once. Example:

import gzip

with open(j, 'rb') as f_in:
    content = f_in.read()
with gzip.open(j + '.gz', 'wb') as f_out:
    f_out.write(content)

Upvotes: 0
