JHiant

Reputation: 539

Faster, better gunzip (and general file input/output) in python

How do you gzip/gunzip files using python at a speed comparable to the underlying libraries?

tl;dr - Use shutil.copyfileobj(f_in, f_out).

I'm decompressing *.gz files as part of a larger series of file processing, and profiling to try to get python to perform "close" to built in scripts. With the amount of data I'm working with, this matters, and it seems like a generally important thing to understand.

Using the 'gunzip' bash command on a ~500MB file as follows yields:

$ time gunzip data.gz -k

real    0m24.805s

A naive python implementation looks like:

import gzip

with open('data','wb') as out:
    with gzip.open('data.gz','rb') as fin:
        s = fin.read()
        out.write(s)

real    2m11.468s

Skip the intermediate variable (note that fin.read() still pulls the whole file into memory at once):

with open('data','wb') as out:
    with gzip.open('data.gz','rb') as fin:
        out.write(fin.read())

real    1m35.285s

Check the local machine's default buffer size:

>>> import io
>>> print(io.DEFAULT_BUFFER_SIZE)
8192

Use explicit buffer sizes (note: the third positional argument to gzip.open is compresslevel, not a buffer size, so only the plain open() call is actually affected):

with open('data','wb', 8192) as out:
    with gzip.open('data.gz','rb', 8192) as fin:
        out.write(fin.read())

real    1m19.965s

Use as much buffering as possible:

with open('data','wb',1024*1024*1024) as out:
    with gzip.open('data.gz','rb', 1024*1024*1024) as fin:
        out.write(fin.read())

real    0m50.427s

So clearly it is buffering/IO bound.
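A variant worth mentioning (not one of the timed runs above) is to copy in fixed-size chunks rather than with a single giant read(), so memory use stays flat regardless of file size. This is only a sketch, and the chunk size is an arbitrary value to tune:

import gzip

CHUNK = 1024 * 1024  # 1 MiB per read; purely illustrative, tune as needed

with open('data', 'wb') as out:
    with gzip.open('data.gz', 'rb') as fin:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            out.write(chunk)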

I have a moderately complex version that runs in 36sec, but involves a pre-allocated buffer and tight inner loop. I expect there's a "better way."

The code above is reasonable and clear, albeit still slower than the bash command. A solution that is extremely roundabout or complicated doesn't suit my needs; my main caveat is that I would like to see a "pythonic" answer.

Of course, there's always this solution:

import subprocess
subprocess.call(["gunzip", "-k", "data.gz"])

real    0m24.332s
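(As an aside, on Python 3.5+ the same shell-out can be written with subprocess.run, which will raise if gunzip exits with an error; this was not timed separately:)

import subprocess

# -k keeps data.gz; check=True raises CalledProcessError on a non-zero exit status
subprocess.run(["gunzip", "-k", "data.gz"], check=True)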

But for the purposes of this question: is there a faster way of processing files "pythonically"?

Upvotes: 7

Views: 7657

Answers (1)

JHiant

Reputation: 539

I'm going to post my own answer. It turns out that you do need to use an intermediate buffer; Python doesn't handle this especially well for you. You also need to experiment with the size of that buffer: in my case the default buffer size (8192 bytes) turned out to be optimal, while both a very large buffer (1GB) and a smaller-than-default one (1KB) were slower.

Additionally, I tried the built-in io.BufferedReader and io.BufferedWriter classes with their readinto() methods, and found that they were not necessary. (That's not entirely true: the file object returned by gzip.open already behaves like a BufferedReader, so it provides readinto() itself.)
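For reference, that experiment looked roughly like the sketch below. This is a reconstruction rather than the exact code that was timed, and the buffer size is just the default value from above:

import gzip
import io

BUF_SIZE = 8192  # illustrative; matches io.DEFAULT_BUFFER_SIZE above

buf = bytearray(BUF_SIZE)
# buffering=0 returns the raw, unbuffered FileIO so we can layer on our own BufferedWriter.
with open('data', 'wb', buffering=0) as raw_out:
    with io.BufferedWriter(raw_out, buffer_size=BUF_SIZE) as fout:
        with gzip.open('data.gz', 'rb') as gz:
            reader = io.BufferedReader(gz, buffer_size=BUF_SIZE)
            while True:
                n = reader.readinto(buf)
                if not n:
                    break
                fout.write(buf[:n])  # only the bytes actually read

The simpler loop I actually timed, relying on the readinto() that gzip.open's file object already provides, is: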

import gzip

buf = bytearray(8192)
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        while True:
            n = fin.readinto(buf)
            if not n:
                break
            # write only the bytes actually read, so the final partial
            # chunk isn't padded with stale data from the buffer
            fout.write(buf[:n])

real    0m27.961s

While I suspect this is a well-known Python pattern, it seems a lot of people were confused by it, so I will leave this here in the hope that it helps others.

@StefanPochmann got the correct answer. I hope he posts it so I can accept it. The solution is:

import gzip
import shutil
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        shutil.copyfileobj(fin, fout)

real    0m26.126s
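As a side note, shutil.copyfileobj accepts an optional buffer length as its third argument (the default was 16KB in older Python versions), so the copy size can be tuned without writing a loop yourself. The 1MB value below is just an example and was not timed separately:

import gzip
import shutil

with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        # third argument is the size in bytes of the intermediate copy buffer
        shutil.copyfileobj(fin, fout, 1024 * 1024)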

Upvotes: 8
