Reputation: 539
How do you gzip/gunzip files using python at a speed comparable to the underlying libraries?
tl;dr - Use shutil.copyfileobj(f_in, f_out).
I'm decompressing *.gz files as part of a larger series of file-processing steps, and profiling to try to get Python to perform "close" to the built-in command-line tools. With the amount of data I'm working with, this matters, and it seems like a generally important thing to understand.
Using the 'gunzip' bash command on a ~500MB file as follows yields:
$time gunzip data.gz -k
real 0m24.805s
A naive python implementation looks like:
import gzip

with open('data', 'wb') as out:
    with gzip.open('data.gz', 'rb') as fin:
        s = fin.read()
        out.write(s)
real 2m11.468s
Skip the intermediate variable (the whole decompressed file is still read into memory):
with open('data', 'wb') as out:
    with gzip.open('data.gz', 'rb') as fin:
        out.write(fin.read())
real 1m35.285s
Check the local machine's default buffer size:
>>> import io
>>> print io.DEFAULT_BUFFER_SIZE
8192
Use buffering:
with open('data', 'wb', 8192) as out:
    with gzip.open('data.gz', 'rb', 8192) as fin:
        out.write(fin.read())
real 1m19.965s
Use as much buffering as possible:
with open('data', 'wb', 1024*1024*1024) as out:
    with gzip.open('data.gz', 'rb', 1024*1024*1024) as fin:
        out.write(fin.read())
real 0m50.427s
So clearly it is buffering/IO bound.
I have a moderately complex version that runs in 36sec, but involves a pre-allocated buffer and tight inner loop. I expect there's a "better way."
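One way to structure that kind of tight loop is to drop down to zlib and decompress fixed-size chunks directly; a sketch of the idea (the chunk size and details here are illustrative, not my exact code):
import zlib

CHUNK = 1024 * 1024                          # illustrative chunk size
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect gzip framing
with open('data.gz', 'rb') as fin:
    with open('data', 'wb') as fout:
        while True:
            chunk = fin.read(CHUNK)          # read a fixed-size compressed chunk
            if not chunk:
                break
            fout.write(d.decompress(chunk))  # decompress and write it out
        fout.write(d.flush())                # flush any remaining decompressed bytes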
The code above is reasonable and clear, albeit still slower than the bash command. A solution that is extremely roundabout or complicated doesn't suit my needs; my main caveat is that I would like to see a "pythonic" answer.
Of course, there's always this solution:
import subprocess
subprocess.call(["gunzip", "-k", "data.gz"])
real 0m24.332s
But for the purposes of this question: is there a faster way of processing files "pythonically"?
Upvotes: 7
Views: 7657
Reputation: 539
I'm going to post my own answer. It turns out that you do need to use an intermediate buffer; Python doesn't handle this for you terribly well. You do need to play around with the size of that buffer, but the default buffer size turns out to be the optimal choice: in my case, both a much larger buffer (1GB) and a smaller-than-default one (1KB) were slower.
Additionally, I tried wrapping the streams in the built-in io.BufferedReader and io.BufferedWriter classes to use their readinto() support, and found that this was not necessary. (Not entirely true: the object gzip.open() returns is already a buffered reader, so it provides readinto() on its own; see the sketch after the timing below.)
import gzip

buf = bytearray(8192)
with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        while True:
            n = fin.readinto(buf)
            if not n:
                break
            fout.write(buf[:n])  # write only the bytes actually read
real 0m27.961s
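For reference, the explicitly wrapped version was roughly of this shape (a sketch, not my exact code; buffer sizes are left at the defaults):
import gzip
import io

buf = bytearray(8192)
# Wrap both ends in explicit buffered objects. The extra BufferedReader layer
# buys nothing, because gzip.open() already returns a buffered reader.
with io.BufferedWriter(io.FileIO('data', 'w')) as fout:
    with io.BufferedReader(gzip.open('data.gz', 'rb')) as fin:
        while True:
            n = fin.readinto(buf)
            if not n:
                break
            fout.write(buf[:n])  # write only the bytes actually read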
While I suspect this is a well-known Python pattern, it seems a lot of people were confused by it, so I will leave this here in the hope that it helps others.
@StefanPochmann got the correct answer. I hope he posts it and I will accept. The solution is:
import gzip
import shutil

with open('data', 'wb') as fout:
    with gzip.open('data.gz', 'rb') as fin:
        shutil.copyfileobj(fin, fout)
real 0m26.126s
Upvotes: 8