Reputation: 3125
I am attempting to gzip files in Python more quickly, as some of my files are as small as 30 MB and as large as 4 GB.
Is there a more efficient way of creating a gzip file than the following? Is there a way to optimize it so that, if a file is small enough to fit in memory, it simply reads the whole file at once rather than going line by line?
import gzip

with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        f_out.writelines(f_in)
Upvotes: 3
Views: 6143
Reputation: 77337
Copy the file in bigger chunks using the shutil.copyfileobj()
function. In this example, I'm using 16 MiB blocks, which is a reasonable size.
import gzip
import shutil

MEG = 2**20
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16*MEG)
You may find that calling out to the external gzip program
is faster for large files, especially if you plan to compress multiple files in parallel.
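For example, here is a minimal sketch of shelling out to the gzip binary and compressing several files in parallel (the filenames are illustrative, and it assumes gzip is on your PATH; note that by default gzip replaces each input file with its .gz counterpart):
import subprocess

files = ["a.dat", "b.dat", "c.dat"]  # illustrative filenames
# Launch one gzip process per file; by default gzip writes file.gz
# and removes the original input file.
procs = [subprocess.Popen(["gzip", f]) for f in files]
for p in procs:
    p.wait()  # block until every gzip process has finished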
Upvotes: 5
Reputation: 392
Below are two almost identical methods for reading gzip files:
adapted from https://codebright.wordpress.com/2011/03/25/139/, https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/ and http://pastebin.com/dcEJRs1i
import sys
import subprocess

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO
A.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    # Read the whole decompressed output into memory, then iterate over it
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line
B.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    # Stream lines directly from the gzcat process as they are produced
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line
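A hypothetical usage example (the path is only illustrative):
# Iterate over the lines of a gzipped file
for line in yield_line_gz_file("/tmp/example.txt.gz"):
    print(line)  # each line is a bytes object under Python 3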
Upvotes: 0
Reputation: 20456
Instead of reading it line by line, you can read it all at once. Example:
import gzip

with open(j, 'rb') as f_in:
    content = f_in.read()

f = gzip.open(j + '.gz', 'wb')
f.write(content)
f.close()
Upvotes: 0