Reputation: 601
I am writing a module that needs to be able to deal with a large number of zip file pretty fast. As such, I was going to use something implemented in C rather than Python (from which I'll be calling the extractor). To try and test which method would be fastest, I wrote a test script comparing linux's 'unzip' command vs the czipfile python module (wrapper around a c zip extractor). As a control, I used the native python zipfile module.
The script creates a zipfile that's around 100MB out of 100 ~1MB files. It looks at 3 scenarios. A) The files are all just random bytestrings. B)The files are just random hex characters C)The files are uniform random sentences with line breaks.
In all cases, the performance of zipfile (implemented in python) was on par with or significantly better than the two extractors implemented in c.
Any ideas why this could be happening? The script is attached. Requires czipfile and the 'unzip' command available in the shell.
from datetime import datetime
import zipfile
import czipfile
import os, binascii, random
class ZipTestError(Exception):
pass
class ZipTest:
procs = ['zipfile', 'czipfile', 'os']
type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}
# three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
def __init__(self):
print """Testing Random Byte Files:
"""
self.test('r')
self.test('h')
self.test('s')
@staticmethod
def rand_name():
return binascii.b2a_hex(os.urandom(10))
def make_file(self, t):
f_name = self.rand_name()
f = open(f_name, 'w')
if t == 'r':
f.write(os.urandom(1048576))
elif t == 'h':
f.write(binascii.b2a_hex(os.urandom(1048576)))
elif t == 's':
for i in range(76260):
ops = ['dog', 'cat', 'rat']
ops2 = ['meat', 'wood', 'fish']
n1 = int(random.random()*10) % 3
n2 = int(random.random()*10) % 3
sentence = """The {0} eats {1}
""".format(ops[n1], ops2[n2])
f.write(sentence)
else:
raise ZipTestError('Invalid Type')
f.close()
return f_name
#create a ~100MB zip file to test extraction on.
def create_zip_test(self, t):
self.file_names = []
self.output_names = []
for i in range(100):
self.file_names.append(self.make_file(t))
self.zip_name = self.rand_name()
output = zipfile.ZipFile(self.zip_name, 'w', zipfile.ZIP_DEFLATED)
for f in self.file_names:
output.write(f)
output.close()
def clean_up(self, rem_zip = False):
for f in self.file_names:
os.remove(f)
self.file_names = []
for f in self.output_names:
os.remove(f)
self.output_names = []
if rem_zip:
if getattr(self, 'zip_name', False):
os.remove(self.zip_name)
self.zip_name = False
def display_res(self, res, t):
print """
{0} results:
""".format(self.type_map[t])
for p in self.procs:
print"""
{0} = {1} milliseconds""".format(p, str(res[p]))
def test(self, t):
self.create_zip_test(t)
res = self.unzip()
self.display_res(res, t)
self.clean_up(rem_zip = True)
def unzip(self):
res = dict()
for p in self.procs:
self.clean_up()
res[p] = getattr(self, "unzip_with_{0}".format(p))()
return res
def unzip_with_zipfile(self):
return self.unzip_with_python(zipfile)
def unzip_with_czipfile(self):
return self.unzip_with_python(czipfile)
def unzip_with_python(self, mod):
f = open(self.zip_name)
zf = mod.ZipFile(f)
start = datetime.now()
op = './'
for name in zf.namelist():
zf.extract(name,op)
self.output_names.append(name)
end = datetime.now()
total = end-start
ms = total.microseconds
ms += (total.seconds) * 1000000
return ms /1000
def unzip_with_os(self):
f = open(self.zip_name)
start = datetime.now()
zf = zipfile.ZipFile(f)
for name in zf.namelist():
self.output_names.append(name)
os.system("unzip -qq {0}".format(f.name))
end = datetime.now()
total = end-start
ms = total.microseconds
ms += (total.seconds) * 1000000
return ms /1000
if __name__ == '__main__':
ZipTest()
Upvotes: 2
Views: 2939
Reputation: 997
Even if C is generally faster than interpreted languages, assuming the algorithm is the same, different buffering strategies can make a difference. Here's some evidence:
I made a couple of changes to your script. The diff is below.
I made the stopwatch start just before os.system
. This change is not noticeable, since reading off entries from the Central Directory is quick. So I saved the zip files and measured unzip time with the time
shell builtin, outside Python. The result shows that the overhead of firing up new processes doesn't matter so much.
A more interesting change is the addition of libarchive. The results I obtained are like so (milliseconds):
Random Hex Sentences
zipfile 368 1909 604
czipfile 241 1600 2313
os 707 2225 784
shell-measured 797 2272 737
libarchive 248 1513 451
EXTRACTION METHOD
Note that results vary by some milliseconds every time. The shell measures real, user, and sys time (see What do 'real', 'user' and 'sys' mean in the output of time(1)?). The figures above reflect real time, for consistency with other measurements.
A better analysis of what system calls unzip issues can be achieved by strace -c -w
. It shows a spike of reads for Hex:
Random Hex Sentences
read 805 14597 12816
write 2600 3200 1600
SYSTEM CALLS ISSUED BY unzip
Now for the diff (it assumes the original script is named ziptest.py
in the same directory where you run patch < _diff_
, see patch, diff)
--- ziptest.py.orig 2017-05-25 10:36:03.106994889 +0200
+++ ziptest.py 2017-05-25 11:30:42.032598259 +0200
@@ -2,6 +2,7 @@
import zipfile
import czipfile
import os, binascii, random
+import libarchive.public
class ZipTestError(Exception):
pass
@@ -10,7 +11,7 @@
class ZipTest:
- procs = ['zipfile', 'czipfile', 'os']
+ procs = ['zipfile', 'czipfile', 'os', 'libarchive']
type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}
# three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
@@ -119,10 +120,10 @@
def unzip_with_os(self):
f = open(self.zip_name)
- start = datetime.now()
zf = zipfile.ZipFile(f)
for name in zf.namelist():
self.output_names.append(name)
+ start = datetime.now()
os.system("unzip -qq {0}".format(f.name))
end = datetime.now()
total = end-start
@@ -130,7 +131,15 @@
ms += (total.seconds) * 1000000
return ms /1000
-
+ def unzip_with_libarchive(self):
+ start = datetime.now()
+ for entry in libarchive.public.file_pour(self.zip_name):
+ self.output_names.append(str(entry))
+ end = datetime.now()
+ total = end-start
+ ms = total.microseconds
+ ms += (total.seconds) * 1000000
+ return ms /1000
Upvotes: 1
Reputation: 601
As was pointed out above, the decryption is done in python, not the decompression. So zipfile is just using a c implementation like the other two.
Upvotes: 2