Reputation: 21
I just found that Python's hashlib.md5 might be faster than coreutils md5sum.
python hashlib
import hashlib
import os

def get_hash(fpath, algorithm='md5', block=32768):
    if not hasattr(hashlib, algorithm):
        return ''
    m = getattr(hashlib, algorithm)()
    if not os.path.isfile(fpath):
        return ''
    # Read in binary mode so the digest matches md5sum's output.
    with open(fpath, 'rb') as f:
        while True:
            data = f.read(block)
            if not data:
                break
            m.update(data)
    return m.hexdigest()
coreutils md5sum
import os
from subprocess import Popen, PIPE

def shell_hash(fpath, method='md5sum'):
    if not os.path.isfile(fpath):
        return ''
    cmd = [method, fpath]  # shlex.split removed; the argument list is built directly
    p = Popen(cmd, stdout=PIPE)
    output, _ = p.communicate()
    if p.returncode:
        return ''
    return output.split()[0]
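For completeness, here is one way the two approaches could be timed against the same file. This is a self-contained sketch, not the exact benchmark from the post: py_md5, shell_md5, the 1 MiB temporary file, and the use of time.time() are all illustrative choices, and md5sum must be on PATH.

```python
import hashlib
import os
import subprocess
import tempfile
import time

def py_md5(fpath, block=32768):
    """Hash a file with hashlib, reading fixed-size binary chunks."""
    m = hashlib.md5()
    with open(fpath, 'rb') as f:
        for chunk in iter(lambda: f.read(block), b''):
            m.update(chunk)
    return m.hexdigest()

def shell_md5(fpath):
    """Hash a file by shelling out to coreutils md5sum."""
    out = subprocess.check_output(['md5sum', fpath])
    return out.split()[0].decode()

# A small throwaway file; use a multi-gigabyte file for numbers
# comparable to the table below.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))  # 1 MiB of random bytes
    path = tmp.name

try:
    t0 = time.time()
    h_py = py_md5(path)
    t1 = time.time()
    h_sh = shell_md5(path)
    t2 = time.time()
    assert h_py == h_sh  # both tools must agree on the digest
    print('hashlib: %.4fs  md5sum: %.4fs' % (t1 - t0, t2 - t1))
finally:
    os.unlink(path)
```

On a file this small the subprocess startup cost dominates; the per-byte throughput difference only becomes visible on large files.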
Below are my test results: four columns of times (in seconds) for computing md5 and sha1.
1st column: time for coreutils md5sum or sha1sum.
2nd column: time for python hashlib md5 or sha1, reading 1048576-byte chunks.
3rd column: time for python hashlib md5 or sha1, reading 32768-byte chunks.
4th column: time for python hashlib md5 or sha1, reading 512-byte chunks.
4.08805298805 3.81827783585 3.72585606575 5.72505903244
6.28456497192 3.69725108147 3.59885907173 5.69266486168
4.08003306389 3.82310700417 3.74562311172 5.74706888199
6.25473690033 3.70099711418 3.60972714424 5.70108985901
4.07995700836 3.83335709572 3.74854302406 5.74988412857
6.26068210602 3.72050404549 3.60864400864 5.69080018997
4.08979201317 3.83872914314 3.75350999832 5.79242300987
6.28977203369 3.69586396217 3.60469412804 5.68853116035
4.0824379921 3.83340883255 3.74298214912 5.73846316338
6.27566385269 3.6986720562 3.6079480648 5.68188500404
4.10092496872 3.82357311249 3.73044300079 5.7778570652
6.25675201416 3.78636980057 3.62911510468 5.71392583847
4.09579920769 3.83730792999 3.73345088959 5.73320293427
6.26580905914 3.69428491592 3.61320495605 5.69155502319
4.09030103683 3.82516098022 3.73244214058 5.72749185562
6.26151800156 3.6951239109 3.60320997238 5.70400810242
4.07977604866 3.81951498985 3.73287010193 5.73037815094
6.26691818237 3.72077894211 3.60203289986 5.71795105934
4.08536100388 3.83897590637 3.73681998253 5.73614501953
6.2943251133 3.72131896019 3.61498594284 5.69963502884
(My computer has a 4-core i3-2120 CPU @ 3.30 GHz and 4 GB of memory.
The file hashed by these programs is about 2 GB in size.
The odd rows are md5 and the even rows are sha1.
The times in this table are in seconds.)
Over more than 100 test runs, I found Python hashlib was always faster than md5sum or sha1sum.
I also read some of the source in Python 2.7's Modules/{md5.c,md5.h,md5module.c} and gnulib's lib/{md5.c,md5.h}. Both are implementations of MD5 (RFC 1321).
In gnulib, md5 reads in 32768-byte chunks.
I don't know much about MD5 or the C source code. Could someone help me explain these results?
The other reason I ask is that many people take it for granted that md5sum is faster than Python's hashlib, and so prefer shelling out to md5sum when writing Python code. But that seems to be wrong.
Upvotes: 2
Views: 1344
Reputation: 31708
coreutils has its own C implementation, whereas Python calls out to libcrypto, which has architecture-specific assembly implementations. The difference is even greater with sha1. This has now been fixed in coreutils-8.22 (when configured --with-openssl), and it is enabled in newer distros like Fedora 21, RHEL 7, Arch, etc.
Note that calling out to the command, even though it is currently slower on some systems, is a better long-term strategy, as one can take advantage of all the logic encapsulated within the separate commands rather than reimplementing it. For example, coreutils has pending support for improved reading of sparse files, so that zeros are not redundantly read from the kernel. Better to take advantage of that transparently if possible.
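One way to check which case a given system falls into is to see whether the installed md5sum binary links against OpenSSL's libcrypto. This is a Linux-specific check and assumes ldd is available; it is a diagnostic sketch, not something from the answer itself.

```shell
# If md5sum was built --with-openssl, libcrypto appears among its
# dynamic dependencies; otherwise it uses the built-in gnulib MD5 code.
ldd "$(command -v md5sum)" | grep -i libcrypto \
    || echo "no libcrypto: built-in MD5 implementation"
```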
Upvotes: 3
Reputation: 309821
I'm not sure exactly how you're timing this, but the discrepancy is likely due to the time you spend spinning up a subprocess (consider the parsing time of shlex.split as well) each time you call shell_hash.
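A rough way to see this per-call cost is to time a do-nothing subprocess against an in-process hash of a small buffer. This is an illustrative sketch; spawn_once, the 64 KiB buffer, and the 50 repetitions are arbitrary choices, not from the answer.

```python
import hashlib
import subprocess
import timeit

def spawn_once():
    # Fork+exec a trivial command; this overhead is paid on every shell_hash call.
    subprocess.check_call(['true'])

# Average cost of spawning a subprocess vs hashing 64 KiB in-process.
spawn_s = timeit.timeit(spawn_once, number=50) / 50
hash_s = timeit.timeit(lambda: hashlib.md5(b'x' * 65536).hexdigest(), number=50) / 50
print('spawn: %.3f ms  md5(64 KiB): %.3f ms' % (spawn_s * 1e3, hash_s * 1e3))
```

On a typical Linux box the spawn alone costs on the order of a millisecond, which swamps the hashing time for small inputs, though it becomes negligible for a 2 GB file.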
Upvotes: 1