Reputation: 21
I wrote a simple test that does the following:
import hashlib
import os

rootDirectories = ["/bin", "/etc", "/sbin", "/lib", "/lib32", "/lib64"]

def walkDirectory(rootDirectory, ignoreList):
    try:
        directories = dict()
        for dirName, subdirList, fileList in os.walk(rootDirectory):
            if ignoreList:
                for ignore in ignoreList:
                    if ignore in fileList:
                        fileList.remove(ignore)
            directories[str(dirName)] = fileList
        return directories
    except Exception as e:
        print("Error while scanning directories and files: " + str(e))

def generateMd5(directory, blockSize):
    try:
        hash = hashlib.md5()
        with open(directory, "rb") as f:
            for block in iter(lambda: f.read(blockSize), b""):
                hash.update(block)  # OR: pass (I/O only)
        return hash.hexdigest()
    except Exception as e:
        print("Error while taking the hash values: " + str(e))

def generateBlake2(directory, blockSize):
    try:
        hash = hashlib.blake2b()
        with open(directory, "rb") as f:
            for block in iter(lambda: f.read(blockSize), b""):
                hash.update(block)  # OR: pass (I/O only)
        return hash.hexdigest()
    except Exception as e:
        print("Error while taking the hash values: " + str(e))
Blake2 is said to be ~40% faster than MD5 on Intel CPUs, yet in my tests performance is nearly identical for both algorithms across a wide range of block sizes.
Here are my test results; all averages were taken across 50 runs:
I/O ONLY - Empty hashing function (pass)
Hash: Block Size (bytes): Seconds (avg): Files Hashed (avg):
MD5 4096 1.0658537864685058 15666.0
MD5 8192 0.7631869792938233 15666.0
MD5 16384 0.679033899307251 15666.0
MD5 32768 0.6130096673965454 15666.0
MD5 65536 0.5926639556884765 15666.0
MD5 131072 0.6072390222549439 15666.0
MD5 262144 0.6025748205184936 15666.0
MD5 524288 0.629586148262024 15666.0
MD5 1048576 0.6911558246612549 15666.0
MD5 2097152 0.7710119438171387 15666.0
MD5 4194304 0.7501423931121827 15666.0
MD5 8388608 0.7613127708435059 15666.0
MD5 16777216 0.8549494647979736 15666.0
MD5 33554432 1.370493221282959 15666.0
MD5 67108864 1.3940581130981444 15666.0
Blake2 4096 0.9557385349273682 15666.0
Blake2 8192 0.775569143295288 15666.0
Blake2 16384 0.6793924331665039 15666.0
Blake2 32768 0.6285490798950195 15666.0
Blake2 65536 0.6092999029159546 15666.0
Blake2 131072 0.6079203844070434 15666.0
Blake2 262144 0.6007542181015014 15666.0
Blake2 524288 0.5933445692062378 15666.0
Blake2 1048576 0.5961050319671631 15666.0
Blake2 2097152 0.6041217613220214 15666.0
Blake2 4194304 0.6066651153564453 15666.0
Blake2 8388608 0.6075587749481202 15666.0
Blake2 16777216 0.6483844709396362 15666.0
Blake2 33554432 1.3640736722946167 15666.0
Blake2 67108864 1.3651361989974975 15666.0
I/O + HASH Generation
Hash: Block Size (bytes): Seconds (avg): Files Hashed (avg):
MD5 4096 3.174709644317627 15666.0
MD5 8192 2.9633057641983034 15666.0
MD5 16384 2.8538199186325075 15666.0
MD5 32768 2.8014142322540283 15666.0
MD5 65536 2.7751221895217895 15666.0
MD5 131072 2.7487613582611083 15666.0
MD5 262144 2.760332064628601 15666.0
MD5 524288 2.7834008264541628 15666.0
MD5 1048576 2.8470048809051516 15666.0
MD5 2097152 2.9602468061447142 15666.0
MD5 4194304 2.8882385206222536 15666.0
MD5 8388608 2.9489006996154785 15666.0
MD5 16777216 2.970426073074341 15666.0
MD5 33554432 3.6108012914657595 15666.0
MD5 67108864 3.5666714572906493 15666.0
Blake2 4096 3.3631218385696413 15666.0
Blake2 8192 3.1597275495529176 15666.0
Blake2 16384 3.0446608781814577 15666.0
Blake2 32768 3.0090284061431887 15666.0
Blake2 65536 2.9616037464141844 15666.0
Blake2 131072 2.9736635446548463 15666.0
Blake2 262144 2.959379949569702 15666.0
Blake2 524288 2.972061824798584 15666.0
Blake2 1048576 2.945613770484924 15666.0
Blake2 2097152 2.956157364845276 15666.0
Blake2 4194304 2.9658774614334105 15666.0
Blake2 8388608 2.970878791809082 15666.0
Blake2 16777216 3.064859108924866 15666.0
Blake2 33554432 3.8108808422088623 15666.0
Blake2 67108864 3.7806590032577514 15666.0
Are my results I/O constrained? Why would speed be so similar across the different algorithms and block sizes?
cat /proc/cpuinfo
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: Micron_5100_MTFDDAK960TBY
Upvotes: 2
Views: 1010
Reputation: 18530
Your code is fine. This happens because the vast majority of time is not actually spent hashing. What you're seeing is almost entirely overhead:
Reading the actual data from disk, while comparatively fast on a solid-state drive, is still much slower than processing it in RAM. On top of that, iterating over all of those directories and files involves many, many lookups, even if all of the data is in the operating system's filesystem caches after the first run (which seems unlikely).
Most likely the main overhead comes from Python itself: it is an interpreted language, orders of magnitude slower at this kind of work than a more systems-level language. Not only is the bytecode being interpreted, but there are also many object updates and plenty of memory allocation, re-allocation and de-allocation, all of which happens under the hood with little way for you to influence it.
If you time your runs outside of the Python code, you get additional massive overhead from the start-up phase of the Python interpreter.
You'd probably get slightly more relevant results if you tested this on a single large file, simply because this eliminates a fair proportion of overhead.
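As a rough sketch (the path /tmp/bigfile is just a placeholder you would have to create yourself, and the helper names are made up), something like this times each algorithm against one large file and against the same data already held in memory, which separates I/O and per-call overhead from the actual hashing:

import hashlib
import time

def timeSingleFile(path, algorithm, blockSize=65536):
    # Hash one large file from disk and report elapsed wall-clock time.
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blockSize), b""):
            h.update(block)
    return time.perf_counter() - start, h.hexdigest()

def timeInMemory(data, algorithm):
    # Hash a buffer that is already in memory: no I/O, no per-file overhead.
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    h.update(data)
    return time.perf_counter() - start

# "/tmp/bigfile" is a placeholder; create it for example with:
#   dd if=/dev/urandom of=/tmp/bigfile bs=1M count=1024
path = "/tmp/bigfile"
data = open(path, "rb").read()

for algorithm in ("md5", "blake2b"):
    fileTime, _ = timeSingleFile(path, algorithm)
    memTime = timeInMemory(data, algorithm)
    print(algorithm, "file:", fileTime, "in-memory:", memTime)

If the in-memory times for MD5 and BLAKE2b differ noticeably while the whole-tree times in your tables do not, that confirms the per-file and interpreter overhead, not the hash itself, is dominating your measurements.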
Upvotes: 2