XPGeek
XPGeek

Reputation: 21

hashlib - MD5 and Blake2b hashing speed equal, is something wrong?

I wrote a simple test that does the following:

rootDirectories = ["/bin","/etc","/sbin","/lib","/lib32","/lib64"]

def walkDirectory(rootDirectory, ignoreList):
    try:
        directories = dict()
        for dirName, subdirList, fileList in os.walk(rootDirectory):
            if (ignoreList):
                for ignore in ignoreList:
                    if (ignore in fileList):
                        fileList.remove(ignore)
            directories[str(dirName)] = fileList
        return directories
    except Exception as e:
        print("Error while scanning directories and files")
def generateMd5(directory, blockSize): 
    try:
        hash = hashlib.md5()
        with open(directory,"rb") as f:
            for block in iter(lambda: f.read(blockSize),b""):
                hash.update(block)
                # OR - pass (I/O Only)
            return hash.hexdigest()
    except Exception as e:
        print("Error while taking the hash values" + str(e))

def generateBlake2(directory, blockSize):
    try:
        hash = hashlib.blake2b()
        with open(directory,"rb") as f:
            for block in iter(lambda: f.read(blockSize),b""):
                hash.update(block)
                # OR - pass (I/O Only)
            return hash.hexdigest()
    except Exception as e:
        print("Error while taking the hash values"+ str(e))

Blake2 is said to be ~40% faster on Intel CPUs. However, performance seems similar across a variety of block sizes.

Here's my test results, any averages were taken across 50 runs:

I/O ONLY - Empty hashing function (pass)
Hash:         Block Size (bytes):         Seconds (avg):           Files Hashed (avg):
MD5              4096                1.0658537864685058             15666.0
MD5              8192                0.7631869792938233             15666.0
MD5              16384                0.679033899307251             15666.0
MD5              32768                0.6130096673965454             15666.0
MD5              65536                0.5926639556884765             15666.0
MD5              131072                0.6072390222549439             15666.0
MD5              262144                0.6025748205184936             15666.0
MD5              524288                0.629586148262024             15666.0
MD5              1048576                0.6911558246612549             15666.0
MD5              2097152                0.7710119438171387             15666.0
MD5              4194304                0.7501423931121827             15666.0
MD5              8388608                0.7613127708435059             15666.0
MD5              16777216                0.8549494647979736             15666.0
MD5              33554432                1.370493221282959             15666.0
MD5              67108864                1.3940581130981444             15666.0
Blake2              4096              0.9557385349273682              15666.0
Blake2              8192              0.775569143295288              15666.0
Blake2              16384              0.6793924331665039              15666.0
Blake2              32768              0.6285490798950195              15666.0
Blake2              65536              0.6092999029159546              15666.0
Blake2              131072              0.6079203844070434              15666.0
Blake2              262144              0.6007542181015014              15666.0
Blake2              524288              0.5933445692062378              15666.0
Blake2              1048576              0.5961050319671631              15666.0
Blake2              2097152              0.6041217613220214              15666.0
Blake2              4194304              0.6066651153564453              15666.0
Blake2              8388608              0.6075587749481202              15666.0
Blake2              16777216              0.6483844709396362              15666.0
Blake2              33554432              1.3640736722946167              15666.0
Blake2              67108864              1.3651361989974975              15666.0


I/O + HASH Generation
Hash:         Block Size (bytes):         Seconds (avg):           Files Hashed (avg):
MD5              4096                3.174709644317627             15666.0
MD5              8192                2.9633057641983034             15666.0
MD5              16384                2.8538199186325075             15666.0
MD5              32768                2.8014142322540283             15666.0
MD5              65536                2.7751221895217895             15666.0
MD5              131072                2.7487613582611083             15666.0
MD5              262144                2.760332064628601             15666.0
MD5              524288                2.7834008264541628             15666.0
MD5              1048576                2.8470048809051516             15666.0
MD5              2097152                2.9602468061447142             15666.0
MD5              4194304                2.8882385206222536             15666.0
MD5              8388608                2.9489006996154785             15666.0
MD5              16777216                2.970426073074341             15666.0
MD5              33554432                3.6108012914657595             15666.0
MD5              67108864                3.5666714572906493             15666.0
Blake2              4096              3.3631218385696413              15666.0
Blake2              8192              3.1597275495529176              15666.0
Blake2              16384              3.0446608781814577              15666.0
Blake2              32768              3.0090284061431887              15666.0
Blake2              65536              2.9616037464141844              15666.0
Blake2              131072              2.9736635446548463              15666.0
Blake2              262144              2.959379949569702              15666.0
Blake2              524288              2.972061824798584              15666.0
Blake2              1048576              2.945613770484924              15666.0
Blake2              2097152              2.956157364845276              15666.0
Blake2              4194304              2.9658774614334105              15666.0
Blake2              8388608              2.970878791809082              15666.0
Blake2              16777216              3.064859108924866              15666.0
Blake2              33554432              3.8108808422088623              15666.0
Blake2              67108864              3.7806590032577514              15666.0

Are my results I/O constrained? Why would speed be so similar across the different algorithms and block sizes?

cat /proc/cpu

vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       Micron_5100_MTFDDAK960TBY

Upvotes: 2

Views: 1010

Answers (1)

Jan Krüger
Jan Krüger

Reputation: 18530

Your code is fine. This happens because the vast majority of time is not actually spent hashing. What you're seeing is almost entirely overhead:

  1. Reading the actual data from disk, while comparably fast if it's a solid state drive, is still much slower than processing it in RAM... especially because all of the iterating over directories and files involves many, many lookups, even if all of the data is in the operating system's filesystem caches after the first run (which seems unlikely).

  2. Most likely the main overhead is because this is Python, an interpreted language, orders of magnitude slower than doing this in a more systems-level language. Not only is code/bytecode getting interpreted, but also there are many object updates and plenty of memory allocation, re-allocation and de-allocation, all of which happens under the hood with no way for you to influence most of it.

  3. If you time your runs outside of the Python code, you get additional massive overhead from the start-up phase of the Python interpreter.

You'd probably get slightly more relevant results if you tested this on a single large file, simply because this eliminates a fair proportion of overhead.

Upvotes: 2

Related Questions