ebann

Reputation: 211

Calculate MD5 and SHA1 simultaneously on large file

I'm writing a Python program to find duplicate files. Comparing file size and MD5 alone is not 100% fool-proof, since it is possible for two different files to have the same size and MD5 (the probability of a random collision is about 2^-128, and MD5 collisions can even be constructed deliberately).

I was wondering, then: if I add another hash such as SHA1 (160-bit) or CRC32 (32-bit) to the mix, would that greatly improve the ability to identify unique files, i.e. comparing both the MD5 and SHA1 of a file for uniqueness? Is SHA1 or CRC32 preferable for this secondary check?

If so, how do I calculate both MD5 and SHA1/CRC32 simultaneously while iterating through 1 MB blocks of a very large file, so that I avoid reading the large file twice? This is what I have for MD5:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 1 MB chunks so the whole file is never held in memory.
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Currently, after finding identical MD5s, my code runs filecmp.cmp on the two files as a final check. This is obviously resource-intensive and not efficient.
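That final check is essentially the following (a sketch; the function name is illustrative, and I'm assuming shallow=False for a true byte-for-byte comparison):

import filecmp

def confirm_duplicate(path_a, path_b):
    # shallow=False compares file contents byte for byte instead of
    # just the os.stat() signatures; this is the expensive step.
    return filecmp.cmp(path_a, path_b, shallow=False)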

I'm using SQLite to store the hashes. I assume it is slower than in-memory Python lists, but it doesn't run out of memory when processing billions of files.
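For context, the storage side is roughly the following sketch (the table and column names here are illustrative, not my exact schema):

import sqlite3

conn = sqlite3.connect("hashes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)"
)

def record(path, size, md5_hex):
    # One row per file; duplicates are found later by grouping.
    conn.execute(
        "INSERT OR REPLACE INTO files (path, size, md5) VALUES (?, ?, ?)",
        (path, size, md5_hex),
    )
    conn.commit()

# Candidate duplicates: any (size, md5) pair shared by more than one path.
dupes = conn.execute(
    "SELECT size, md5, COUNT(*) FROM files GROUP BY size, md5 HAVING COUNT(*) > 1"
).fetchall()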

Upvotes: 2

Views: 1908

Answers (1)

Maciek

Reputation: 3234

You already have the hardest part done. You just have to feed the chunks you read to another hasher:

import hashlib

def calculate_hashes(fname):
    hash_md5 = hashlib.md5()
    hash_sha1 = hashlib.sha1()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            # Each 1 MB chunk is read from disk once and fed to both hashers.
            hash_md5.update(chunk)
            hash_sha1.update(chunk)
    return hash_md5.hexdigest(), hash_sha1.hexdigest()
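This keeps a single pass over the file, so the I/O cost is unchanged; the extra CPU time for SHA1 is usually far cheaper than reading the file a second time. Usage (the file name here is just an example):

md5_hex, sha1_hex = calculate_hashes("some_large_file.bin")
print(md5_hex, sha1_hex)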

Upvotes: 4
