Andrey Valentsov

Reputation: 41

How to compute several hashes at the same time?

I want to compute multiple hashes of the same file and save time by multiprocessing.

From what I see, reading a file from an SSD is relatively fast, but computing a hash is almost 4 times slower. If I want to compute 2 different hashes (MD5 and SHA), it's 8 times slower. I'd like to be able to compute different hashes on different processor cores in parallel (up to 4, depending on the settings), but I don't understand how I can get around the GIL.

Here is my current code (hash.py):

import hashlib
from io import DEFAULT_BUFFER_SIZE

file = 'test/file.mov'  # 50 MB file

def hash_md5(file):
    md5 = hashlib.md5()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest()

def hash_sha(file):
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return sha.hexdigest()

def hash_md5_sha(file):
    md5 = hashlib.md5()
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest(), sha.hexdigest()

def read_file(file):
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return

I did some tests and here are the results:

>>> from hash import *
>>> from timeit import timeit
>>> timeit(stmt='read_file(file)', globals=globals(), number=100)
1.6323043460000122
>>> timeit(stmt='hash_md5(file)',globals=globals(),number = 100)
8.137973076999998
>>> timeit(stmt='hash_sha(file)',globals=globals(),number = 100)
7.1260356809999905
>>> timeit(stmt='hash_md5_sha(file)',globals=globals(),number = 100)
13.740918666999988

This will end up as a function: the main script will iterate through a file list and needs to check different hashes for different files (from 1 to 4 hashes per file). Any ideas how I can achieve that?

Upvotes: 3

Views: 1794

Answers (2)

Capitan Harlock

Reputation: 11

You can organize your code in this way:

from hashlib import md5, sha1, sha256
from io import DEFAULT_BUFFER_SIZE
 
def hasher(filename, hash_obj):
    # how to use:
    # print(f'{hasher(filename, md5())}')
    # print(f'{hasher(filename, sha1())}')
    # print(f'{hasher(filename, sha256())}')
    with open(filename, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            hash_obj.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return hash_obj.hexdigest()
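One way to run several such single-algorithm hashers at once without separate processes: CPython's hashlib releases the GIL inside update() for large buffers, so plain threads can overlap the digest work. A self-contained sketch (the throwaway temp file and the thread-per-algorithm layout are my assumptions for illustration; each thread re-reads the file, trading extra I/O for simplicity):

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from io import DEFAULT_BUFFER_SIZE

def hasher(filename, hash_obj):
    # feed the file to a single hash object, chunk by chunk
    with open(filename, mode='rb') as fl:
        while chunk := fl.read(DEFAULT_BUFFER_SIZE):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

# throwaway file just for demonstration
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'abc' * 500_000)
    path = tmp.name

# one thread per algorithm; hashlib drops the GIL during update()
# for buffers over ~2 KiB, so the digests can run concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    digests = list(pool.map(lambda h: hasher(path, h),
                            [hashlib.md5(), hashlib.sha1(), hashlib.sha256()]))
os.remove(path)
```

Whether this beats one process per hash depends on file size and disk speed; it avoids pickling and process startup, but the per-thread re-read costs extra I/O.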

And to compute several hashes in a single pass over the file:

def multi_hash(filename, hash_objs):
    # how to use : print(f'{multi_hash(filename, [md5(), sha1(), sha256()])}')
    with open(filename, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            for h in hash_objs:
                h.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return [h.hexdigest() for h in hash_objs]
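For reference, a quick self-contained check (using a throwaway temp file, an assumption for illustration) that the one-pass approach produces the same digests as hashing the data directly:

```python
import hashlib
import os
import tempfile
from io import DEFAULT_BUFFER_SIZE

def multi_hash(filename, hash_objs):
    # single pass: each chunk is read once and fed to every hash object
    with open(filename, mode='rb') as fl:
        while chunk := fl.read(DEFAULT_BUFFER_SIZE):
            for h in hash_objs:
                h.update(chunk)
    return [h.hexdigest() for h in hash_objs]

data = b'hello world' * 100_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

md5_hex, sha1_hex = multi_hash(path, [hashlib.md5(), hashlib.sha1()])
os.remove(path)
```

Note this saves I/O (one read instead of one per algorithm) but the digest computations still run sequentially in one core.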

Upvotes: 0

Mihail Feraru

Reputation: 1469

As someone stated in the comments, you could use concurrent.futures. I've done a few benchmarks, and the most efficient way to do it was using ProcessPoolExecutor. Here is an example:

from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(4)
executor.map(hash_function, files)
executor.shutdown()

If you want to take a look at my benchmarks, you can find them here; these are the results:

Total using read_file: 10.121980099997018
Total using hash_md5_sha: 40.49621040000693
Total (multi-thread) using read_file: 6.246223400000417
Total (multi-thread) using hash_md5_sha: 19.588415799999893
Total (multi-core) using read_file: 4.099713300000076
Total (multi-core) using hash_md5_sha: 14.448464199999762

I used 40 files of 300 MiB each for testing.

Upvotes: 0
