cardamom

Reputation: 7431

get murmur hash of a file with Python 3

The documentation for the Python library murmur is a bit sparse.

I have been trying to adapt the code from this answer:

import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))

From what I read in the answer, MD5 can't operate on the whole file at once, so it needs this looping. I'm not sure exactly what happens on the line d.update(buf), though.
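(As far as I can tell from experimenting, update() just feeds more data into the running hash, so looping over chunks gives the same digest as hashing everything in one call:)

```python
import hashlib

# Hashing all the data in one call...
one_go = hashlib.md5(b'hello world').hexdigest()

# ...gives the same digest as feeding it in pieces via update()
d = hashlib.md5()
d.update(b'hello ')
d.update(b'world')

assert d.hexdigest() == one_go
```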

The public methods and attributes of hashlib.md5() are:

 'block_size',
 'copy',
 'digest',
 'digest_size',
 'hexdigest',
 'name',
 'update'

whereas mmh3 has

'hash',
'hash64',
'hash_bytes'

There are no update or hexdigest methods.

Does anyone know how to achieve a similar result?

The motivation is testing for uniqueness as fast as possible; the results here suggest that murmur is a good candidate.

Update -

Following the comment from @Bakuriu I had a look at mmh3 which seems to be better documented.

The public methods inside it are:

import mmh3
print([x for x in dir(mmh3) if x[0]!='_'])
>>> ['hash', 'hash128', 'hash64', 'hash_bytes', 'hash_from_buffer']

…so there is no "update" method. I had a look at the source code for mmh3.hash_from_buffer, but it does not appear to contain a loop, and since it is not written in Python I can't really follow it. Here is a link to the line

So for now I will just use CRC-32, which is supposed to be almost as good for this purpose, and it is well documented how to do it. If anyone posts a solution, I will test it out.
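For reference, the chunked CRC-32 version looks something like this: zlib.crc32 takes a running value as its second argument, which plays the role of the missing update() method:

```python
import zlib
from functools import partial

def crc32sum(filename, chunk_size=65536):
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with open(filename, mode='rb') as f:
        for buf in iter(partial(f.read, chunk_size), b''):
            crc = zlib.crc32(buf, crc)  # feed the previous value back in
    return crc & 0xFFFFFFFF  # force an unsigned 32-bit result

# usage mirrors the md5sum example: crc32sum('utils.py')
```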

Upvotes: 3

Views: 2842

Answers (1)

pscheid

Reputation: 510

To hash a file using murmur, one has to load it completely into memory and hash it in one go.

import mmh3

with open('main.py', 'rb') as file:  # open in binary mode so we hash the raw bytes
    data = file.read()

hash = mmh3.hash_bytes(data, 0xBEFFE)
print(hash.hex())

If your file is too large to fit into memory, you could use incremental/progressive hashing: add your data in multiple chunks and hash them on the fly (like your example above).

Is there a Python library for progressive hashing with murmur?
I tried to find one, but it seems there is none.

Is progressive hashing even possible with murmur?
There is a working implementation in C:

Upvotes: 1
