Uri

Reputation: 26986

Getting a hash of a function that is stable across runs

IIUC, Python's built-in hash of a function (e.g. when it is used as a dict key) is not stable across runs.

Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable).

Upvotes: 3

Views: 1062

Answers (3)

amotzek

Reputation: 41

I stumbled over "hash() is not stable across runs" today. I am now using:

import hashlib
from operator import xor
from struct import unpack

def stable_hash(a_string):
    # SHA-256 does not depend on PYTHONHASHSEED, unlike the built-in hash().
    digest = hashlib.sha256(a_string.encode("utf-8")).digest()
    h = 0
    # Fold the 32-byte digest into one signed 64-bit integer, 8 bytes at a time.
    for index in range(0, len(digest), 8):
        # '<q' pins little-endian byte order so the result matches on every machine.
        i = unpack('<q', digest[index : index + 8])[0]
        h = xor(h, i)
    return h

It takes string arguments. To use it for a dict, for example, you would pass str(tuple(sorted(a_dict.items()))) or something similar as the argument. The "sorted" is important here: it gives a "canonical" representation that does not depend on insertion order.
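To illustrate the canonical-representation point, here is a minimal sketch. The helper name canonical_dict_hash is made up for this example, and it hashes with a plain SHA-256 hex digest rather than the 64-bit fold above. It assumes the keys can be sorted against each other and that every key and value has a repr that is itself stable across runs.

```python
import hashlib

def canonical_dict_hash(a_dict):
    # Hypothetical helper: sorting the items removes any dependence on
    # the order in which the dict was filled.
    canonical = str(tuple(sorted(a_dict.items())))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

first = {"b": 2, "a": 1}
second = {"a": 1, "b": 2}

# Same content, different insertion order, same digest.
print(canonical_dict_hash(first) == canonical_dict_hash(second))
```

Values whose default repr embeds a memory address (most plain objects) would break this, so it is only suitable for simple data such as strings and numbers.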

Upvotes: 0

Mike McKerns

Reputation: 35247

I'm the dill author. I've written a package called klepto which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.

See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.

klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.

Upvotes: 2

Artur

Reputation: 1065

As you mentioned, id will almost never be the same across different processes, and certainly not across different machines. As per the docs:

id(object): Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

This means that id will differ because the objects created by each run of your script live at different places in memory and are not the same object. id defines identity; it is not a checksum of a block of code.

The only thing that will be consistent over different instances of your script being executed is the name of the function.
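A rough sketch of using the name as a key, assuming the module path plus qualified name is unique enough for your purposes. function_key is a hypothetical helper, not something from the standard library.

```python
def function_key(func):
    # Hypothetical helper: module path plus qualified name, e.g. "builtins.len".
    # Nested functions get a "<locals>" segment in their qualified name.
    return f"{func.__module__}.{func.__qualname__}"

def greet():
    return "hello"

print(function_key(len))
print(function_key(greet))
```

Note that a function defined in the file you run directly reports its module as "__main__", so the key changes if the same file is later imported instead. The key also says nothing about whether the function body has changed.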

One other approach that you could use to have a deterministic way to identify a block of code inside your script would be to calculate a checksum of the actual text. But controlling the contents of your methods should rather be handled by a versioning system like git. It is likely that if you need to calculate a hash sum of your code or a piece of it, you are doing something suboptimally.
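For what it's worth, a checksum of the text could look roughly like the sketch below. Both helper names are invented for this example. It assumes the source file is available to inspect.getsource, and it only covers the function's own text, not anything the function calls or closes over.

```python
import hashlib
import inspect
import textwrap

def text_checksum(source_text):
    # Normalize line endings and leading indentation so the same code
    # hashes identically on Windows and Unix, nested or not.
    normalized = textwrap.dedent(source_text.replace("\r\n", "\n"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def function_source_hash(func):
    # inspect.getsource raises OSError when the source cannot be found
    # (e.g. code created with exec) and TypeError for built-ins.
    return text_checksum(inspect.getsource(func))
```

Any edit to the text, including comments and whitespace inside the body, changes the digest, so this detects "the text changed" rather than "the behavior changed".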

Upvotes: 0
