Reputation: 26986
IIUC, Python's hash of functions (e.g. for use as keys in a dict) is not stable across runs.
Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable.)
Upvotes: 3
Views: 1062
Reputation: 41
I stumbled upon "hash() is not stable across runs" today. I am now using:
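import hashlib
from operator import xor
from struct import unpack

def stable_hash(a_string):
    # SHA-256 is deterministic, unlike the built-in hash() of str,
    # which is salted per process.
    sha256 = hashlib.sha256()
    sha256.update(bytes(a_string, "UTF-8"))
    digest = sha256.digest()
    h = 0
    # Fold the 32-byte digest into one signed 64-bit integer by
    # XOR-ing its four 8-byte chunks.
    for index in range(0, len(digest) >> 3):
        index8 = index << 3
        bytes8 = digest[index8 : index8 + 8]
        # '<q' pins the byte order (little-endian) so the result does
        # not depend on the machine; plain 'q' uses native byte order.
        i = unpack('<q', bytes8)[0]
        h = xor(h, i)
    return h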
It's for string arguments. To use it e.g. for a dict you would pass str(tuple(sorted(a_dict.items()))) or something like that as argument. The "sorted" is important in this case to get a "canonical" representation.
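As a rough sketch of that canonicalization step: the helper name stable_dict_hash is made up for illustration, and it returns a SHA-256 hex digest rather than the folded 64-bit integer, only to keep the example self-contained. It assumes the dict keys are mutually comparable and that str() of the keys and values is itself stable across runs.

```python
import hashlib


def stable_dict_hash(a_dict):
    """Hypothetical helper: hash a dict independently of insertion order."""
    # Sorting the (key, value) pairs yields one canonical string for
    # dicts that compare equal, whatever order they were built in.
    canonical = str(tuple(sorted(a_dict.items())))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = stable_dict_hash({"a": 1, "b": 2})
second = stable_dict_hash({"b": 2, "a": 1})
print(first == second)  # True: insertion order does not matter
```

Values whose str() embeds a memory address (the default repr of most objects, including functions) would break this, so it only suits plain data such as strings, numbers and tuples of those.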
Upvotes: 0
Reputation: 35247
I'm the dill author. I've written a package called klepto, which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.
See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.
klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.
Upvotes: 2
Reputation: 1065
As you mentioned, id will almost never be the same across different processes, and certainly not across different machines. As per the docs:
id(object): Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
This means that id should be different, because the objects created by every instance of your script reside in different places in memory and are not the same object. id defines the identity; it's not a checksum of a block of code.
The only thing that will be consistent over different instances of your script being executed is the name of the function.
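If the name is all you need, one possible sketch is to hash the fully qualified name. The helper name_hash below is hypothetical, and it assumes that module plus qualified name is unique enough for your purposes.

```python
import hashlib


def name_hash(func):
    """Hypothetical helper: hash a function's module and qualified name."""
    # __module__ and __qualname__ are plain strings, so the digest is
    # the same in every process and on every machine.
    qualified_name = f"{func.__module__}.{func.__qualname__}"
    return hashlib.sha256(qualified_name.encode("utf-8")).hexdigest()
```

Note the limits: the hash does not change when the function body changes, functions defined in the script being run report their module as "__main__", and all lambdas share the name "<lambda>".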
One other approach that you could use to have a deterministic way to identify a block of code inside your script would be to calculate a checksum of the actual text. But controlling the contents of your methods should rather be handled by a versioning system like git. It is likely that if you need to calculate a hash sum of your code or a piece of it, you are doing something suboptimally.
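A minimal sketch of the checksum-of-the-text idea, using only the standard library. The helper names text_checksum and function_checksum are made up for illustration, and the approach assumes the function's source file is available at runtime.

```python
import hashlib
import inspect
import textwrap


def text_checksum(source_text):
    """Hypothetical helper: SHA-256 of source text, ignoring outer indentation."""
    # dedent + strip so the same function hashes identically whether it
    # is defined at module level or nested inside a class.
    normalized = textwrap.dedent(source_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def function_checksum(func):
    """Hypothetical helper: checksum of a function's source code."""
    # inspect.getsource reads the source file, so it raises OSError when
    # the source is unavailable (e.g. code typed into a plain REPL) and
    # TypeError for built-in functions.
    return text_checksum(inspect.getsource(func))
```

Calling function_checksum(my_function) from a script on disk gives the same value on every run and every machine as long as the text of the function is unchanged. Any edit, including a comment or whitespace change inside the body, produces a different checksum.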
Upvotes: 0