Reputation: 52099
In Python 2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. Mockup code looks like this:
class PersistentDict(object):
    def __setitem__(self, key, value):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        self._store_to_bucket(bucket_index, key, value)

    def __getitem__(self, key):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        return self._fetch_from_bucket(bucket_index)[key]
In Python 3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations, so I need an alternative: a hash function that, like the hash() used for dict/set keys, works on almost any object. I've already tried hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh), which work fine with strings/bytes. However, they only work on bytes-like objects, whereas hash() works with almost everything.
[1] Using hash() to identify buckets breaks as soon as PersistentDicts were created with different salts.

Upvotes: 4
Views: 2532
Reputation: 68310
We use this code for our workflow manager Sisyphus, where we store the outputs of jobs on disk under hashes derived from their input parameters; thus we hash the input parameters.
Specifically:
import hashlib
from inspect import isclass, isfunction

def short_hash(obj,
               length=12,
               chars='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'):
    """
    :param object obj:
    :param int length:
    :param str|T chars:
    :rtype: str|T
    """
    h = hashlib.sha256(sis_hash_helper(obj)).digest()
    h = int.from_bytes(h, byteorder='big', signed=False)
    ls = []
    for i in range(length):
        ls.append(chars[int(h % len(chars))])
        h = h // len(chars)
    return ''.join(ls)
def get_object_state(obj):
    """
    Export current object status.
    Comment: Maybe obj.__reduce__() is a better idea? Is it stable for hashing?
    """
    if hasattr(obj, '__getnewargs_ex__'):
        args = obj.__getnewargs_ex__()
    elif hasattr(obj, '__getnewargs__'):
        args = obj.__getnewargs__()
    else:
        args = None
    if hasattr(obj, '__sis_state__'):
        state = obj.__sis_state__()
    elif hasattr(obj, '__getstate__'):
        state = obj.__getstate__()
    elif hasattr(obj, '__dict__'):
        state = obj.__dict__
    elif hasattr(obj, '__slots__'):
        state = {k: getattr(obj, k) for k in obj.__slots__ if hasattr(obj, k)}
    else:
        assert args is not None, "Failed to get object state of: %s" % repr(obj)
        state = None
    if args is None:
        return state
    else:
        return args, state
def sis_hash_helper(obj):
    """
    Takes most objects and tries to convert the current state into bytes.
    :param object obj:
    :rtype: bytes
    """
    # Store the type to ensure it's unique
    byte_list = [type(obj).__qualname__.encode()]
    # Using type and not isinstance to avoid derived types
    if isinstance(obj, bytes):
        byte_list.append(obj)
    elif obj is None:
        pass
    elif type(obj) in (int, float, bool, str, complex):
        byte_list.append(repr(obj).encode())
    elif type(obj) in (list, tuple):
        byte_list += map(sis_hash_helper, obj)
    elif type(obj) in (set, frozenset):
        byte_list += sorted(map(sis_hash_helper, obj))
    elif isinstance(obj, dict):
        # Sort the items to ensure they are always in the same order
        byte_list += sorted(map(sis_hash_helper, obj.items()))
    elif isfunction(obj):
        # Handle functions.
        # Not a nice way to check whether the given function is a lambda,
        # but the best I found:
        # assert not isinstance(lambda m: m, LambdaType) is true for all functions
        assert obj.__name__ != '<lambda>', "Hashing of lambda functions is not supported"
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif isclass(obj):
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif hasattr(obj, '_sis_hash'):
        # sis job or path object
        return obj._sis_hash()
    else:
        byte_list.append(sis_hash_helper(get_object_state(obj)))
    byte_str = b'(' + b', '.join(byte_list) + b')'
    if len(byte_str) > 4096:
        # Hash long outputs to avoid arbitrarily long return values. 4096 is just
        # picked because it looked good and is not optimized;
        # it's most likely not that important.
        return hashlib.sha256(byte_str).digest()
    else:
        return byte_str
Upvotes: 0
Reputation: 1427
A good alternative is xxhash
. It provides fast integer hashes (as Python's hash), but they are consistent across multiple runs and machines:
import xxhash
print(xxhash.xxh32_intdigest('hash me please'))
print(xxhash.xxh64_intdigest('hash me please'))
print(xxhash.xxh128_intdigest('hash me please'))
obj = [0, 1, 2]
print(xxhash.xxh64_intdigest(str(obj)))
print(xxhash.xxh64_intdigest(str(obj), seed=0))
print(xxhash.xxh64_intdigest(str(obj), seed=42))
'''
Prints:
465606393
6686454294630346756
110192986912562192471431245034848549222
11514819435353980464
11514819435353980464
11772420285327955252
'''
Since it hashes strings, for arbitrary objects you may hash str(your_object) or repr(your_object). You can also set an optional seed parameter, which defaults to zero, as shown in the example.
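One caveat with hashing str(your_object): for dicts, str() reflects insertion order, so equal dicts can produce different hashes. Canonicalizing first (e.g. json.dumps with sort_keys=True) avoids this. The sketch below uses zlib.crc32 from the stdlib as a stand-in for an xxhash digest; the same point applies to the xxhash calls above:

```python
import json
import zlib

def canonical_digest(obj, seed=0):
    # json.dumps(sort_keys=True) yields the same string regardless
    # of dict insertion order; crc32 stands in for an xxhash digest
    data = json.dumps(obj, sort_keys=True).encode()
    return zlib.crc32(data, seed) & 0xffffffff

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}          # same mapping, different insertion order
assert str(a) != str(b)       # hashing str(obj) would disagree here
assert canonical_digest(a) == canonical_digest(b)
```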
Yet another alternative is pyhash. It works fine on Linux, but after some bad experiences installing it on Windows machines, I recommend xxhash, which works well on all platforms and was even faster than pyhash in the small examples I tested.
Upvotes: 1
Reputation: 52099
I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:
import zlib
import datetime

datetime_type = datetime.datetime

def hashkey(obj, salt=0):
    """
    Create a key suitable for use in hashmaps

    :param obj: object for which to create a key
    :type obj: str, bytes, :py:class:`datetime.datetime`, object
    :param salt: an optional salt to add to the key value
    :type salt: int
    :return: numeric key to `obj`
    :rtype: int
    """
    if obj is None:
        return 0
    if isinstance(obj, str):
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif isinstance(obj, bytes):
        return zlib.adler32(obj, salt) & 0xffffffff
    elif isinstance(obj, datetime_type):
        return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff
With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead; 0.8 usec for bytes and 0.7 usec for str.
Overhead is roughly as follows:
- the function call itself (hash(obj) vs def pyhash(obj): return hash(obj))
- the isinstance checks
- zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~0.75 usec (adler and crc are +/- 4 usec)
- obj.encode() of str objects ("foobar")
- str(obj).encode() of datetime.datetime objects

Most of the optimization comes from the ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:
def hashkey_c(obj, salt=0):
    if obj.__class__ in hashkey_c.types:
        if obj is None:
            return 0
        if obj.__class__ is str:
            return zlib.adler32(obj.encode(), salt) & 0xffffffff
        elif obj.__class__ is bytes:
            return zlib.adler32(obj, salt) & 0xffffffff
        elif obj.__class__ is datetime_type:
            return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff

hashkey_c.types = {str, bytes, datetime_type, type(None)}
Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for plain objects, ints, etc. Using a dict to map each type to its hash function is comparable in speed, provided one checks against an explicit set of the dict's keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).
Some additional notes:
- hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
- neither is the hash of a container with its own __hash__ (e.g. a tuple) containing a salted type, e.g. (1, 2, 'three')
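The instability across interpreter starts can be observed directly by comparing hash() in fresh interpreters started with different PYTHONHASHSEED values (a quick demo, not part of the answer's code):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    # hash() of a str in a new interpreter, with an explicit hash salt
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, '-c', "print(hash('foo'))"], env=env)
    return int(out)

# same seed: reproducible; different seed: a different salted hash
assert hash_in_fresh_interpreter(1) == hash_in_fresh_interpreter(1)
assert hash_in_fresh_interpreter(1) != hash_in_fresh_interpreter(2)
```

Setting PYTHONHASHSEED to a fixed value does make hash() reproducible, but it must be set before the interpreter starts, which is exactly why it can't serve as a per-invocation salt.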
Upvotes: 1