Reputation: 52099
In Python 2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. Mockup code looks like this:
class PersistentDict(object):
    def __setitem__(self, key, value):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        self._store_to_bucket(bucket_index, key, value)

    def __getitem__(self, key):
        bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
        return self._fetch_from_bucket(bucket_index)[key]
In Python 3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations, so I need an alternative: a hash function that, like the hash() used for dict/set keys, works on almost any object. I've already tried hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh), which work fine with strings/bytes. However, they only work on bytes-like objects, whereas hash() works with almost everything.
[1] Using hash() to identify buckets breaks as soon as PersistentDicts were created with different salts.

Upvotes: 4
Views: 2532
Reputation: 68310
We use this code for our workflow manager Sisyphus, where we store the outputs of jobs on disk under hashes derived from their input parameters; thus we hash the input parameters.
Specifically:
import hashlib
from inspect import isclass, isfunction

def short_hash(obj,
               length=12,
               chars='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'):
    """
    :param object obj:
    :param int length:
    :param str|T chars:
    :rtype: str|T
    """
    h = hashlib.sha256(sis_hash_helper(obj)).digest()
    h = int.from_bytes(h, byteorder='big', signed=False)
    ls = []
    for i in range(length):
        ls.append(chars[int(h % len(chars))])
        h = h // len(chars)
    return ''.join(ls)
def get_object_state(obj):
    """
    Export current object status.
    Comment: Maybe obj.__reduce__() is a better idea? Is it stable for hashing?
    """
    if hasattr(obj, '__getnewargs_ex__'):
        args = obj.__getnewargs_ex__()
    elif hasattr(obj, '__getnewargs__'):
        args = obj.__getnewargs__()
    else:
        args = None
    if hasattr(obj, '__sis_state__'):
        state = obj.__sis_state__()
    elif hasattr(obj, '__getstate__'):
        state = obj.__getstate__()
    elif hasattr(obj, '__dict__'):
        state = obj.__dict__
    elif hasattr(obj, '__slots__'):
        state = {k: getattr(obj, k) for k in obj.__slots__ if hasattr(obj, k)}
    else:
        assert args is not None, "Failed to get object state of: %s" % repr(obj)
        state = None
    if args is None:
        return state
    else:
        return args, state
def sis_hash_helper(obj):
    """
    Takes most objects and tries to convert the current state into bytes.
    :param object obj:
    :rtype: bytes
    """
    # Store the type to ensure it's unique
    byte_list = [type(obj).__qualname__.encode()]
    # Using type and not isinstance to avoid derived types
    if isinstance(obj, bytes):
        byte_list.append(obj)
    elif obj is None:
        pass
    elif type(obj) in (int, float, bool, str, complex):
        byte_list.append(repr(obj).encode())
    elif type(obj) in (list, tuple):
        byte_list += map(sis_hash_helper, obj)
    elif type(obj) in (set, frozenset):
        byte_list += sorted(map(sis_hash_helper, obj))
    elif isinstance(obj, dict):
        # Sort the items to ensure they are always in the same order
        byte_list += sorted(map(sis_hash_helper, obj.items()))
    elif isfunction(obj):
        # Handle functions.
        # Not a nice way to check whether the given function is a lambda,
        # but the best I found:
        # assert not isinstance(lambda m: m, LambdaType) is true for all functions
        assert obj.__name__ != '<lambda>', "Hashing of lambda functions is not supported"
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif isclass(obj):
        byte_list.append(sis_hash_helper((obj.__module__, obj.__qualname__)))
    elif hasattr(obj, '_sis_hash'):
        # sis job or path object
        return obj._sis_hash()
    else:
        byte_list.append(sis_hash_helper(get_object_state(obj)))
    byte_str = b'(' + b', '.join(byte_list) + b')'
    if len(byte_str) > 4096:
        # Hash long outputs to avoid arbitrarily long return values. 4096 is just
        # picked because it looked good and is not optimized;
        # it's most likely not that important.
        return hashlib.sha256(byte_str).digest()
    else:
        return byte_str
Upvotes: 0
Reputation: 1427
A good alternative is xxhash
. It provides fast integer hashes (as Python's hash), but they are consistent across multiple runs and machines:
import xxhash
print(xxhash.xxh32_intdigest('hash me please'))
print(xxhash.xxh64_intdigest('hash me please'))
print(xxhash.xxh128_intdigest('hash me please'))
obj = [0, 1, 2]
print(xxhash.xxh64_intdigest(str(obj)))
print(xxhash.xxh64_intdigest(str(obj), seed=0))
print(xxhash.xxh64_intdigest(str(obj), seed=42))
'''
Prints:
465606393
6686454294630346756
110192986912562192471431245034848549222
11514819435353980464
11514819435353980464
11772420285327955252
'''
Since it hashes strings, for arbitrary objects you may hash str(your_object) or repr(your_object). You can also set an optional seed parameter, which defaults to zero, as shown in the example.
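One caveat with hashing str(your_object): for dicts, str() reflects insertion order, so equal dicts can produce different hashes. Canonicalizing first (e.g. json.dumps with sort_keys=True) avoids this. The sketch below uses zlib.crc32 from the stdlib as a stand-in for an xxhash digest; the same point applies to the xxhash calls above:

```python
import json
import zlib

def canonical_digest(obj, seed=0):
    # json.dumps(sort_keys=True) yields the same string regardless
    # of dict insertion order; crc32 stands in for an xxhash digest
    data = json.dumps(obj, sort_keys=True).encode()
    return zlib.crc32(data, seed) & 0xffffffff

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}          # same mapping, different insertion order
assert str(a) != str(b)       # hashing str(obj) would disagree here
assert canonical_digest(a) == canonical_digest(b)
```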
Yet another alternative is pyhash. It works fine on Linux, but after some bad experiences installing it on Windows machines, I recommend xxhash, which works well on all platforms and was even faster than pyhash in the small examples I tested.
Upvotes: 1
Reputation: 52099
I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:
import zlib
import datetime

datetime_type = datetime.datetime

def hashkey(obj, salt=0):
    """
    Create a key suitable for use in hashmaps

    :param obj: object for which to create a key
    :type obj: str, bytes, :py:class:`datetime.datetime`, object
    :param salt: an optional salt to add to the key value
    :type salt: int
    :return: numeric key to `obj`
    :rtype: int
    """
    if obj is None:
        return 0
    if isinstance(obj, str):
        return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif isinstance(obj, bytes):
        return zlib.adler32(obj, salt) & 0xffffffff
    elif isinstance(obj, datetime_type):
        return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff
With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead; 0.8 usec for bytes and 0.7 usec for str.
Overhead is roughly as follows:
- the function call itself (hash(obj) vs def pyhash(obj): return hash(obj))
- the isinstance checks
- zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~0.75 usec (adler and crc are +/- 4 usec)
- obj.encode() of str objects ("foobar")
- str(obj).encode() of datetime.datetime objects

Most of the optimization comes from the ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:
def hashkey_c(obj, salt=0):
    if obj.__class__ in hashkey_c.types:
        if obj is None:
            return 0
        if obj.__class__ is str:
            return zlib.adler32(obj.encode(), salt) & 0xffffffff
        elif obj.__class__ is bytes:
            return zlib.adler32(obj, salt) & 0xffffffff
        elif obj.__class__ is datetime_type:
            return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
    return hash(obj) & 0xffffffff

hashkey_c.types = {str, bytes, datetime_type, type(None)}
Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for plain objects, ints, etc. Using a dict to map each type to its hash function is comparable in speed, provided one checks against an explicit set of the dict's keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).
Some additional notes:
- hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
- neither is the hash of a container with its own __hash__ (e.g. a tuple) containing a salted type, e.g. (1, 2, 'three')
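The instability across interpreter starts can be observed directly by comparing hash() in fresh interpreters started with different PYTHONHASHSEED values (a quick demo, not part of the answer's code):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    # hash() of a str in a new interpreter, with an explicit hash salt
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, '-c', "print(hash('foo'))"], env=env)
    return int(out)

# same seed: reproducible; different seed: a different salted hash
assert hash_in_fresh_interpreter(1) == hash_in_fresh_interpreter(1)
assert hash_in_fresh_interpreter(1) != hash_in_fresh_interpreter(2)
```

Setting PYTHONHASHSEED to a fixed value does make hash() reproducible, but it must be set before the interpreter starts, which is exactly why it can't serve as a per-invocation salt.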
Upvotes: 1