Ben Kovitz

Reputation: 5030

How can I get deterministic hash values for class objects?

I have an application running in Python 3.9.4 where I store class objects in sets (along with many other kinds of objects). I'm getting non-deterministic behavior even when PYTHONHASHSEED=0 because class objects get non-deterministic hash codes. I assume that's because class objects' hash codes come from their addresses in memory.

For example, here are two runs of a little test program, where Before and Equation are classes:

print(hash(Before), hash(Equation), hash(int))
304555224 304593057 271715397

print(hash(Before), hash(Equation), hash(int))
326601328 293027788 273337413

How can I get Python to generate deterministic hash values for class objects?

Is there a metaclass or something that I could monkey-patch so that all class objects, even int, get a hash function that I specify?

Upvotes: 2

Views: 744

Answers (1)

jsbueno

Reputation: 110561

Hashes for classes are deterministic within the same process. Yes, in CPython they are based on the memory address, but you can't simply "move" a class object to another memory address from Python code.

If you happen to use some serialization/de-serialization transforms with the classes, the de-serialized objects will ordinarily be new objects, distinct from the original ones, and therefore will hash differently.

For the record: I could not reproduce the behavior you describe in the question. Within the same process, the hashes of the class objects stay the same.

If you are calculating the hashes in different processes, though, they will differ. So, although you don't mention multiprocessing, I assume that is your use case.

Then, indeed, implementing proper __hash__ and __eq__ methods on the metaclass can give you stable hashing across processes. But you can't do that with built-in classes such as int: those are implemented in native code and can't be changed from the Python side. On the other hand, even though the hash values shown for these built-in classes differ between processes, whatever you are using to serialize/deserialize your classes (which is what Python does to communicate data across processes, even if you don't do any explicit de/serializing) should resolve them to the same built-in class object within each process.
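
As a quick check of that last point, here is a minimal sketch that assumes the serializer is the standard pickle module (which multiprocessing uses by default). pickle stores a class as a reference to its module and qualified name, so loading it looks up the class object that already exists in the process instead of building a new one. Decimal is only used here as a second importable class for illustration:

```python
import pickle
from decimal import Decimal

# A class is pickled as a reference (module + qualified name), not as a copy,
# so unpickling returns the class object already present in this process.
restored_int = pickle.loads(pickle.dumps(int))
restored_decimal = pickle.loads(pickle.dumps(Decimal))

same_int = restored_int is int
same_decimal = restored_decimal is Decimal
print(same_int, same_decimal)
```

So for classes that can be imported by name, identity is preserved inside each process even though the numeric hash differs from one process to the next.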

Which brings us to the next point: while it is straightforward to add __eq__ and __hash__ methods to a metaclass for your classes, it would be better to ensure that deserializing always yields the same object (with the same ID). Hash stability, as you put it, could help ensure you always have the same class, but it would depend on how you write your code: it is a bit tricky to retrieve the object that is already inside a set when a membership check succeeds for another instance that is equal to it. The most straightforward way is to build an identity dictionary out of the set and then use the value:

my_registry_dict = {element: element for element in my_registry_set}
my_class = my_registry_dict[incoming_class]
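
A self-contained sketch of that lookup follows. The Label class is hypothetical and only stands in for anything that compares equal without being the same object:

```python
class Label:
    """Hypothetical value type: equal when the names match."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, Label):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


original = Label("Before")
my_registry_set = {original, Label("Equation")}

# An equal but distinct object, e.g. one that just arrived from elsewhere.
incoming = Label("Before")
found = incoming in my_registry_set  # membership says yes, but gives no object back

# Map every element to itself so that a lookup with an equal key
# returns the object actually stored in the set.
my_registry_dict = {element: element for element in my_registry_set}
canonical = my_registry_dict[incoming]
print(found, canonical is original, canonical is incoming)
```

The dictionary costs one extra reference per element; if the registry is long-lived, it may be simpler to keep the dictionary instead of the set from the start.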

With this in mind, we can have a custom metaclass that not only adds __eq__ and __hash__ but also customizes the __new__ method, so that creating the same class a second time (for example, on de-serializing) always re-uses the first class object defined in the current process (i.e., it ensures the "singleton" behavior that Python classes enjoy outside corner cases like yours seems to be). You have to pick which elements of the classes you want to compare for equality; class.__qualname__ can be a simple and functional attribute to use.

class Meta(type):
    registry = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        if cls not in mcls.registry:
            mcls.registry[cls] = cls
        else:
            # reuse the previously created class
            cls = mcls.registry[cls]
        return cls

    def __hash__(cls):
        # When working with metaclasses, using the name `cls` instead of `self`
        # helps remind us that we are dealing with instances that are
        # actually classes.
        # Note: str hashes are only stable across processes when
        # PYTHONHASHSEED is fixed, as it is in the question.
        return hash(cls.__qualname__)

    def __eq__(cls, other):
        # Anything that is not a class cannot be equal to one.
        if not isinstance(other, type):
            return NotImplemented
        return cls.__qualname__ == other.__qualname__
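
For illustration, here is a hedged variant of the metaclass above together with a usage check. StableMeta is a hypothetical name, and the hashlib digest is an assumption of this sketch, swapped in so that the hash does not depend on PYTHONHASHSEED at all; the registry is keyed by qualified name instead of by the class itself:

```python
import hashlib


class StableMeta(type):
    """Hypothetical variant of Meta with a seed-independent hash."""

    registry = {}  # qualified name -> first class object created with that name

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        # Keep the first class registered under this qualified name and
        # return it for every later definition with the same name.
        return mcls.registry.setdefault(cls.__qualname__, cls)

    def __hash__(cls):
        # Derived from a digest of the qualified name, so the value does not
        # depend on memory addresses or on the string hash seed.
        digest = hashlib.sha256(cls.__qualname__.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def __eq__(cls, other):
        if not isinstance(other, type):
            return NotImplemented
        return cls.__qualname__ == other.__qualname__


class Before(metaclass=StableMeta):
    pass


first = Before


class Before(metaclass=StableMeta):  # the same name defined a second time
    pass


print(Before is first, hash(Before) == hash(first))
```

One caveat of comparing by __qualname__ alone: two unrelated classes with the same qualified name in different modules would be treated as the same class, so including __module__ in the key may be preferable in a larger code base.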


Upvotes: 1
