Overriding __eq__ and __hash__ without downgrading performace

Question

I have two classes Structure and Fragment that represent chemical structures. If I compare two instances of the same class, the comparison should behave normally, but if I compare a structure with a fragment the comparison should be based on a string attribute that uniquely identify the chemical structure.

Unfortunately, in my attempt the run time increases dramatically (operations with sets and dictionaries).

class Structure:

def __eq__(self, other):
    if isinstance(other, Structure):
        return self is other
    else:
        return self.cansmiles is other.cansmiles_unstarred

def __hash__(self):
    return hash(self.cansmiles)

Fragment eq and hash are implemented with the same logic.

The "cansmiles" string can be pretty long (usually 25, up to 100), I tried to slice it without luck.

Questions are: what is so magic in built in hash and eq to be so efficient? Is there a way to override them efficiently? Any suggestions?

UPDATE: Can I avoid computing the hash value every time __hash__ is called?

Aleksi Torhamo · Accepted Answer

Like mentioned in the comments, you have a bigger problem: Your code has bugs. While is might be slightly faster, it also does the wrong thing.

While some quick tests might lead you to believe it works - due to Python optimizing some simple cases - these examples show that it doesn't work in general:

def f():
    return 'not always identical!'

a = 'not always identical!'
b = f()

# Will print: True False -- that is, the strings are equal, but not the same string object
print(a == b, a is b)

x = 'a'
a = x*2
b = 'aa'

# Will print: True False -- that is, the strings are equal, but not the same string object
print(a == b, a is b)

As for your __eq__() implementation, it'll break if you try to compare anything other than Structures and Fragments (since other objects will probably not have cansmiles/cansmiles_unstarred attributes, giving an AttributeError - or if they do, the code will do the wrong thing, since you probably don't want to compare equal to an random unknown object).

Now, think about this part of your code:

if isinstance(other, Structure):
    return self is other

Since self is a Structure, if self is the same object as other, other is obviously a Structure, too - so the if is kind of unnecessary. If you look at both of these problems, you'll realize the whole thing is just the wrong way around: You don't need to check the type for the is, but you do need to check the type for the string check! Thus:

def __eq__(self, other):
    if isinstance(other, Fragment):
        return self.cansmiles == other.cansmiles_unstarred
    else:
        return self is other

Now, as for how this change will affect the speed of your dictionary lookups: It won't. Python optimizes dictionary lookups by first checking with is, before falling back to ==. So if you do a lookup with the same type as you used to save the data, since your equality inside the same type is based on identity, the only time __eq__() will actually get called is if you have a hash collision (rare).

If you do a lookup with the other type though, the object won't be the same, so it'll fall back to == and call the __eq__() - but the change from is to == still won't affect the speed: == might be slightly slower than is, but function calls are an order of magnitude slower, so the time taken by isinstance() will dwarf the time taken by either comparison.

So why are the built-in __eq__() and __hash__() so much faster? Well, why is sum(l) faster than for v in l: t += v? Because running Python bytecode is way slower than running native code. In the comments, you said that is only checks a pointer - but that isn't really true, is it? You have to fetch the next opcode and its argument from the bytecode, select the correct handler, jump to the handler and so on.

If you say def __eq__(self, other): return self is other, and call that function, instead of doing "just a pointer comparison", you have to set up an execution frame, populate a dictionary for the local scope, do two lookups to the dictionary to get the values based on the names, push them in the stack, pop the values from the stack, do the comparison, push the result on the stack, pop the result from the stack, and finally return the value - and each step includes fetching and decoding the bytecode, jumping around, and so on. So yeah, that's going to be slower than the built-in __eq__() which is probably actually very close to "just a pointer comparison".

As for avoiding recalculating the hash: You already do - you're hashing strings, and strings cache their hashes, so they aren't being recalculated. The built-in __hash__() is based on the object's id, so it should be constant-time, so it'll obviously be faster than calculating the hash of a string, which should be linear-time. But even if the string hash was being recalculated, that's being done in native code, so the speed difference would be tiny - and dwarfed by the fact that you're switching from native code to running python bytecode when you define a custom one.

You can of course cache the value returned by hash() yourself - you'll need to add an if to check if the hash has already been calculated, but since calling functions is relatively slow, even with the added code the result will probably still be slightly faster. But you'd be adding code, and reducing the readability, for very small gains in speed. If you want speed, why don't you just simply use the strings as keys directly? Or if speed is really important, you might want to take a look at PyPy (or even Numba).

Overriding eq and hash without downgrading performace

Answers (1)

Related Questions