S.B
S.B

Reputation: 16536

Membership testing with hash collision for sets in python

Here is my class which I intentionally made it so that hash collision occurs:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f'({self.name} - {self.age})'

    def __eq__(self, other):
        print('__eq__ called for', self)
        return False

    def __hash__(self):
        print('__hash__ called for', self)
        return 1


obj1 = Person('Soroush', 25)
obj2 = Person('John', 23)
print('\n--------Creation of set---------\n')
s = {obj1, obj2}
print(s)
print('\n-------Membership testing-------\n')
print(obj2 in s)

output:

--------Creation of set---------

__hash__ called for (Soroush - 25)
__hash__ called for (John - 23)
__eq__ called for (Soroush - 25)
{(Soroush - 25), (John - 23)}

-------Membership testing-------

__hash__ called for (John - 23)
__eq__ called for (Soroush - 25)
True

*My question is from "Membership testing" part of the output:

In obj2 in s statement, python first calculates the hash of obj2 (good, first line of output), then because obj1 and obj2 have the same hash, python is going to see if they are the same items or not so __eq__ of the obj1 is called(second line of the output). They are not ! I intentionally returned False for all instances. So python is continue checking the probe sequence. Here again the hashes are equal! why python doesn't call __eq__ of obj2 to see if those items are equal or not ?

I expected to see __eq__ called for (John - 23) in third line and also because obj2 is not equal to obj2 , last line of the output should be False. like:

print(obj2 == obj2)

output :

__eq__ called for (John - 23)
False

So, why there is no __eq__ called for (John - 23) and why True? They are not the same items(same thing happened a second ago with obj1) .

Upvotes: 3

Views: 197

Answers (2)

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 96246

All built-in container types check object identity before testing (potentially expensive) object equality. This behavior is documented:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).

Of course, that isn't how hash-based containers actually check membership. Again, this is an optimization, this spares potentially expensive equality operations, consider checking the equality of two large strings.

Here's a built-in that works the same way, the famously problematic floating point NaN:

>>> nan = float('nan')
>>> nan == nan
False
>>> nan is nan
True
>>> nan is float('nan')
False
>>> nan == float('nan')
False

And this results in the following behavior with sets:

>>> nan = float('nan')
>>> myset = {nan}
>>> nan in myset
True
>>> float('nan') in myset
False

Note, it is false when we create a new float object.

Upvotes: 4

Epsi95
Epsi95

Reputation: 9047

So, why there is no __eq__ called for (John - 23)

This can be easily achieved by using for in

for i in s:
    print(i)
    if i == obj2:
        print('equal', i)
-------Membership testing-------

(John - 23)
__eq__ called for (John - 23) (John - 23)
(Soroush - 25)
__eq__ called for (Soroush - 25) (John - 23)

But obviously, this is not what you want, let's add another person and try the in operator

obj1 = Person('Soroush', 25)
obj2 = Person('John', 23)
obj3 = Person('Kite', 23)
print('\n--------Creation of set---------\n')
s = {obj3, obj2, obj1}
print(s)
print('\n-------Membership testing-------\n')
print(obj2 in s)
-------Membership testing-------
__hash__ called for (John - 23)
__eq__ called for (Kite - 23) (John - 23)
True

Now by changing the set element position

.
.
s = {obj3, obj1, obj2}
.
.
__hash__ called for (John - 23)
__eq__ called for (Kite - 23) (John - 23)
__eq__ called for (Soroush - 25) (John - 23)
True

From this, I came to the conclusion that the True is coming from the check of some parameter for the 3rd element (last example). And that is most obviously the id or memory location in case CPython. Because for this case

.
.
print(Person('John', 23) in s)
.
.
__hash__ called for (John - 23)
__eq__ called for (Kite - 23) (John - 23)
__eq__ called for (Soroush - 25) (John - 23)
__eq__ called for (John - 23) (John - 23)
False

You get false, because id(Person('John', 23)) is obviously not same as obj2. Similar behavior can be reproduced in for in instead of == use is

obj1 = Person('Soroush', 25)
obj2 = Person('John', 23)
print('\n--------Creation of set---------\n')
s = {obj2, obj1}
print(s)
print('\n-------Membership testing-------\n')

for i in s:
    print(i)
    if i is obj2:
        print('equal', i)
     

print(obj2 in s)
-------Membership testing-------

(John - 23)
equal (John - 23)          ------------------------
(Soroush - 25)                                     |
                                                   |
                                                   |--you are missing this output in `in` check
__hash__ called for (John - 23)                    |
True      -----------------------------------------|

Upvotes: 1

Related Questions