Channel72

Reputation: 24719

Python: detect duplicates using a set

I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.

So, as a test I tried the following:

class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000)) # try adding a duplicate

for p in myset: print(p)

Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:

baz:30
foo:10
bar:20
foo:1000

Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?

Upvotes: 9

Views: 2964

Answers (5)

ninjagecko

Reputation: 91092

A hash function effectively says "A maybe equals B" or "A not equals B (for sure)".

If it says "maybe equals" then equality has to be checked anyway to make sure, which is why you also need to implement __eq__.

Nevertheless, defining __hash__ will significantly speed things up by making "A not equals B (for sure)" an O(1) operation in the common case.

The hash function must however always follow the "hash rule":

  • "hash rule": equal things must hash to the same value
  • (justification: or else we'd say "A not equals B (for sure)" when that is not the case)

For example you could hash everything by def __hash__(self): return 1. This would still be correct, but it would be inefficient because you'd have to check __eq__ each time, which may be a long process if you have complicated large data structures (e.g. with large lists, dictionaries, etc.).
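To illustrate the constant-hash case, here is a small sketch using a hypothetical Point class (not from the question). The set still removes duplicates correctly because __eq__ makes the final decision; the eq_calls counter is only an illustrative addition that shows equality really is consulted.

```python
class Point:
    eq_calls = 0  # counts __eq__ invocations, for illustration only

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        Point.eq_calls += 1
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Legal but slow: every instance collides with every other one,
        # so the set must fall back on __eq__ to tell them apart.
        return 1


# 50 objects, but only 5 distinct (x, y) combinations.
points = {Point(i % 5, 0) for i in range(50)}
print(len(points))         # 5
print(Point.eq_calls > 0)  # True
```

With a hash that actually depends on x and y, most of those equality calls would be skipped.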

Do note that you technically do follow the "hash rule" by ignoring age in your implementation, def __hash__(self): return hash(self.name). If Bob is a person of age 20 and Bob is another person of age 30, and they are different people (likely, unless this is some sort of keeps-track-of-people-over-time-as-they-age program), then they will hash to the same value and have to be compared with __eq__. This is perfectly fine, but I would implement it like so:

def __hash__(self):
    return hash( (self.name, self.age) )

Do note that your way is still correct. It would, however, have been a coding error to use hash( (self.name, self.age) ) in a world where Person("Bob", age=20) and Person("Bob", age=30) were actually the same person, because the hash function would say they are different, and the set would then never call the equals function that says they are the same.
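A rough sketch of that coding error, using a hypothetical BrokenPerson class: __eq__ ignores age, but __hash__ includes it, so the "hash rule" is violated and the set keeps both objects.

```python
class BrokenPerson:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        if not isinstance(other, BrokenPerson):
            return NotImplemented
        return self.name == other.name  # age is ignored here

    def __hash__(self):
        # Bug: the hash depends on age although __eq__ does not.
        return hash((self.name, self.age))


a = BrokenPerson("Bob", 20)
b = BrokenPerson("Bob", 30)

print(a == b)       # True
# In practice this prints 2: the hashes differ, so the set never asks __eq__.
print(len({a, b}))
```

Hashing only the attributes that __eq__ compares avoids the mismatch.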

Upvotes: 1

Sven Marnach

Reputation: 601649

A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the == operator to make sure they are really equal. In your case, this will only yield True if the two objects are the same object (the standard implementation for user-defined classes).

Long story short: Also define __eq__() to make it work.
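A minimal sketch of that fix, assuming (as in the question) that two Person objects should count as equal whenever their names match. The isinstance check that returns NotImplemented is my own addition for comparisons against other types.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Equal when the names match; age is deliberately ignored.
        if not isinstance(other, Person):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # Agrees with __eq__: equal objects produce equal hashes.
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)


people = set()
for name, age in [("foo", 10), ("bar", 20), ("baz", 30), ("foo", 1000)]:
    people.add(Person(name, age))

# The second "foo" compares equal to the first, so it is not added.
print(sorted(str(p) for p in people))  # ['bar:20', 'baz:30', 'foo:10']
```

Note that the set keeps the object that was inserted first, so foo:10 survives and foo:1000 is dropped.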

Upvotes: 4

ralphtheninja

Reputation: 133008

You also need to define the __eq__() method.

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798676

You forgot to also define __eq__().

If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
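The "wrong hash bucket" problem can be sketched with a hypothetical Tag class whose hash depends on a mutable attribute. This is only an illustration of the failure mode, not code from the question.

```python
class Tag:
    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        if not isinstance(other, Tag):
            return NotImplemented
        return self.label == other.label

    def __hash__(self):
        # Depends on a mutable attribute, which is the problem.
        return hash(self.label)


t = Tag("draft")
tags = {t}
print(t in tags)  # True

# Mutating the object changes its hash after it was stored.
t.label = "final"

# The set still files t under the old hash, so a lookup with the new
# hash does not find it, even though it is the very same object.
print(t in tags)

# Looking up the old label finds the stored entry, but __eq__ now
# compares "draft" against "final" and rejects it.
print(Tag("draft") in tags)
```

In practice both of the last two lookups report False, so the object is still in the set but can no longer be found through it.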

Upvotes: 13

VGE

Reputation: 4191

A hash function alone is not enough to distinguish objects; you also have to implement the comparison function (i.e. __eq__).

Upvotes: 2
