Bdfy
Bdfy

Reputation: 24621

How to remove duplicates in set for objects?

I have set of objects:

class Test(object):
    def __init__(self):
        self.i = random.randint(1,10)


res = set()

for i in range(0,1000):
    res.add(Test())

print len(res) = 1000

How to remove duplicates from set of objects ?

Thanks for answers, it's work:

class Test(object):
    def __init__(self, i):
        self.i = i
    #   self.i = random.randint(1,10)
    #   self.j = random.randint(1,20)

    def __keys(self):
        t = ()
        for key in self.__dict__:
            t = t + (self.__dict__[key],)
        return t

    def __eq__(self, other):
        return isinstance(other, Test) and self.__keys() == other.__keys()

    def __hash__(self):
        return hash(self.__keys())

res = set()

res.add(Test(2))
...
res.add(Test(8))

result: [2,8,3,4,5,6,7]

but how to save order ? Sets not support order. Can i use list instead set for example ?

Upvotes: 6

Views: 2095

Answers (4)

abarnert
abarnert

Reputation: 365717

Your first question is already answered by Pavel Anossov.

But you have another question:

but how to save order ? Sets not support order. Can i use list instead set for example ?

You can use a list, but there are a few downsides:

  • You get the wrong interface.
  • You don't get automatic handling of duplicates. You have to explicitly write if foo not in res: res.append(foo). Obviously, you can wrap this up in a function instead of writing it repeatedly, but it's still extra work.
  • It's going to be a lot less efficient if the collection can get large. Basically, adding a new element, checking whether an element already exists, etc. are all going to be O(N) instead of O(1).

What you want is something that works like an ordered set. Or, equivalently, like a list that doesn't allow duplicates.

If you do all your adds first, and then all your lookups, and you don't need lookups to be fast, you can get around this by first building a list, then using unique_everseen from the itertools recipes to remove duplicates.

Or you could just keep a set and a list or elements by order (or a list plus a set of elements seen so far). But that can get a bit complicated, so you might want to wrap it up.

Ideally, you want to wrap it up in a type that has exactly the same API as set. Something like an OrderedSet akin to collections.OrderedDict.

Fortunately, if you scroll to the bottom of that docs page, you'll see that exactly what you want already exists; there's a link to an OrderedSet recipe at ActiveState.

So, copy it, paste it into your code, then just change res = set() to res = OrderedSet(), and you're done.

Upvotes: 1

Blckknght
Blckknght

Reputation: 104712

Pavel Anossov's answer is great for allowing your class to be used in a set with the semantics you want. However, if you want to preserve the order of your items, you'll need a bit more. Here's a function that de-duplicates a list, as long as the list items are hashable:

def dedupe(lst):
    seen = set()
    results = []
    for item in lst:
        if item not in seen:
            seen.add(item)
            results.append(item)
    return results

A slightly more idiomatic version would be a generator, rather than a function that returns a list. This gets rid of the results variable, using yield rather than appending the unique values to it. I've also renamed the lst parameter to iterable, since it will work just as well on any iterable object (such as another generator).

def dedupe(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Upvotes: 0

Dvx
Dvx

Reputation: 279

I think you can easily do what you want with a list as you asked in your first post since you defined the eq operator :

l = []
if Test(0) not in l : 
    l.append(Test(0))

My 2 cts ...

Upvotes: 0

Pavel Anossov
Pavel Anossov

Reputation: 62908

Your objects must be hashable (i.e. must have __eq__() and __hash__() defined) for sets to work properly with them:

class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)

    def __eq__(self, other):
        return self.i == other.i

    def __hash__(self):
        return self.i

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

 

If you have several attributes, hash and compare a tuple of them (thanks, delnan):

class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
        self.k = random.randint(1, 10)
        self.j = random.randint(1, 10)

    def __eq__(self, other):
        return (self.i, self.k, self.j) == (other.i, other.k, other.j)

    def __hash__(self):
        return hash((self.i, self.k, self.j))

Upvotes: 14

Related Questions