Arash Hatami

Reputation: 5551

Remove duplicates in list of model objects

I have a list of model objects (about 1.5 million items).

class Word(object):
    def __init__(self, id, word, arg1, arg2):
        self.id = id
        self.word = word
        self.arg1 = arg1 
        self.arg2 = arg2 

Many objects have the same word property but different id, arg1 and arg2, for example:

1     hi        2      3
2     hi        4      6
3     hello     2      7
4     hi        2      7
5     world     1      9
6     hello     3      3
7     code      5      2

Here hi appears 3 times and hello appears 2 times.

How can I find items that are duplicates by word property and remove them quickly?
Note that the list is very long.

Upvotes: 0

Views: 2387

Answers (3)

Austin

Reputation: 26039

I find it easiest to convert the list to a pandas DataFrame and use drop_duplicates.

Use:

DataFrame.drop_duplicates(subset=['word'], keep='first')

This returns a deduplicated DataFrame: rows with a repeated word value are dropped, and only the first occurrence of each word is kept.

  • If you wish to drop duplicates except for the last occurrence, use keep='last'.

  • To drop every row whose word occurs more than once (keeping none of them), use keep=False.
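
As a rough illustration of the three keep options, here is a minimal sketch using the sample rows from the question. The column names and the way the DataFrame is built from tuples are assumptions made for this example; the real data would come from the Word objects.

```python
import pandas as pd

# Sample rows from the question: (id, word, arg1, arg2).
# Column names are assumed for this sketch.
rows = [
    (1, "hi", 2, 3),
    (2, "hi", 4, 6),
    (3, "hello", 2, 7),
    (4, "hi", 2, 7),
    (5, "world", 1, 9),
    (6, "hello", 3, 3),
    (7, "code", 5, 2),
]
df = pd.DataFrame(rows, columns=["id", "word", "arg1", "arg2"])

# Keep the first row seen for each word.
first = df.drop_duplicates(subset=["word"], keep="first")
# Keep the last row seen for each word.
last = df.drop_duplicates(subset=["word"], keep="last")
# Drop every row whose word appears more than once.
only_unique = df.drop_duplicates(subset=["word"], keep=False)

print(first["id"].tolist())        # [1, 3, 5, 7]
print(last["id"].tolist())         # [4, 5, 6, 7]
print(only_unique["id"].tolist())  # [5, 7]
```

drop_duplicates returns a new DataFrame and leaves df unchanged.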

Upvotes: 2

ettanany

Reputation: 19816

One option is the third-party library toolz.

In your case, you can use it like this:

import toolz

unique_words = toolz.unique(words, key=lambda w: w.word)
# unique_words is a generator object

Output:

>>> unique_words = toolz.unique(words, key=lambda w: w.word)
>>> for w in unique_words:
...     print(w.word)
... 
hi
hello
world
code
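
If installing toolz is not an option, roughly the same behaviour can be sketched in plain Python with a set of already-seen keys. This is not toolz's implementation; unique_by_key is a hypothetical helper name, and the namedtuple is only a stand-in for the question's Word class.

```python
from collections import namedtuple

# Stand-in for the question's Word class, used only in this sketch.
Word = namedtuple("Word", ["id", "word", "arg1", "arg2"])


def unique_by_key(items, key):
    """Yield each item whose key has not been seen before, in input order."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


words = [
    Word(1, "hi", 2, 3),
    Word(2, "hi", 4, 6),
    Word(3, "hello", 2, 7),
    Word(4, "hi", 2, 7),
    Word(5, "world", 1, 9),
    Word(6, "hello", 3, 3),
    Word(7, "code", 5, 2),
]

unique_words = list(unique_by_key(words, key=lambda w: w.word))
print([w.id for w in unique_words])  # [1, 3, 5, 7]
```

This makes a single pass over the list and stores one key per distinct word, so it should scale to 1.5 million items without trouble.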

Upvotes: 0

Marcus Müller

Reputation: 36412

You simply want a set. That's a container that holds at most one object among any that compare equal. Here's the official documentation.

What you want to do is make your Word class hashable:

  • it needs an __eq__(self, other) method that takes another Word and returns True if the other object's .word is equal to its own, and
  • it needs a __hash__ method that returns a hash of the same content. I'd just return hash(self.word).

Then a set makes sure each word is in there only once. You can also take the union, intersection and difference of sets, just as with mathematical sets (the things in curly brackets {}).
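
A minimal sketch of that idea, assuming it is acceptable for two Word objects to compare equal whenever their word matches, regardless of id, arg1 and arg2:

```python
class Word(object):
    def __init__(self, id, word, arg1, arg2):
        self.id = id
        self.word = word
        self.arg1 = arg1
        self.arg2 = arg2

    def __eq__(self, other):
        # Two Word objects count as equal when their word matches.
        if not isinstance(other, Word):
            return NotImplemented
        return self.word == other.word

    def __hash__(self):
        # Must be consistent with __eq__: equal objects, equal hashes.
        return hash(self.word)


words = [
    Word(1, "hi", 2, 3),
    Word(2, "hi", 4, 6),
    Word(3, "hello", 2, 7),
    Word(4, "hi", 2, 7),
    Word(5, "world", 1, 9),
    Word(6, "hello", 3, 3),
    Word(7, "code", 5, 2),
]

# A set keeps one object per distinct word, in no particular order.
unique_words = set(words)
print(sorted(w.word for w in unique_words))  # ['code', 'hello', 'hi', 'world']

# dict.fromkeys keeps the first object per word and preserves input order
# (Python 3.7+).
ordered = list(dict.fromkeys(words))
print([w.id for w in ordered])  # [1, 3, 5, 7]
```

One consequence of this design is that Word(1, "hi", 2, 3) == Word(2, "hi", 4, 6) is now True everywhere in the program, which may or may not be what other code expects.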

A word about your application: at 1.5 million objects, it really looks like you should have a table rather than a list of objects. With one Python object per row, you pay about as much overhead per row as you store in content, if not more.

The pandas library is probably the tool to use here. It very likely makes most of what you've written so far unnecessary.

Upvotes: 4
