Reputation: 5551
I have a list of model objects (about 1.5 million entries).
class Word(object):
    def __init__(self, id, word, arg1, arg2):
        self.id = id
        self.word = word
        self.arg1 = arg1
        self.arg2 = arg2
There are many objects that have the same word
property but different id, arg1, and arg2 values,
for example:
1 hi 2 3
2 hi 4 6
3 hello 2 7
4 hi 2 7
5 world 1 9
6 hello 3 3
7 code 5 2
Here hi occurs 3 times
and hello occurs 2 times.
How can I find duplicate items by the word
property and remove them in a fast way?
Note that the list is very long.
Upvotes: 0
Views: 2387
Reputation: 26039
I find converting to a dataframe and using drop_duplicates
easy.
Use:
DataFrame.drop_duplicates(subset=['word'], keep='first')
This will return a deduplicated dataframe, deleting all rows that repeat a word
value while preserving the first occurrence.
If you wish to drop duplicates except for the last occurrence, use keep='last'.
To drop all duplicates, use keep=False.
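A minimal sketch of that approach, assuming the Word class and the sample data from the question:

```python
import pandas as pd

class Word(object):
    def __init__(self, id, word, arg1, arg2):
        self.id = id
        self.word = word
        self.arg1 = arg1
        self.arg2 = arg2

words = [Word(1, "hi", 2, 3), Word(2, "hi", 4, 6), Word(3, "hello", 2, 7),
         Word(4, "hi", 2, 7), Word(5, "world", 1, 9), Word(6, "hello", 3, 3),
         Word(7, "code", 5, 2)]

# Build a DataFrame from each object's attribute dict
df = pd.DataFrame([vars(w) for w in words])

# Keep only the first row for each distinct 'word'
deduped = df.drop_duplicates(subset=['word'], keep='first')
print(deduped['id'].tolist())  # -> [1, 3, 5, 7]
```

From here you can keep working in pandas, or rebuild Word objects from the surviving rows if you still need them.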
Upvotes: 2
Reputation: 19816
An option would be using the third-party library toolz.
In your case, you can use it like this:
import toolz
unique_words = toolz.unique(words, key=lambda w: w.word)
# unique_words is a generator object
Output:
>>> unique_words = toolz.unique(words, key=lambda w: w.word)
>>> for w in unique_words:
... print(w.word)
...
hi
hello
world
code
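If you would rather avoid the extra dependency, the same first-wins behaviour is easy to approximate with a seen-set generator (a sketch of the idea, not toolz's actual implementation):

```python
def unique_by(items, key):
    """Yield items whose key has not been seen before (first occurrence wins)."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item

class Word(object):
    def __init__(self, id, word, arg1, arg2):
        self.id = id
        self.word = word
        self.arg1 = arg1
        self.arg2 = arg2

words = [Word(1, "hi", 2, 3), Word(2, "hi", 4, 6), Word(3, "hello", 2, 7),
         Word(4, "hi", 2, 7), Word(5, "world", 1, 9), Word(6, "hello", 3, 3),
         Word(7, "code", 5, 2)]

unique_words = list(unique_by(words, key=lambda w: w.word))
print([w.word for w in unique_words])  # -> ['hi', 'hello', 'world', 'code']
```

Because this is a generator, it streams through the 1.5 million objects in a single pass without building a second full copy of the list up front.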
Upvotes: 0
Reputation: 36412
You simply want a set
. That's a container where only one object with the same key can exist. Here's the official documentation.
What you want to do is make your Word
class hashable, by adding:
- an __eq__(self, other) method that takes another word and returns True if the other word's .word
is equal to its own, and
- a __hash__ method that returns a hash of its content. I'd just return hash(self.word).
Then, set
makes sure each word is only in there once. You can also combine, intersect, and subtract sets as you can with mathematical sets (the things in curly brackets {}).
A word about your application: at 1.5 million objects, it really looks like you should be using a table rather than a list of objects, because a list of objects means you have about as much overhead per row as content (if not more).
Python's pandas module is probably the tool to use here. It very likely obsoletes most of the code you've written so far.
Upvotes: 4