Reputation: 11
I need to remove every duplicate item from a list of more than 100 million items. I tried converting the list to a set and back with set(), but that is far too slow and memory-intensive. Are there any other effective ways to do this?
Upvotes: 0
Views: 155
Reputation: 4427
If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq
in the shell, and it can be quite memory-efficient if you use an external (disk-based) sort; note that Python's built-in sort is in-memory.
import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
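For example (my own illustration, not part of the recipe), sorting first makes all duplicates adjacent, so a single pass removes every one:

nums = [3, 1, 2, 3, 2, 1]
nums.sort()                          # duplicates are now adjacent
print(list(unique_justseen(nums)))   # [1, 2, 3]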
Is there a reason you care that this is slow? If you need to do this operation often, then something is probably wrong with the way you are handling your data.
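As for the disk-based route mentioned above: here is a minimal sketch of an external sort-and-dedupe. The names (external_dedupe, spill, read_run), the chunk_size default, and the assumption that items are newline-free strings are all mine, not part of the original answer; it sorts fixed-size chunks in memory, spills each sorted run to a temp file, then merges the runs and drops adjacent duplicates.

import heapq
import tempfile
from itertools import groupby

def external_dedupe(items, chunk_size=1_000_000):
    """Yield unique items using roughly O(chunk_size) memory.

    Assumes items are strings without embedded newlines.
    """
    runs = []
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) >= chunk_size:
            runs.append(spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(spill(sorted(chunk)))
    # heapq.merge streams the sorted runs in order; groupby then
    # yields one representative per group of equal items.
    for item, _ in groupby(heapq.merge(*(read_run(f) for f in runs))):
        yield item

def spill(sorted_chunk):
    # Write one sorted run to a temp file, one item per line.
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(item + "\n" for item in sorted_chunk)
    f.seek(0)
    return f

def read_run(f):
    # Stream items back from a spilled run.
    for line in f:
        yield line.rstrip("\n")

Only one chunk plus the merge buffers ever lives in memory at once, which is the point of the sort | uniq approach for 100M+ items.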
Upvotes: 2