Reputation: 11
I need to remove every duplicate item from a list of more than 100 million items. I tried converting the list to a set and back with set(), but that is far too slow and memory-intensive. Are there any other effective ways to do this?
Upvotes: 0
Views: 155
Reputation: 4427
If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq
in the shell, and it can be quite memory-efficient if you use an external (disk-based) sort; note that Python's built-in sort is in-memory.
import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
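For example (my own illustration, not part of the recipe), sorting first makes all duplicates adjacent, so a single pass removes every one:

nums = [3, 1, 2, 3, 2, 1]
nums.sort()                          # duplicates are now adjacent
print(list(unique_justseen(nums)))   # [1, 2, 3]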
Is there a reason you care that this is slow? If you need to do this operation often, then something is probably wrong with the way you are handling your data.
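As for the disk-based route mentioned above: here is a minimal sketch of an external sort-and-dedupe. The names (external_dedupe, spill, read_run), the chunk_size default, and the assumption that items are newline-free strings are all mine, not part of the original answer; it sorts fixed-size chunks in memory, spills each sorted run to a temp file, then merges the runs and drops adjacent duplicates.

import heapq
import tempfile
from itertools import groupby

def external_dedupe(items, chunk_size=1_000_000):
    """Yield unique items using roughly O(chunk_size) memory.

    Assumes items are strings without embedded newlines.
    """
    runs = []
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) >= chunk_size:
            runs.append(spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(spill(sorted(chunk)))
    # heapq.merge streams the sorted runs in order; groupby then
    # yields one representative per group of equal items.
    for item, _ in groupby(heapq.merge(*(read_run(f) for f in runs))):
        yield item

def spill(sorted_chunk):
    # Write one sorted run to a temp file, one item per line.
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(item + "\n" for item in sorted_chunk)
    f.seek(0)
    return f

def read_run(f):
    # Stream items back from a spilled run.
    for line in f:
        yield line.rstrip("\n")

Only one chunk plus the merge buffers ever lives in memory at once, which is the point of the sort | uniq approach for 100M+ items.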
Upvotes: 2