user3476463

Reputation: 4575

Efficiently finding duplicates in a list

I have the function below, which searches an array for duplicate entries and then returns a list of the duplicates. I'd like to speed up this piece of my code; can anyone suggest a more efficient way?

Code:

def findDupe(array):
    dupelist = []
    for i in range(len(array)):      # nested scan compares every pair
        for j in range(len(array)):  # of indices in the array
            comp1 = array[i]
            comp2 = array[j]
            if comp1 == comp2 and i != j:
                if comp2 not in dupelist:  # 'in' on a list is itself a linear scan
                    dupelist.append(comp2)
    return dupelist
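
For example, findDupe([1, 2, 2, 3, 3]) returns [2, 3], but the nested loops make the cost quadratic in the length of the input.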

Upvotes: 7

Views: 6915

Answers (4)

barsuk

Reputation: 1

An alternative with no imports required:

dups = set()
distinct_ids = set()
for an_id in ids:  # 'ids' is the input list
    if an_id not in distinct_ids:
        distinct_ids.add(an_id)  # first time we see this value
    else:
        dups.add(an_id)  # already seen, so it's a duplicate
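
For example, with ids = [1, 2, 2, 3, 3, 3], dups ends up as {2, 3}. Since set membership tests are O(1) on average, the whole loop runs in linear time.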

Upvotes: 0

mirekphd

Reputation: 6771

While numpy does not have the duplicated method (yet), pandas does:

import pandas as pd

# input_list is the list to search; duplicated() flags every
# occurrence after the first one
s = pd.Series(input_list)
duplicated_values = s[s.duplicated()].to_list()
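
Note that duplicated() leaves the first occurrence unmarked, so a value appearing three times shows up twice in the result. If you want each duplicated value only once, chaining unique() should work (same hypothetical input_list as above):

duplicated_values = s[s.duplicated()].unique().tolist()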

Upvotes: 0

Quentin Fortier

Reputation: 163

What is the type of the elements in your lists?

  1. Storing elements in a set, as explained in the other answers, gives you Θ(n) average complexity, but requires the elements to be hashable (a set in Python is implemented with a hash table; see https://wiki.python.org/moin/TimeComplexity).
  2. If you only have a comparison function, you can sort the list in worst case Θ(n log n) and then compare each element of the list to the next (see the sketch after this list).
  3. With a comparison function, you can also implement a set with a (balanced) BST and proceed as in 1, achieving Θ(n log n) even in the worst case.
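
A minimal sketch of option 2, assuming only that the elements are orderable (the function name is mine):

def get_duplicates_by_sorting(array):
    # sort a copy in Θ(n log n), then scan adjacent pairs
    ordered = sorted(array)
    dupes = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev == cur and (not dupes or dupes[-1] != cur):
            dupes.append(cur)  # record each duplicated value once
    return dupes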

Upvotes: 0

cs95

Reputation: 402423

The idea here is to do a single sweep in linear time using a counter: count each element, then return all those which were counted more than once.

from collections import Counter

def get_duplicates(array):
    c = Counter(array)                 # first pass: count every element
    return [k for k in c if c[k] > 1]  # second pass: keep repeated keys

The approach above is linear in complexity but makes two passes over the input: one to count (abstracted away by the Counter constructor) and another to return the non-unique values in the list comprehension. You can optimise this a bit with a yield statement and return duplicates as you find them.

def get_duplicates(array):
    c = Counter()
    seen = set()
    for i in array:
        c[i] += 1
        if c[i] > 1 and i not in seen:  # second occurrence of i
            seen.add(i)
            yield i                     # emit each duplicate exactly once

This results in a compulsory if check and extra space in the form of a set, but you reduce two passes to one.
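
Since this version is a generator, wrap it in list() when you need a concrete list (the input below is just a sample):

list(get_duplicates([1, 1, 2, 3, 4, 4]))
# -> [1, 4]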

Upvotes: 7
