Reputation: 4575
I have the function below, which searches an array for duplicate entries and then returns a list of the duplicates. I'd like to speed this piece of my code up. Can anyone suggest a more efficient way?
Code:
def findDupe(array):
    dupelist = []
    for i in range(len(array)):
        for j in range(len(array)):
            comp1 = array[i]
            comp2 = array[j]
            if comp1 == comp2 and i != j:
                if comp2 not in dupelist:
                    dupelist.append(comp2)
    return dupelist
Upvotes: 7
Views: 6915
Reputation: 1
Alternative with no imports required:
dups = set()
distinct_ids = set()
for an_id in ids:
    if an_id not in distinct_ids:
        distinct_ids.add(an_id)
    else:
        dups.add(an_id)
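A minimal, self-contained sketch of the same idea wrapped in a function (the sample data is made up for illustration):

def find_duplicates(ids):
    dups = set()
    distinct_ids = set()
    for an_id in ids:
        if an_id not in distinct_ids:
            distinct_ids.add(an_id)   # first time this value is seen
        else:
            dups.add(an_id)           # seen before, so it is a duplicate
    return dups

print(find_duplicates([1, 2, 3, 2, 4, 3, 3]))   # {2, 3}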
Upvotes: 0
Reputation: 6771
While numpy does not have a duplicated method (yet), pandas does:
import pandas as pd
df = pd.Series(input_list)
duplicated_values = df[df.duplicated()].to_list()
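A quick usage sketch (sample data invented for illustration). Note that Series.duplicated() marks every occurrence after the first, so a value that appears three times shows up twice in the result:

import pandas as pd

input_list = [1, 2, 2, 3, 3, 3]
s = pd.Series(input_list)
print(s[s.duplicated()].to_list())   # [2, 3, 3]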
Upvotes: 0
Reputation: 163
What is the type of elements in your lists?
Upvotes: 0
Reputation: 402423
The idea here is to do a single sweep in linear time. You can use a Counter for this: count each element, then return all those that were counted more than once.
from collections import Counter

def get_duplicates(array):
    c = Counter(array)
    return [k for k in c if c[k] > 1]
The approach above is linear in complexity, but makes two passes over the input: once to count (this is abstracted away by the Counter constructor), and once more to return the non-unique values in the list comprehension. You can optimise this a bit with a yield statement and return duplicates as you find them.
def get_duplicates(array):
    c = Counter()
    seen = set()
    for i in array:
        c[i] += 1
        if c[i] > 1 and i not in seen:
            seen.add(i)
            yield i
This results in a compulsory if check and extra space in the form of a set, but you reduce two passes to one.
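A quick usage sketch (sample input made up for illustration); wrap the generator in list() if you need all the duplicates at once:

print(list(get_duplicates([1, 2, 2, 3, 3, 3])))   # [2, 3]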
Upvotes: 7