Reputation: 6315

How do I check if there are duplicates in a flat list?

For example, given the list ['one', 'two', 'one'], the algorithm should return True, whereas given ['one', 'two', 'three'] it should return False.

Upvotes: 265

Answers (16)

Stefan Pochmann

Reputation: 28606

A few more creative solutions...

Solution 1: Batches

The simple len(set(xs)) < len(xs) solution is great for large lists without duplicates, but bad if there is an early duplicate where other solutions stop early without even looking at the rest. We can get fast times in both cases by working in batches, i.e., a middle-ground between "all at once" and "one by one":

def batches(xs):
    seen = set()
    n = 0
    for batch in batched(xs, 100):
        n += len(batch)
        seen.update(batch)
        if len(seen) < n:
            return True
    return False

Uses itertools.batched. I chose batch size 100 because that was still 10% faster than batch size 50, but beyond 100 there was no noticeable speed-up.

As free bonus, this works not only with lists or other collections that support len(xs), but also with other iterables.

Solution 2: `set.isdisjoint`

def isdisjoint(xs):
    seen = set()
    ahead = iter(xs)
    next(ahead, None)
    return not seen.isdisjoint(compress(
        ahead,
        zip(map(seen.add, xs))
    ))

Let's say xs is [1, 2, 3, 4, ...]. My ahead iterator is one item ahead, so it yields 2, 3, 4, etc. My zip(map(seen.add, xs)) adds 1, 2, 3, etc to the set, and always yields (None,). So in the first step, 1 gets added to the set and then its isdisjoint checks whether it contains 2. Then in the second step, 2 gets added to the set and then its isdisjoint checks whether it contains 3. And so on. (Due to the shift, this doesn't check whether the first element is already in the set and doesn't add the last element to the set. That is ok/good.)

Uses set.isdisjoint and itertools.compress.

Solution 3: Another compress

Similar to solution 2. But here, compress only yields when a duplicate is found.

def compress_contains(xs):
    seen = set()
    ahead = iter(xs)
    next(ahead, None)
    return None in compress(
        map(seen.add, xs),
        map(seen.__contains__, ahead)
    )

Solution 4: Shorter but slower

Turned out not quite as fast as the others, but it's a bit shorter and I still find it interesting.

def shorter_but_slower(xs):
    seen = set()
    return any(compress(
        map(seen.__contains__, xs),
        zip(map(seen.add, xs))
    ))

Benchmark results

A case without duplicates:

xs = list(range(10**5))

   3.0 ± 0.0 ms  set_size
   3.9 ± 0.0 ms  batches
   7.1 ± 0.1 ms  isdisjoint
   7.4 ± 0.1 ms  loop_optimized
   7.5 ± 0.1 ms  loop
   7.7 ± 0.1 ms  compress_contains
   8.8 ± 0.1 ms  shorter_but_slower

With an immediate duplicate:

xs = [0, *range(10**5)]

   0.3 ± 0.0 μs  loop
   0.4 ± 0.0 μs  loop_optimized
   0.8 ± 0.0 μs  isdisjoint
   0.8 ± 0.0 μs  compress_contains
   1.0 ± 0.0 μs  shorter_but_slower
   3.2 ± 0.0 μs  batches
2928.6 ± 70.3 μs  set_size

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

Benchmark code

def batches(xs):
    seen = set()
    n = 0
    for batch in batched(xs, 100):
        n += len(batch)
        seen.update(batch)
        if len(seen) < n:
            return True
    return False


def isdisjoint(xs):
    seen = set()
    ahead = iter(xs)
    next(ahead, None)
    return not seen.isdisjoint(compress(
        ahead,
        zip(map(seen.add, xs))
    ))


def compress_contains(xs):
    seen = set()
    ahead = iter(xs)
    next(ahead, None)
    return None in compress(
        map(seen.add, xs),
        map(seen.__contains__, ahead)
    )


def shorter_but_slower(xs):
    seen = set()
    return any(compress(
        map(seen.__contains__, xs),
        zip(map(seen.add, xs))
    ))


def set_size(xs):
    return len(set(xs)) < len(xs)


def loop(xs):
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False


def loop_optimized(xs):
    seen = set()
    add = seen.add
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False


funcs = [
    batches,
    isdisjoint,
    compress_contains,
    shorter_but_slower,
    set_size,
    loop,
    loop_optimized,
]

from itertools import *
from time import perf_counter as time
from statistics import mean, stdev
import sys
import random

# Correctness
for n in range(7):
    for xs in product(range(7), repeat=n):
        expect = set_size(xs)
        for f in funcs:
            if f(xs) is not expect:
                print('fail', f.__name__)

# Speed
def benchmark(xs_str, reps, unit, scale):
    print('xs =', xs_str)
    print()
    xs = eval(xs_str)
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * scale for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.1f} ± {stdev(ts):3.1f} {unit} '
    for _ in range(100):
        random.shuffle(funcs)
        for f in funcs:
            t = time()
            for _ in range(reps):
                f(xs)
            times[f].append((time() - t) / reps)
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

benchmark('list(range(10**5))', 1, 'ms', 1e3)
benchmark('[0, *range(10**5)]', 10, 'μs', 1e6)

print('Python:', sys.version)

Attempt This Online!

Upvotes: 0

moon prism power

Reputation: 2347

Another solution is to use slicing, which will also work with strings and other enumerable things.

def has_duplicates(x):
    for idx, item in enumerate(x):
        if item in x[(idx + 1):]:
            return True
    return False


>>> has_duplicates(["a", "b", "c"])
False
>>> has_duplicates(["a", "b", "b", "c"])
True
>>> has_duplicates("abc")
False
>>> has_duplicates("abbc")
True

Upvotes: 1

ahmed meraj

Reputation: 886

def check_duplicates(my_list):
    seen = {}
    for item in my_list:
        if seen.get(item):
            return True
        seen[item] = True
    return False

Upvotes: 0

Jay Desai

Reputation: 475

my_list = ['one', 'two', 'one']

duplicates = []

for value in my_list:
  if my_list.count(value) > 1:
    if value not in duplicates:
      duplicates.append(value)

print(duplicates) //["one"]

Upvotes: 3

Hewey Dewey

Reputation: 149

I used pyrospade's approach, for its simplicity, and modified that slightly on a short list made from the case-insensitive Windows registry.

If the raw PATH value string is split into individual paths all 'null' paths (empty or whitespace-only strings) can be removed by using:

PATH_nonulls = [s for s in PATH if s.strip()]

def HasDupes(aseq) :
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq) :
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

def DelDupes(aseq) :
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]

The original PATH has both 'null' entries and duplicates for testing purposes:

[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment
  1  C:\Python37\
  2
  3
  4  C:\Python37\Scripts\
  5  c:\python37\
  6  C:\Program Files\ImageMagick-7.0.8-Q8
  7  C:\Program Files (x86)\poppler\bin
  8  D:\DATA\Sounds
  9  C:\Program Files (x86)\GnuWin32\bin
 10  C:\Program Files (x86)\Intel\iCLS Client\
 11  C:\Program Files\Intel\iCLS Client\
 12  D:\DATA\CCMD\FF
 13  D:\DATA\CCMD
 14  D:\DATA\UTIL
 15  C:\
 16  D:\DATA\UHELP
 17  %SystemRoot%\system32
 18
 19
 20  D:\DATA\CCMD\FF%SystemRoot%
 21  D:\DATA\Sounds
 22  %SystemRoot%\System32\Wbem
 23  D:\DATA\CCMD\FF
 24
 25
 26  c:\
 27  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\
 28

Null paths have been removed, but still has duplicates, e.g., (1, 3) and (13, 20):

    [list]  Null paths removed from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  c:\python37\
  4  C:\Program Files\ImageMagick-7.0.8-Q8
  5  C:\Program Files (x86)\poppler\bin
  6  D:\DATA\Sounds
  7  C:\Program Files (x86)\GnuWin32\bin
  8  C:\Program Files (x86)\Intel\iCLS Client\
  9  C:\Program Files\Intel\iCLS Client\
 10  D:\DATA\CCMD\FF
 11  D:\DATA\CCMD
 12  D:\DATA\UTIL
 13  C:\
 14  D:\DATA\UHELP
 15  %SystemRoot%\system32
 16  D:\DATA\CCMD\FF%SystemRoot%
 17  D:\DATA\Sounds
 18  %SystemRoot%\System32\Wbem
 19  D:\DATA\CCMD\FF
 20  c:\
 21  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

And finally, the dupes have been removed:

[list]  Massaged path list from in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  C:\Program Files\ImageMagick-7.0.8-Q8
  4  C:\Program Files (x86)\poppler\bin
  5  D:\DATA\Sounds
  6  C:\Program Files (x86)\GnuWin32\bin
  7  C:\Program Files (x86)\Intel\iCLS Client\
  8  C:\Program Files\Intel\iCLS Client\
  9  D:\DATA\CCMD\FF
 10  D:\DATA\CCMD
 11  D:\DATA\UTIL
 12  C:\
 13  D:\DATA\UHELP
 14  %SystemRoot%\system32
 15  D:\DATA\CCMD\FF%SystemRoot%
 16  %SystemRoot%\System32\Wbem
 17  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

Upvotes: 1

Evandro Coan

Reputation: 9428

I found this to do the best performance because it short-circuit the operation when the first duplicated it found, then this algorithm has time and space complexity O(n) where n is the list's length:

def has_duplicated_elements(iterable):
    """ Given an `iterable`, return True if there are duplicated entries. """
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

Upvotes: 2

F1Rumors

Reputation: 948

I recently answered a related question to establish all the duplicates in a list, using a generator. It has the advantage that if used just to establish 'if there is a duplicate' then you just need to get the first item and the rest can be ignored, which is the ultimate shortcut.

This is an interesting set based approach I adapted straight from moooeeeep:

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

Accordingly, a full list of dupes would be list(getDupes(etc)). To simply test "if" there is a dupe, it should be wrapped as follows:

def hasDupes(l):
    try:
        if getDupes(l).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

This scales well and provides consistent operating times wherever the dupe is in the list -- I tested with lists of up to 1m entries. If you know something about the data, specifically, that dupes are likely to show up in the first half, or other things that let you skew your requirements, like needing to get the actual dupes, then there are a couple of really alternative dupe locators that might outperform. The two I recommend are...

Simple dict based approach, very readable:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

Leverage itertools (essentially an ifilter/izip/tee) on the sorted list, very efficient if you are getting all the dupes though not as quick to get just the first:

def getDupes(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.ifilter(lambda x: x[0]==x[1], itertools.izip(a, b)):
        if k != r:
            yield k
            r = k

These were the top performers from the approaches I tried for the full dupe list, with the first dupe occurring anywhere in a 1m element list from the start to the middle. It was surprising how little overhead the sort step added. Your mileage may vary, but here are my specific timed results:

Finding FIRST duplicate, single dupe places "n" elements in to 1m element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157

Upvotes: 5

MSeifert

Reputation: 152647

I thought it would be useful to compare the timings of the different solutions presented here. For this I used my own library simple_benchmark:

So indeed for this case the solution from Denis Otkidach is fastest.

Some of the approaches also exhibit a much steeper curve, these are the approaches that scale quadratic with the number of elements (Alex Martellis first solution, wjandrea and both of Xavier Decorets solutions). Also important to mention is that the pandas solution from Keiku has a very big constant factor. But for larger lists it almost catches up with the other solutions.

And in case the duplicate is at the first position. This is useful to see which solutions are short-circuiting:

Here several approaches don't short-circuit: Kaiku, Frank, Xavier_Decoret (first solution), Turn, Alex Martelli (first solution) and the approach presented by Denis Otkidach (which was fastest in the no-duplicate case).

I included a function from my own library here: iteration_utilities.all_distinct which can compete with the fastest solution in the no-duplicates case and performs in constant-time for the duplicate-at-begin case (although not as fastest).

The code for the benchmark:

from collections import Counter
from functools import reduce

import pandas as pd
from simple_benchmark import BenchmarkBuilder
from iteration_utilities import all_distinct

b = BenchmarkBuilder()

@b.add_function()
def Keiku(l):
    return pd.Series(l).duplicated().sum() > 0

@b.add_function()
def Frank(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

@b.add_function()
def wjandrea(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

@b.add_function()
def user(iterable):
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

@b.add_function()
def Turn(l):
    return Counter(l).most_common()[0][1] > 1

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

@b.add_function()          
def F1Rumors(l):
    try:
        if next(getDupes(l)): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def decompose(a_list):
    return reduce(
        lambda u, o : (u[0].union([o]), u[1].union(u[0].intersection([o]))),
        a_list,
        (set(), set()))

@b.add_function()
def Xavier_Decoret_1(l):
    return not decompose(l)[1]

@b.add_function()
def Xavier_Decoret_2(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

@b.add_function()
def pyrospade(xs):
    s = set()
    return any(x in s or s.add(x) for x in xs)

@b.add_function()
def Alex_Martelli_1(thelist):
    return any(thelist.count(x) > 1 for x in thelist)

@b.add_function()
def Alex_Martelli_2(thelist):
    seen = set()
    for x in thelist:
        if x in seen: return True
        seen.add(x)
    return False

@b.add_function()
def Denis_Otkidach(your_list):
    return len(your_list) != len(set(your_list))

@b.add_function()
def MSeifert04(l):
    return not all_distinct(l)

And for the arguments:


# No duplicate run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, list(range(size))

# Duplicate at beginning run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, [0, *list(range(size)]

# Running and plotting
r = b.run()
r.plot()

Upvotes: 40

wjandrea

Reputation: 32963

If the list contains unhashable items, you can use Alex Martelli's solution but with a list instead of a set, though it's slower for larger inputs: O(N^2).

def has_duplicates(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

Upvotes: 1

Keiku

Reputation: 8803

A more simple solution is as follows. Just check True/False with pandas .duplicated() method and then take sum. Please also see pandas.Series.duplicated — pandas 0.24.1 documentation

import pandas as pd

def has_duplicated(l):
    return pd.Series(l).duplicated().sum() > 0

print(has_duplicated(['one', 'two', 'one']))
# True
print(has_duplicated(['one', 'two', 'three']))
# False

Upvotes: 0

Christopher Sparks

Reputation: 65

I dont really know what set does behind the scenes, so I just like to keep it simple.

def dupes(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    if len(dupes) != 0:
        return False
    else:
        return True

Upvotes: -1

Turn

Reputation: 7020

Another way of doing this succinctly is with Counter.

To just determine if there are any duplicates in the original list:

from collections import Counter

def has_dupes(l):
    # second element of the tuple has number of repetitions
    return Counter(l).most_common()[0][1] > 1

Or to get a list of items that have duplicates:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]

Upvotes: 4

Denis Otkidach

Reputation: 33200

Use set() to remove duplicates if all values are hashable:

>>> your_list = ['one', 'two', 'one']
>>> len(your_list) != len(set(your_list))
True

Upvotes: 528

pyrospade

Reputation: 8078

This is old, but the answers here led me to a slightly different solution. If you are up for abusing comprehensions, you can get short-circuiting this way.

xs = [1, 2, 1]
s = set()
any(x in s or s.add(x) for x in xs)
# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))

Upvotes: 14

Xavier Decoret

Reputation: 519

If you are fond of functional programming style, here is a useful function, self-documented and tested code using doctest.

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1,2,3,5,3,2,6])
    (set([1, 2, 3, 5, 6]), set([2, 3]))
    """
    return reduce(
        lambda (u, d), o : (u.union([o]), d.union(u.intersection([o]))),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

From there you can test unicity by checking whether the second element of the returned pair is empty:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

Note that this is not efficient since you are explicitly constructing the decomposition. But along the line of using reduce, you can come up to something equivalent (but slightly less efficient) to answer 5:

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

Upvotes: 9

Alex Martelli

Reputation: 881645

Recommended for short lists only:

any(thelist.count(x) > 1 for x in thelist)

Do not use on a long list -- it can take time proportional to the square of the number of items in the list!

For longer lists with hashable items (strings, numbers, &c):

def anydup(thelist):
  seen = set()
  for x in thelist:
    if x in seen: return True
    seen.add(x)
  return False

If your items are not hashable (sublists, dicts, etc) it gets hairier, though it may still be possible to get O(N logN) if they're at least comparable. But you need to know or test the characteristics of the items (hashable or not, comparable or not) to get the best performance you can -- O(N) for hashables, O(N log N) for non-hashable comparables, otherwise it's down to O(N squared) and there's nothing one can do about it:-(.

Upvotes: 70

How do I check if there are duplicates in a flat list?

Answers (16)

Solution 1: Batches

Solution 2: set.isdisjoint

Solution 3: Another compress

Solution 4: Shorter but slower

Benchmark results

Benchmark code

Related Questions

Solution 2: `set.isdisjoint`