Reputation: 394955
How can I most efficiently remove duplicate items from a list and sort it in Python?
Say I have a list:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
I could iterate over a copy of the list (since you should not mutate a list while iterating over it), item by item, and remove all occurrences of each item except one:
for item in my_list[:]:  # must iterate over a copy because we mutate the original
    count = my_list.count(item)  # see how many are in the list
    if count > 1:
        for _ in range(count - 1):  # remove all but one occurrence
            my_list.remove(item)
Which removes the redundant items:
['b', 'c', 'a', 'd', 'f', 'e']
and then sort the list:
my_list.sort()
so my_list is now:
['a', 'b', 'c', 'd', 'e', 'f']
But what's the most efficient and direct (i.e. performant) way to remove the duplicate elements and sort this list?
*This question came up at work (I wanted badly to answer it, but one of our most senior Python developers got to it before me). I also brought it up at my local Python meetup group, and few people had a good answer for it, so I'm answering it Q&A style, as suggested by Stack Overflow.
Upvotes: 6
Views: 1031
Reputation: 394955
The best way to remove duplicate elements from a list is to convert it to a set, and since sorted accepts any iterable and returns a list, this is far more efficient than doing it piecewise:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def sorted_set(a_list):
    return sorted(set(a_list))

new_list = sorted_set(my_list)
and new_list is:
['a', 'b', 'c', 'd', 'e', 'f']
The downside of this approach is that elements given to set must be hashable, so if the elements are unhashable, you'll get an error:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> sorted(set(my_list))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
This trivial case can be addressed by converting the sublists to tuples; that may well be faster than the piecewise solution from the question, which performs repeated (and potentially expensive) equality tests:
>>> my_list = [tuple(i) for i in my_list]
>>> sorted(set(my_list))
[('a',), ('b',), ('c',)]
But other cases would need different workarounds. That would not be necessary with the solution from the question (though, again, it may be much more computationally expensive):
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
Which works for sublists:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> remove_extras_and_sort(my_list)
[['a'], ['b'], ['c']]
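For other unhashable element types the same trick generalizes. The following is my own sketch (not part of the original answer): a list of dicts can be deduplicated by mapping each dict to a hashable canonical form, such as a sorted tuple of its items:

```python
my_list = [{'x': 1}, {'x': 1}, {'y': 2}]

# Map each dict to a hashable canonical key: a sorted tuple of its items.
seen = set()
unique = []
for d in my_list:
    key = tuple(sorted(d.items()))
    if key not in seen:
        seen.add(key)
        unique.append(d)

# Sort by the same canonical key, since dicts themselves aren't orderable.
unique.sort(key=lambda d: tuple(sorted(d.items())))
print(unique)  # [{'x': 1}, {'y': 2}]
```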
import timeit

setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
'''

timeit.timeit('sorted_set(my_list[:])', setup=setup)
timeit.timeit('remove_extras_and_sort(my_list[:])', setup=setup)
Which returns times as I measure them on my system, respectively:
1.5562372207641602
4.558010101318359
Which means the method given in the question takes roughly three times as long to compute, even accounting for the overhead of copying the list on each run (if we didn't copy, we would just be re-sorting an already-deduplicated list, since the setup only runs once).
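The gap also widens as the input grows: `count` and `remove` each scan the list, so the question's approach is quadratic overall, while `sorted(set(...))` is a single hash pass plus an O(n log n) sort. As a quick sanity check (my own sketch, reusing the function definitions from above), both approaches still agree on a larger random input:

```python
import random

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))

data = [random.randint(0, 50) for _ in range(1000)]
result_slow = remove_extras_and_sort(data[:])  # quadratic: repeated linear scans
result_fast = sorted_set(data)                 # hash pass + O(n log n) sort
assert result_slow == result_fast
```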
import dis

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
And just by looking at the output, we see that the bytecode for the first function is more than six times as long:
>>> dis.dis(remove_extras_and_sort)
2 0 SETUP_LOOP 85 (to 88)
3 LOAD_FAST 0 (my_list)
6 SLICE+0
7 GET_ITER
>> 8 FOR_ITER 76 (to 87)
11 STORE_FAST 1 (item)
3 14 LOAD_FAST 0 (my_list)
17 LOAD_ATTR 0 (count)
20 LOAD_FAST 1 (item)
23 CALL_FUNCTION 1
26 STORE_FAST 2 (count)
4 29 LOAD_FAST 2 (count)
32 LOAD_CONST 1 (1)
35 COMPARE_OP 4 (>)
38 POP_JUMP_IF_FALSE 8
5 41 SETUP_LOOP 40 (to 84)
44 LOAD_GLOBAL 1 (range)
47 LOAD_FAST 2 (count)
50 LOAD_CONST 1 (1)
53 BINARY_SUBTRACT
54 CALL_FUNCTION 1
57 GET_ITER
>> 58 FOR_ITER 19 (to 80)
61 STORE_FAST 3 (_)
6 64 LOAD_FAST 0 (my_list)
67 LOAD_ATTR 2 (remove)
70 LOAD_FAST 1 (item)
73 CALL_FUNCTION 1
76 POP_TOP
77 JUMP_ABSOLUTE 58
>> 80 POP_BLOCK
81 JUMP_ABSOLUTE 8
>> 84 JUMP_ABSOLUTE 8
>> 87 POP_BLOCK
7 >> 88 LOAD_FAST 0 (my_list)
91 LOAD_ATTR 3 (sort)
94 CALL_FUNCTION 0
97 POP_TOP
8 98 LOAD_FAST 0 (my_list)
101 RETURN_VALUE
And the recommended way has much shorter bytecode:
>>> dis.dis(sorted_set)
2 0 LOAD_GLOBAL 0 (sorted)
3 LOAD_GLOBAL 1 (set)
6 LOAD_FAST 0 (a_list)
9 CALL_FUNCTION 1
12 CALL_FUNCTION 1
15 RETURN_VALUE
So we see that using the builtin functionality of Python is much more effective and efficient than trying to reinvent the wheel.
Addendum: other options that need to be more fully explored:
def groupby_sorted(my_list):
    """If items in my_list are unhashable."""
    from itertools import groupby
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """Elements in the argument must be hashable - preserves order encountered."""
    from collections import OrderedDict
    return list(OrderedDict.fromkeys(my_list))
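A quick demonstration of both (my own usage sketch; note that `groupby` only merges adjacent duplicates, which is why the sort comes first, and that since Python 3.7 a plain `dict` preserves insertion order as well):

```python
from collections import OrderedDict
from itertools import groupby

def groupby_sorted(my_list):
    """If items in my_list are unhashable: sort so duplicates become adjacent."""
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """Elements must be hashable - preserves the order first encountered."""
    return list(OrderedDict.fromkeys(my_list))

print(groupby_sorted([['a'], ['a'], ['b']]))        # [['a'], ['b']]
print(preserve_order_encountered(['b', 'a', 'b']))  # ['b', 'a']
```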
Upvotes: 16
Reputation: 348
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
b = []
for x in my_list:
    try:
        b.index(x)  # raises ValueError if x is not in b yet
    except ValueError:
        b.append(x)
b.sort()
Output:
['a', 'b', 'c', 'd', 'e', 'f']
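A more idiomatic variant of the same idea (my rewrite, not the answerer's code) uses a membership test instead of catching the exception from `index`; it is still a linear scan per element, so it stays quadratic overall:

```python
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

b = []
for x in my_list:
    if x not in b:  # same linear scan as index(), but reads more clearly
        b.append(x)
b.sort()
print(b)  # ['a', 'b', 'c', 'd', 'e', 'f']
```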
Upvotes: -1
Reputation: 157334
Placing the items into a set and then sorting is going to be efficient, but it does rely on the items being hashable:
def sorted_set(a_list):
    return sorted(set(a_list))
timeit sorted_set(my_list)
100000 loops, best of 3: 3.19 µs per loop
The classic way to get a sorted list of unique elements is first to sort, then to perform a second pass over the list, eliminating identical elements (which are guaranteed to be adjacent after the sort):
def sorted_unique(a_list):
    l = sorted(a_list)
    return l[:1] + [b for a, b in zip(l, l[1:]) if a != b]
This is not too bad compared to using set:
timeit sorted_unique(my_list)
100000 loops, best of 3: 6.6 µs per loop
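The same two-pass idea can also be written as an explicit loop (my sketch of the technique described above), which may be easier to follow: after sorting, an element is kept only when it differs from the last one kept:

```python
def sorted_unique_loop(a_list):
    result = []
    for item in sorted(a_list):
        # Duplicates are adjacent after sorting, so compare with the last kept item.
        if not result or item != result[-1]:
            result.append(item)
    return result

print(sorted_unique_loop(['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']))
# ['a', 'b', 'c', 'd', 'e', 'f']
```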
We can actually do better using itertools.groupby:
from itertools import groupby

def sorted_group(a_list):
    return [k for k, _ in groupby(sorted(a_list))]
timeit sorted_group(my_list)
100000 loops, best of 3: 5.3 µs per loop
Finally, if the items are primitive values it's worth considering numpy; in this case (on a small list) the overheads outweigh any benefit, but it performs well on larger problem sets:
import numpy as np

def sorted_np(a_list):
    return np.unique(np.sort(a_list))
timeit sorted_np(my_list)
10000 loops, best of 3: 42 µs per loop
my_list = [random.randint(0, 10**6) for _ in range(10**6)]
timeit sorted_set(my_list)
1 loops, best of 3: 454 ms per loop
timeit sorted_np(my_list)
1 loops, best of 3: 333 ms per loop
Upvotes: 2
Reputation: 1324
It takes just two simple built-in functions in Python:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
print(sorted(set(my_list)))
and you get what you want ;)
If you want more info regarding sets, look here, and for sorting in Python have a look here.
Hope this helps.
Upvotes: 1