Reputation: 394955
How can I most efficiently remove duplicate items from a list and sort it in Python?
Say I have a list:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
I could iterate over a copy of the list (since you should not mutate a list while iterating over it), item by item, and remove all occurrences of each item except one:
for item in my_list[:]:  # must iterate over a copy because we mutate the original
    count = my_list.count(item)  # see how many are in the list
    if count > 1:
        for _ in range(count - 1):  # remove all but one occurrence
            my_list.remove(item)
Which removes the redundant items:
['b', 'c', 'a', 'd', 'f', 'e']
and then sort the list:
my_list.sort()
so my_list is now:
['a', 'b', 'c', 'd', 'e', 'f']
But what's the most efficient and direct (i.e. performant) way to remove the duplicate elements and sort this list?
*This question came up at work (I wanted badly to answer it, but one of our most senior Python developers got to it before me). I also brought it up at my local Python meetup group, and few people had a good answer for it, so I'm answering it Q&A style, as suggested by Stack Overflow.
Upvotes: 6
Views: 1031
Reputation: 394955
The best way to remove duplicate elements from a list is to convert it to a set, and since sorted accepts any iterable and returns a list, this is far more efficient than doing it piecewise:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def sorted_set(a_list):
    return sorted(set(a_list))

new_list = sorted_set(my_list)
and new_list is:
['a', 'b', 'c', 'd', 'e', 'f']
The downside of this approach is that elements given to set must be hashable, so if the elements are unhashable, you'll get an error:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> sorted(set(my_list))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
This trivial case can be addressed by converting the sublists to tuples; that may well be faster than the piecewise solution from the question, which performs repeated (and potentially expensive) equality tests:
>>> my_list = [tuple(i) for i in my_list]
>>> sorted(set(my_list))
[('a',), ('b',), ('c',)]
But other cases would need different workarounds. That would not be necessary with the solution from the question (though, again, it may be much more computationally expensive):
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
Which works for sublists:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> remove_extras_and_sort(my_list)
[['a'], ['b'], ['c']]
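For other unhashable element types the same trick generalizes. The following is my own sketch (not part of the original answer): a list of dicts can be deduplicated by mapping each dict to a hashable canonical form, such as a sorted tuple of its items:

```python
my_list = [{'x': 1}, {'x': 1}, {'y': 2}]

# Map each dict to a hashable canonical key: a sorted tuple of its items.
seen = set()
unique = []
for d in my_list:
    key = tuple(sorted(d.items()))
    if key not in seen:
        seen.add(key)
        unique.append(d)

# Sort by the same canonical key, since dicts themselves aren't orderable.
unique.sort(key=lambda d: tuple(sorted(d.items())))
print(unique)  # [{'x': 1}, {'y': 2}]
```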
import timeit

setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
'''

timeit.timeit('sorted_set(my_list[:])', setup=setup)
timeit.timeit('remove_extras_and_sort(my_list[:])', setup=setup)
Which returns times as I measure them on my system, respectively:
1.5562372207641602
4.558010101318359
Which means the method given in the question takes roughly three times as long to compute, even accounting for the overhead of copying the list on each run (if we didn't copy, we would just be re-sorting an already-deduplicated list, since the setup only runs once).
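The gap also widens as the input grows: `count` and `remove` each scan the list, so the question's approach is quadratic overall, while `sorted(set(...))` is a single hash pass plus an O(n log n) sort. As a quick sanity check (my own sketch, reusing the function definitions from above), both approaches still agree on a larger random input:

```python
import random

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))

data = [random.randint(0, 50) for _ in range(1000)]
result_slow = remove_extras_and_sort(data[:])  # quadratic: repeated linear scans
result_fast = sorted_set(data)                 # hash pass + O(n log n) sort
assert result_slow == result_fast
```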
import dis

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
And just by looking at the output, we see that the bytecode for the first function is more than six times as long:
>>> dis.dis(remove_extras_and_sort)
2 0 SETUP_LOOP 85 (to 88)
3 LOAD_FAST 0 (my_list)
6 SLICE+0
7 GET_ITER
>> 8 FOR_ITER 76 (to 87)
11 STORE_FAST 1 (item)
3 14 LOAD_FAST 0 (my_list)
17 LOAD_ATTR 0 (count)
20 LOAD_FAST 1 (item)
23 CALL_FUNCTION 1
26 STORE_FAST 2 (count)
4 29 LOAD_FAST 2 (count)
32 LOAD_CONST 1 (1)
35 COMPARE_OP 4 (>)
38 POP_JUMP_IF_FALSE 8
5 41 SETUP_LOOP 40 (to 84)
44 LOAD_GLOBAL 1 (range)
47 LOAD_FAST 2 (count)
50 LOAD_CONST 1 (1)
53 BINARY_SUBTRACT
54 CALL_FUNCTION 1
57 GET_ITER
>> 58 FOR_ITER 19 (to 80)
61 STORE_FAST 3 (_)
6 64 LOAD_FAST 0 (my_list)
67 LOAD_ATTR 2 (remove)
70 LOAD_FAST 1 (item)
73 CALL_FUNCTION 1
76 POP_TOP
77 JUMP_ABSOLUTE 58
>> 80 POP_BLOCK
81 JUMP_ABSOLUTE 8
>> 84 JUMP_ABSOLUTE 8
>> 87 POP_BLOCK
7 >> 88 LOAD_FAST 0 (my_list)
91 LOAD_ATTR 3 (sort)
94 CALL_FUNCTION 0
97 POP_TOP
8 98 LOAD_FAST 0 (my_list)
101 RETURN_VALUE
And the recommended way has much shorter bytecode:
>>> dis.dis(sorted_set)
2 0 LOAD_GLOBAL 0 (sorted)
3 LOAD_GLOBAL 1 (set)
6 LOAD_FAST 0 (a_list)
9 CALL_FUNCTION 1
12 CALL_FUNCTION 1
15 RETURN_VALUE
So we see that using the builtin functionality of Python is much more effective and efficient than trying to reinvent the wheel.
Addendum: other options that need to be more fully explored:
def groupby_sorted(my_list):
    """If items in my_list are unhashable."""
    from itertools import groupby
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """Elements in the argument must be hashable - preserves order encountered."""
    from collections import OrderedDict
    return list(OrderedDict.fromkeys(my_list))
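A quick demonstration of both (my own usage sketch; note that `groupby` only merges adjacent duplicates, which is why the sort comes first, and that since Python 3.7 a plain `dict` preserves insertion order as well):

```python
from collections import OrderedDict
from itertools import groupby

def groupby_sorted(my_list):
    """If items in my_list are unhashable: sort so duplicates become adjacent."""
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """Elements must be hashable - preserves the order first encountered."""
    return list(OrderedDict.fromkeys(my_list))

print(groupby_sorted([['a'], ['a'], ['b']]))        # [['a'], ['b']]
print(preserve_order_encountered(['b', 'a', 'b']))  # ['b', 'a']
```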
Upvotes: 16
Reputation: 348
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
b = []
for x in my_list:
    try:
        b.index(x)  # raises ValueError if x is not in b yet
    except ValueError:
        b.append(x)
b.sort()
Output:
['a', 'b', 'c', 'd', 'e', 'f']
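A more idiomatic variant of the same idea (my rewrite, not the answerer's code) uses a membership test instead of catching the exception from `index`; it is still a linear scan per element, so it stays quadratic overall:

```python
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

b = []
for x in my_list:
    if x not in b:  # same linear scan as index(), but reads more clearly
        b.append(x)
b.sort()
print(b)  # ['a', 'b', 'c', 'd', 'e', 'f']
```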
Upvotes: -1
Reputation: 157334
Placing the items into a set and then sorting is going to be efficient, but it does rely on the items being hashable:
def sorted_set(a_list):
    return sorted(set(a_list))
timeit sorted_set(my_list)
100000 loops, best of 3: 3.19 µs per loop
The classic way to get a sorted list of unique elements is first to sort, then to perform a second pass over the list, eliminating identical elements (which are guaranteed to be adjacent after the sort):
def sorted_unique(a_list):
    l = sorted(a_list)
    return l[:1] + [b for a, b in zip(l, l[1:]) if a != b]
This is not too bad compared to using set:
timeit sorted_unique(my_list)
100000 loops, best of 3: 6.6 µs per loop
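The same two-pass idea can also be written as an explicit loop (my sketch of the technique described above), which may be easier to follow: after sorting, an element is kept only when it differs from the last one kept:

```python
def sorted_unique_loop(a_list):
    result = []
    for item in sorted(a_list):
        # Duplicates are adjacent after sorting, so compare with the last kept item.
        if not result or item != result[-1]:
            result.append(item)
    return result

print(sorted_unique_loop(['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']))
# ['a', 'b', 'c', 'd', 'e', 'f']
```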
We can actually do better using itertools.groupby:
from itertools import groupby

def sorted_group(a_list):
    return [k for k, _ in groupby(sorted(a_list))]
timeit sorted_group(my_list)
100000 loops, best of 3: 5.3 µs per loop
Finally, if the items are primitive values it's worth considering numpy; in this case (on a small list) the overheads outweigh any benefit, but it performs well on larger problem sets:
import numpy as np

def sorted_np(a_list):
    return np.unique(np.sort(a_list))
timeit sorted_np(my_list)
10000 loops, best of 3: 42 µs per loop
my_list = [random.randint(0, 10**6) for _ in range(10**6)]
timeit sorted_set(my_list)
1 loops, best of 3: 454 ms per loop
timeit sorted_np(my_list)
1 loops, best of 3: 333 ms per loop
Upvotes: 2
Reputation: 1324
It takes just two simple built-in functions in Python:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
print(sorted(set(my_list)))
and you get what you want ;)
If you want more info regarding sets, look here, and for sorting in Python have a look here.
Hope this helps.
Upvotes: 1