Reputation: 46530
If I have a python list that is has many duplicates, and I want to iterate through each item, but not through the duplicates, is it best to use a set (as in set(mylist)
, or find another way to create a list without duplicates? I was thinking of just looping through the list and checking for duplicates but I figured that's what set()
does when it's initialized.
So if mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
and I really just want to loop through [1,2,3,4,5]
(order doesn't matter), should I use set(mylist)
or something else?
An alternative is possible in the last example, since the list contains every integer between its min and max value, I could loop through range(min(mylist),max(mylist))
or through set(mylist)
. Should I generally try to avoid using set in this case? Also, would finding the min
and max
be slower than just creating the set
?
In the case in the last example, the set
is faster:
from numpy.random import random_integers
ids = random_integers(1e3,size=1e6)
def set_loop(mylist):
idlist = []
for id in set(mylist):
idlist.append(id)
return idlist
def list_loop(mylist):
idlist = []
for id in range(min(mylist),max(mylist)):
idlist.append(id)
return idlist
%timeit set_loop(ids)
#1 loops, best of 3: 232 ms per loop
%timeit list_loop(ids)
#1 loops, best of 3: 408 ms per loop
Upvotes: 40
Views: 106128
Reputation: 16
If the list is part of a larger codebase and needs to be as is then all these responses are right you simply have to convert the list into a set with either the set()
function or a set comprehnsion { item for item in list }
. However if this is YOUR code base and you know that the list is only going to be used in THIS context you can go back to where the list is defined and define it as a set instead of a list. This means that you have a smaller dataset to begin with and would indeed be faster than taking a list, iterating through it, and then creating a set out of that.
The other thing to consider here is whether or not that the dataset can be defined as a set comprehension as described above and if not then the scope of where the dataset needs to be a list and where it needs to be a set.
If you were to for example call set()
on the list inside a loop. Then every time the loop would be called the list would be iterated through and a set would be created. If instead you created the set outside of the scope of the loop you would only iterate through the list once and create one set.
set()
- Takes an iterable argument then returns a set version of that argument while not mutating the original. This means you end up with two datasets instead of one. The iterable that was passed to set()
and the new set that is returned from calling set()
. This is useful if you need to convert a predefined iterable to a set to get rid of duplicate values for your use but leave the original dataset for others to use as is.
{item for item in thing}
- The set comprehension creates a set then executes a for loop filling that set with items from that loop. This is useful if you want to define a set from scratch. This is especially true if you need to modify each item in the loop before adding it to the set, check a conditional before adding the item to the set, or both. For example: cube_even_vals = {num**3 for num in thing if num % 2 ==0}
loops through an iterable called thing and if the item is even cubes that value, and then adds it to the set.
Upvotes: 0
Reputation: 4278
While a set
may be what you want structure-wise, the question is what is faster. A list is faster. Your example code doesn't accurately compare set
vs list
because you're converting from a list to a set in set_loop
, and then you're creating the list
you'll be looping through in list_loop
. The set and list you iterate through should be constructed and in memory ahead of time, and simply looped through to see which data structure is faster at iterating:
ids_list = range(1000000)
ids_set = set(ids)
def f(x):
for i in x:
pass
%timeit f(ids_set)
#1 loops, best of 3: 214 ms per loop
%timeit f(ids_list)
#1 loops, best of 3: 176 ms per loop
Upvotes: 8
Reputation: 14721
I the list is vary large looping two time over it will take a lot of time and more in the second time you are looping a set not a list and as we know iterating over a set is slower than list.
i think you need the power of generator
and set
.
def first_test():
def loop_one_time(my_list):
# create a set to keep the items.
iterated_items = set()
# as we know iterating over list is faster then list.
for value in my_list:
# as we know checking if element exist in set is very fast not
# metter the size of the set.
if value not in iterated_items:
iterated_items.add(value) # add this item to list
yield value
mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
for v in loop_one_time(mylist):pass
def second_test():
mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
s = set(mylist)
for v in s:pass
import timeit
print(timeit.timeit('first_test()', setup='from __main__ import first_test', number=10000))
print(timeit.timeit('second_test()', setup='from __main__ import second_test', number=10000))
out put:
0.024003583388435043
0.010424674188938422
Note: this technique order is guaranteed
Upvotes: 2
Reputation: 304137
set
is what you want, so you should use set
. Trying to be clever introduces subtle bugs like forgetting to add one tomax(mylist)
! Code defensively. Worry about what's faster when you determine that it is too slow.
range(min(mylist), max(mylist) + 1) # <-- don't forget to add 1
Upvotes: 10
Reputation: 646
For simplicity's sake: newList = list(set(oldList))
But there are better options out there if you'd like to get speed/ordering/optimization instead: http://www.peterbe.com/plog/uniqifiers-benchmark
Upvotes: 6
Reputation: 48536
Just use a set
. Its semantics are exactly what you want: a collection of unique items.
Technically you'll be iterating through the list twice: once to create the set, once for your actual loop. But you'd be doing just as much work or more with any other approach.
Upvotes: 39