askewchan

Reputation: 46530

Better/Faster to Loop through set or list?

If I have a Python list that has many duplicates, and I want to iterate through each item, but not through the duplicates, is it best to use a set (as in set(mylist)), or to find another way to create a list without duplicates? I was thinking of just looping through the list and checking for duplicates, but I figured that's what set() does when it's initialized.

So if mylist = [3,1,5,2,4,4,1,4,2,5,1,3] and I really just want to loop through [1,2,3,4,5] (order doesn't matter), should I use set(mylist) or something else?

An alternative is possible in the last example: since the list contains every integer between its min and max value, I could loop through range(min(mylist),max(mylist)) or through set(mylist). Should I generally try to avoid using set in this case? Also, would finding the min and max be slower than just creating the set?


In the case in the last example, the set is faster:

from numpy.random import randint
# random_integers is deprecated; randint(1, 1001) draws from the same inclusive range 1..1000
ids = randint(1, 1001, size=10**6)

def set_loop(mylist):
    idlist = []
    for id in set(mylist):
        idlist.append(id)
    return idlist

def list_loop(mylist):
    idlist = []
    for id in range(min(mylist),max(mylist)):
        idlist.append(id)
    return idlist

%timeit set_loop(ids)
#1 loops, best of 3: 232 ms per loop

%timeit list_loop(ids)
#1 loops, best of 3: 408 ms per loop

Upvotes: 40

Views: 106128

Answers (6)

Scientifik

Reputation: 16

Go Back To Where The List Is Defined

If the list is part of a larger codebase and needs to stay as it is, then all these responses are right: you simply have to convert the list into a set with either the set() function or a set comprehension, {item for item in list}. However, if this is YOUR codebase and you know that the list is only going to be used in THIS context, you can go back to where the list is defined and define it as a set instead. You then have a smaller dataset to begin with, which is faster than taking a list, iterating through it, and creating a set out of it.

Scope Matters

The other thing to consider is whether the dataset can be defined as a set comprehension, as described above, and, if not, where in the code the dataset needs to be a list and where it needs to be a set.

For example, if you call set() on the list inside a loop, then on every iteration the list is iterated through and a new set is created. If you instead create the set outside the loop, you iterate through the list only once and create one set.
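A minimal sketch of the point about scope, assuming a hypothetical list `mylist` and hypothetical lookup values `queries` (neither function name comes from the original post). Both functions return the same count, but the first walks the whole list again on every iteration, while the second builds the set once:

```python
mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]
queries = [1, 6, 4]  # hypothetical values to look up


def count_hits_rebuilding(items, wanted):
    hits = 0
    for q in wanted:
        # set(items) iterates over the entire list on every pass
        if q in set(items):
            hits += 1
    return hits


def count_hits_hoisted(items, wanted):
    unique = set(items)  # built once, outside the loop
    hits = 0
    for q in wanted:
        if q in unique:
            hits += 1
    return hits


print(count_hits_rebuilding(mylist, queries), count_hits_hoisted(mylist, queries))
```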

Set Comprehension vs set()

set() - Takes an iterable argument, then returns a set version of that argument without mutating the original. This means you end up with two datasets instead of one: the iterable that was passed to set() and the new set that it returns. This is useful if you need to convert a predefined iterable to a set to get rid of duplicate values for your use but leave the original dataset for others to use as is.

{item for item in thing} - The set comprehension creates a set, then executes a for loop filling that set with items from that loop. This is useful if you want to define a set from scratch, especially if you need to modify each item in the loop before adding it to the set, check a conditional before adding the item to the set, or both. For example: cube_even_vals = {num**3 for num in thing if num % 2 == 0} loops through an iterable called thing and, if the item is even, cubes it and adds the result to the set.
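The two forms can be compared side by side in a short sketch. The sample data in `thing` is an assumption (it reuses the list from the question); the names `unique_vals` and `same_vals` are hypothetical:

```python
thing = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# Plain conversion: the original list is left untouched.
unique_vals = set(thing)

# Equivalent comprehension with no transformation or filter.
same_vals = {item for item in thing}

# Comprehension that filters (even numbers only) and transforms (cubes).
cube_even_vals = {num**3 for num in thing if num % 2 == 0}

print(unique_vals == same_vals)   # True
print(sorted(cube_even_vals))     # [8, 64]
print(len(thing))                 # 12, the list still has its duplicates
```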

Upvotes: 0

hamx0r

Reputation: 4278

While a set may be what you want structure-wise, the question is what is faster. A list is faster. Your example code doesn't accurately compare set vs list because you're converting from a list to a set in set_loop, and then you're creating the list you'll be looping through in list_loop. The set and list you iterate through should be constructed and in memory ahead of time, and simply looped through to see which data structure is faster at iterating:

ids_list = list(range(1000000))  # list() so this is a real list on Python 3 as well
ids_set = set(ids_list)
def f(x):
    for i in x:
        pass

%timeit f(ids_set)
#1 loops, best of 3: 214 ms per loop
%timeit f(ids_list)
#1 loops, best of 3: 176 ms per loop
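Outside IPython, the same kind of comparison can be sketched with the standard-library `timeit` module. The sizes and repeat count below are arbitrary assumptions, and the absolute numbers depend on the machine and Python version, so no output is shown:

```python
import timeit

# Build both containers ahead of time so only iteration is measured.
ids_list = list(range(100_000))
ids_set = set(ids_list)


def iterate(collection):
    for _ in collection:
        pass


list_time = timeit.timeit(lambda: iterate(ids_list), number=20)
set_time = timeit.timeit(lambda: iterate(ids_set), number=20)

print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```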

Upvotes: 8

Charif DZ

Reputation: 14721

If the list is very large, looping over it twice will take a lot of time, and the second loop runs over a set rather than a list; as we know, iterating over a set is slower than iterating over a list.

I think you need the power of a generator and a set.

def first_test():

    def loop_one_time(my_list):
        # Create a set to keep the items already yielded.
        iterated_items = set()
        # As we know, iterating over a list is faster than iterating over a set.
        for value in my_list:
            # Checking whether an element exists in a set is very fast,
            # no matter the size of the set.
            if value not in iterated_items:
                iterated_items.add(value)  # add this item to the set
                yield value
                yield value


    mylist = [3,1,5,2,4,4,1,4,2,5,1,3]

    for v in loop_one_time(mylist):pass



def second_test():
    mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
    s = set(mylist)
    for v in s:pass


import timeit

print(timeit.timeit('first_test()', setup='from __main__ import first_test', number=10000))
print(timeit.timeit('second_test()', setup='from __main__ import second_test', number=10000))

Output:

   0.024003583388435043
   0.010424674188938422

Note: with this technique, the order of the original list is preserved.
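A small sketch of the ordering behaviour, using a standalone version of the generator under the hypothetical name `unique_in_order`. As an aside that is not part of the answer, `dict.fromkeys` gives the same first-seen order on Python 3.7+, where dicts keep insertion order:

```python
def unique_in_order(items):
    seen = set()
    for value in items:
        if value not in seen:
            seen.add(value)
            yield value


mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

print(list(unique_in_order(mylist)))  # [3, 1, 5, 2, 4]
print(list(dict.fromkeys(mylist)))    # [3, 1, 5, 2, 4]
```

By contrast, iterating over set(mylist) makes no promise about the order in which the items come out.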

Upvotes: 2

John La Rooy

Reputation: 304137

set is what you want, so you should use set. Trying to be clever introduces subtle bugs like forgetting to add one to max(mylist)! Code defensively. Worry about what's faster when you determine that it is too slow.

range(min(mylist), max(mylist) + 1)  # <-- don't forget to add 1
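To make the off-by-one concrete, here is a short sketch using the list from the question. The second list, `gappy`, is a hypothetical example added to show that the range approach also breaks when some integers between min and max are missing:

```python
mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# range() excludes its stop value, so the largest item is silently dropped.
without_plus_one = list(range(min(mylist), max(mylist)))
with_plus_one = list(range(min(mylist), max(mylist) + 1))

print(without_plus_one)  # [1, 2, 3, 4]
print(with_plus_one)     # [1, 2, 3, 4, 5]

# Hypothetical list with gaps: the range visits values that are not in the list.
gappy = [1, 2, 10]
print(len(range(min(gappy), max(gappy) + 1)))  # 10
print(len(set(gappy)))                         # 3
```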

Upvotes: 10

GordonsBeard

Reputation: 646

For simplicity's sake: newList = list(set(oldList))

But there are better options out there if you'd like to get speed/ordering/optimization instead: http://www.peterbe.com/plog/uniqifiers-benchmark

Upvotes: 6

Eevee

Reputation: 48536

Just use a set. Its semantics are exactly what you want: a collection of unique items.

Technically you'll be iterating through the list twice: once to create the set, once for your actual loop. But you'd be doing just as much work or more with any other approach.

Upvotes: 39
