Joost

Reputation: 4134

Why would anyone check 'x in list'?

In Python one can very easily check whether a value is contained in a container by using the in operator. I was wondering, though, why anyone would ever use the in operator on a list when it seems much more efficient to first transform the list to a set. That is, why write

if x in [1,2,3]:

as opposed to

if x in set([1,2,3]):

Looking at the time complexity, the first one is O(n) while the second one appears superior at O(1). Is the only reason to use the first one that it's more readable and shorter to write? Or is there a special case in which it's more practical? Why did the Python devs not implement the first one by translating it to the second one internally? Would that not grant both of them O(1) complexity?

Upvotes: 4

Views: 452

Answers (6)

David Robinson

Reputation: 78610

if x in set([1,2,3]):

is not faster than

if x in [1,2,3]:

Converting a list to a set requires iterating over the list, and is thus at least O(n) time.* In practice it takes a lot longer than searching for an item, since it involves hashing and then inserting every item.

Using a set is efficient when the set is converted once and then checked multiple times. Indeed, trying this by searching for 500 in the list range(1000) indicates that the break-even point comes once you are checking at least 3 times:

import timeit

def time_list(x, lst, num):
    # test membership directly on the list, num times
    for n in xrange(num):
        x in lst

def time_turn_set(x, lst, num):
    # convert the list to a set once, then test membership num times
    s = set(lst)
    for n in xrange(num):
        x in s

for num in range(1, 10):
    size = 1000
    setup_str = "lst = range(%d); from __main__ import %s"
    print num,
    print timeit.timeit("time_list(%d, lst, %d)" % (size / 2, num),
                        setup=setup_str % (size, "time_list"), number=10000),
    print timeit.timeit("time_turn_set(%d, lst, %d)" % (size / 2, num),
                        setup=setup_str % (size, "time_turn_set"), number=10000)

gives me:

1 0.124024152756 0.334127902985
2 0.250166893005 0.343378067017
3 0.359009981155 0.356444835663
4 0.464100837708 0.38081407547
5 0.600295066833 0.34722495079
6 0.692923069 0.358560085297
7 0.787877082825 0.338326931
8 0.877299070358 0.344762086868
9 1.00078821182 0.339591026306

Tests with list sizes ranging from 500 to 50000 give roughly the same result.

* Strictly speaking, in the true asymptotic sense inserting into a hash table (and, for that matter, checking for a value) is not O(1): if the table gets large enough, collisions build up, so it is really O(n) time with a very small constant factor. That would make the set([1,2,3]) operation O(n^2) rather than O(n). However, in practice, with reasonably sized lists and a good implementation, you can basically always treat hash-table insertion and lookup as O(1) operations.

Upvotes: 17

pyrrrat

Reputation: 359

Let's just try it…

import cProfile

We choose a big enough test range so we can actually measure something. 2**13 is just an arbitrary value.

test = range(2**13)
runs = len(test)
wcn  = runs - 1 # worst case n

The number of test runs equals the number of values in the list, so we get a nice average in the end. wcn is the worst case: it's the last entry in the list, so it's the last one a linear scan will check.

def simple():
    # test membership directly on the list, averaged over every value
    for n in range(runs):
        if n in test:
            pass

def simpleWorstCase():
    # always search for the last entry, so every scan traverses the whole list
    for n in range(runs):
        if wcn in test:
            pass

def slow():
    # rebuild the set on every iteration before testing membership
    for n in range(runs):
        if n in set(test):
            pass

Result for our simple test:

cProfile.run('simple()')
"""
         4 function calls in 0.794 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.794    0.794 <string>:1(<module>)
        1    0.794    0.794    0.794    0.794 profile.py:6(simple)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}
"""

Result for our simple worst case test:

cProfile.run('simpleWorstCase()')
"""
         4 function calls in 1.462 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.462    1.462 <string>:1(<module>)
        1    1.462    1.462    1.462    1.462 profile.py:12(simpleWorstCase)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}
"""

Result for the test where we convert to a set first:

cProfile.run('slow()')
"""
         4 function calls in 2.227 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.227    2.227 <string>:1(<module>)
        1    2.227    2.227    2.227    2.227 profile.py:11(slow)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}
"""

Upvotes: 0

NPE

Reputation: 500635

Let's test your assumptions:

In [19]: %timeit 1 in [1, 2, 3]
10000000 loops, best of 3: 52.3 ns per loop

In [20]: %timeit 4 in [1, 2, 3]
10000000 loops, best of 3: 118 ns per loop

In [21]: %timeit 1 in set([1, 2, 3])
1000000 loops, best of 3: 552 ns per loop

In [22]: %timeit 4 in set([1, 2, 3])
1000000 loops, best of 3: 558 ns per loop

Thus, in your exact example, using set() is anywhere between 5 and 10 times slower than using the list.

Just creating the set takes 517 ns:

In [23]: %timeit set([1, 2, 3])
1000000 loops, best of 3: 517 ns per loop

Let's factor the creation of the set out of the check:

In [26]: s = set([1, 2, 3])

In [27]: %timeit 1 in s
10000000 loops, best of 3: 72.5 ns per loop

In [28]: %timeit 4 in s
10000000 loops, best of 3: 71.4 ns per loop

This makes the performance difference less clear-cut. Now the relative performance of list and set depends on the exact values presented to in. If they are present in the list and are close to the beginning of the list, then list probably wins. Otherwise, set probably wins.

Of course, if the right-hand side of in were larger, the conclusions would be very different.
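
For instance (a quick sketch, not from the original answer; the 10,000-element container is a hypothetical stand-in for [1, 2, 3]), a pre-built set wins decisively once the right-hand side grows:

import timeit

setup = "lst = range(10000); s = set(lst)"
print timeit.timeit("9999 in lst", setup=setup, number=10000)  # linear scan of the list
print timeit.timeit("9999 in s", setup=setup, number=10000)    # single hash lookup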

Bottom line:

  1. Don't optimize prematurely.
  2. Always profile on realistic inputs before optimizing.

Upvotes: 3

Zarkonnen

Reputation: 22478

If you want to do micro-optimisations, you must measure:

l.py:
for x in range(1000000):
    3 in [1, 2, 3]

s.py:
for x in range(1000000):
    3 in set([1, 2, 3])

~/py $ time python l.py

real    0m0.314s
user    0m0.275s
sys 0m0.030s

~/py $ time python s.py

real    0m1.055s
user    0m1.006s
sys 0m0.029s
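
For comparison (a hedged addition, not part of the original answer), hoisting the set construction out of the loop, say in a hypothetical third file s2.py, changes the picture:

s2.py:
s = set([1, 2, 3])        # build the set once
for x in range(1000000):
    3 in s                # each check is now a hash lookup

Timing that variant should land close to, or below, l.py, which is exactly why you measure the code you actually intend to run.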

Upvotes: 2

nameless

Reputation: 1979

Because transforming a list to a set requires looping through the entire list, which has the same complexity as testing whether a list contains a value.

So testing whether a value is in a set is faster only if the set is already constructed.
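
Roughly speaking (a sketch of the idea, not CPython's actual implementation), set(lst) has to do something like:

s = set()
for item in lst:    # one full O(n) pass over the list
    s.add(item)     # hash and insert each element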

Upvotes: 0

impl

Reputation: 803

In order to convert a list to a set, you need to iterate through the elements of the list, which takes O(n) time at best. I believe Python sets are backed by hash maps, which means that they actually have a worst-case time complexity of O(n) for lookup operations as well.

The Python wiki seems to agree with this.
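
As an illustration (a contrived sketch, not from the original answer), you can force that worst case by making every element hash to the same bucket:

class BadHash(object):
    # every instance hashes to the same value, so all inserts collide
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 1
    def __eq__(self, other):
        return self.v == other.v

s = set(BadHash(i) for i in range(1000))
BadHash(999) in s   # must probe through the colliding entries: O(n)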

Upvotes: 0
