Reputation: 4134
In Python one can very easily check whether a value is contained in a container by using the in operator. I was wondering why anyone would ever use the in operator on a list, though, when it seems much more efficient to first transform the list to a set. That is, instead of
if x in [1,2,3]:
one could write
if x in set([1,2,3]):
Looking at the time complexity, the first one is O(n) while the second one is a superior O(1). Is the only reason to use the first one that it's more readable and shorter to write? Or is there a special case in which it's more practical to use? Why did the Python devs not implement the first one by first translating it to the second one? Would that not grant both of them O(1) complexity?
Upvotes: 4
Views: 452
Reputation: 78610
if x in set([1,2,3]):
is not faster than
if x in [1,2,3]:
Converting a list to a set requires iterating over the list, and is thus at least O(n)
time.* In practice it takes a lot longer than searching for an item, since it involves hashing and then inserting every item.
Using a set is efficient when the set is converted once and then checked multiple times. Indeed, trying this by searching for 500
in the list range(1000)
indicates that the tradeoff occurs once you are checking at least 3 times:
import timeit

def time_list(x, lst, num):
    for n in xrange(num):
        x in lst

def time_turn_set(x, lst, num):
    s = set(lst)
    for n in xrange(num):
        x in s

for num in range(1, 10):
    size = 1000
    setup_str = "lst = range(%d); from __main__ import %s"
    print num,
    print timeit.timeit("time_list(%d, lst, %d)" % (size / 2, num),
                        setup=setup_str % (size, "time_list"), number=10000),
    print timeit.timeit("time_turn_set(%d, lst, %d)" % (size / 2, num),
                        setup=setup_str % (size, "time_turn_set"), number=10000)
gives me:
1 0.124024152756 0.334127902985
2 0.250166893005 0.343378067017
3 0.359009981155 0.356444835663
4 0.464100837708 0.38081407547
5 0.600295066833 0.34722495079
6 0.692923069 0.358560085297
7 0.787877082825 0.338326931
8 0.877299070358 0.344762086868
9 1.00078821182 0.339591026306
Tests with list sizes ranging from 500 to 50000 give roughly the same result.
* Indeed, in the true asymptotic sense, inserting into a hash table (and, for that matter, looking up a value) is not O(1)
time but O(n) in the worst case, since collisions can build up and degrade a bucket into a linear scan. That would make the set([1,2,3])
conversion O(n^2)
in the worst case rather than O(n)
. However, in practice, with reasonably sized lists and a good implementation, you can basically always treat hash-table insertion and lookup as O(1)
operations.
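To illustrate the convert-once, check-many pattern this answer describes, here is a minimal sketch; the values are illustrative, not taken from the benchmark above:

```python
# Sketch: pay the O(n) set conversion once, then do many cheap checks.
allowed = set(range(1000))      # one O(n) conversion

hits = 0
for candidate in (3, 500, 2500):
    if candidate in allowed:    # average O(1) hash lookup per check
        hits += 1

print(hits)                     # 3 and 500 are in the set, 2500 is not
```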
Upvotes: 17
Reputation: 359
Let's just try it…
import cProfile
We choose a big enough test range so we can actually measure something. 2**13
is just some random value.
test = range(2**13)
runs = len(test)
wcn = runs - 1 # worst case n
The amount of test runs is equal to the amount of numbers in the list so we can have a nice average value in the end.
wcn
is the worst case, because it's the last entry in the list, so it's the last entry the algorithm will check.
def simple():
    for n in range(runs):
        if n in test:
            pass

def simpleWorstCase():
    for n in range(runs):
        if wcn in test:
            pass

def slow():
    for n in range(runs):
        if n in set(test):
            pass
Result for our simple test:
cProfile.run('simple()')
"""
4 function calls in 0.794 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.794 0.794 <string>:1(<module>)
1 0.794 0.794 0.794 0.794 profile.py:6(simple)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {range}
"""
Result for our simple worst case test:
cProfile.run('simpleWorstCase()')
"""
4 function calls in 1.462 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.462 1.462 <string>:1(<module>)
1 1.462 1.462 1.462 1.462 profile.py:12(simpleWorstCase)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {range}
"""
Result for the test where we convert to a set first:
cProfile.run('slow()')
"""
4 function calls in 2.227 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 2.227 2.227 <string>:1(<module>)
1 2.227 2.227 2.227 2.227 profile.py:11(slow)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {range}
"""
Upvotes: 0
Reputation: 500635
Let's test your assumptions:
In [19]: %timeit 1 in [1, 2, 3]
10000000 loops, best of 3: 52.3 ns per loop
In [20]: %timeit 4 in [1, 2, 3]
10000000 loops, best of 3: 118 ns per loop
In [21]: %timeit 1 in set([1, 2, 3])
1000000 loops, best of 3: 552 ns per loop
In [22]: %timeit 4 in set([1, 2, 3])
1000000 loops, best of 3: 558 ns per loop
Thus, in your exact example using set()
is anywhere between 5 and 10 times slower than using the list.
Just creating the set takes 517 ns:
In [23]: %timeit set([1, 2, 3])
1000000 loops, best of 3: 517 ns per loop
Let's factor the creation of the set out of the check:
In [26]: s = set([1, 2, 3])
In [27]: %timeit 1 in s
10000000 loops, best of 3: 72.5 ns per loop
In [28]: %timeit 4 in s
10000000 loops, best of 3: 71.4 ns per loop
This makes the performance difference not as clear cut. Now the relative performance of list
and set
depends on the exact values presented to in
. If they are present in the list and are close to the beginning of the list, then list
probably wins. Otherwise, set
probably wins.
Of course, if the right-hand side of in
were larger, the conclusions would be very different.
Bottom line: it depends on the size of the container and on how many lookups you perform, so measure your actual use case.
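As a hedged illustration of that last point, a small timeit sketch with a larger right-hand side and the set built once in the setup string (the sizes chosen here are arbitrary):

```python
import timeit

# With a 1000-element container and a worst-case query, a prebuilt set
# should beat the linear list scan.
t_list = timeit.timeit("999 in lst",
                       setup="lst = list(range(1000))", number=10000)
t_set = timeit.timeit("999 in s",
                      setup="s = set(range(1000))", number=10000)

print(t_list > t_set)           # expect the list scan to be slower
```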
Upvotes: 3
Reputation: 22478
If you want to do micro-optimisations, you must measure:
l.py:
for x in range(1000000):
    3 in [1, 2, 3]
s.py:
for x in range(1000000):
    3 in set([1, 2, 3])
~/py $ time python l.py
real 0m0.314s
user 0m0.275s
sys 0m0.030s
~/py $ time python s.py
real 0m1.055s
user 0m1.006s
sys 0m0.029s
Upvotes: 2
Reputation: 1979
Because transforming a list to a set requires looping through the entire list, which has the same complexity as testing whether the list contains a value.
So testing whether a value is in a set is faster only if the set is already constructed.
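A minimal sketch of that difference (the variable names are illustrative):

```python
values = [1, 2, 3, 4, 5]
queries = [3, 7, 4, 9]

# Anti-pattern: set(values) is rebuilt for every single query.
slow_hits = sum(1 for q in queries if q in set(values))

# Better: pay for the O(n) construction once, then reuse the set.
lookup = set(values)
fast_hits = sum(1 for q in queries if q in lookup)

print(slow_hits == fast_hits)   # same result, very different cost
```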
Upvotes: 0
Reputation: 803
In order to convert a list to a set, you need to iterate through the elements of the list, which takes at least O(n)
time. I believe Python sets are backed by hash tables, which means that they actually have a worst-case time complexity of O(n)
for lookup operations as well.
The Python wiki's TimeComplexity page seems to agree with this.
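That worst case can be forced with a deliberately bad hash function; a contrived sketch (the BadHash class is hypothetical, purely for illustration):

```python
class BadHash(object):
    """Hashable wrapper whose instances all collide in one bucket."""
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 0                              # every instance collides
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.v == other.v

items = set(BadHash(i) for i in range(100))   # all land in one bucket
print(BadHash(99) in items)                   # still correct, just O(n) now
```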
Upvotes: 0