Reputation: 2563
recently i saw a question here on python here
1: Python If statement and logical operator issue. There was someone in comments who gave an answer that it can be done like this:
1 in (1, 2, 3)
to check if 1 is in present in a collection of items.but according to me this should be much faster 1 in {1, 2, 3}
. as you can see there in discussions ,someone of high reputation goes on saying that ( )
is faster for fixed size input. and has faster lookup than { }
. i am asking it here because i want to know for my own understanding which one is correct and also i do not get the idea of ( )
being fiexd-size
or variable size
. i just asked for a refrence in that original question so i can correct myself if i am wrong but the user insits upon clearing my basics of computer science knowledge without givin g a single refrence to his argument that lookup in Tuple is O(1)
.so i am asking it here.
Upvotes: 2
Views: 3627
Reputation: 2261
The case for set has been improved in python 3.5 (perhaps earlier). Here is my test function:
def is_valid(x):
return x in ('name', 'known_as')
The code generated is:
2 0 LOAD_FAST 0 (x)
3 LOAD_CONST 3 (('name', 'known_as'))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
Changing the tuple to a set generates this code:
2 0 LOAD_FAST 0 (x) 3 LOAD_CONST 3 (frozenset({'name', 'known_as'})) 6 COMPARE_OP 6 (in) 9 RETURN_VALUE
Python 3.7 and 3.8 generate the same code for both cases. The timing results are:
$ python3.5 -m timeit -s "def is_valid(x): return x in {'name', 'known_as'}" "is_valid('')" 10000000 loops, best of 3: 0.0815 usec per loop $ python3.5 -m timeit -s "def is_valid(x): return x in ('name', 'known_as')" "is_valid('')" 10000000 loops, best of 3: 0.0997 usec per loop
Passing 'name' to is_valid for the tuple case runs in 0.0921 usec per loop, still slower than set.
Upvotes: 4
Reputation: 155036
Although the other commenter is technically correct that x in <constant expression>
is O(1) at run-time, it is still interesting to compare performance of sets and tuples of the same size. The discussion can be generalized it to consider different programs containing expressions of the form x in {a, b, c, ...}
. If the expression consists of n
items, then n
can be considered input for big-O analysis, among all possible such programs. (And if still one insisted that it be possible to provide a different n
at run-time, simply imagine that the function is created using exec
.)
The issue with performance of such expressions is that the Python run-time must construct a one-time set for use in the in
test, and then immediately discard it. This is clearly visible in the generated assembly:
>>> import dis
>>> def is_valid(x):
... return x in {1, 2, 3}
...
>>> dis.dis(is_valid)
2 0 LOAD_FAST 0 (x)
3 LOAD_CONST 1 (1)
6 LOAD_CONST 2 (2)
9 LOAD_CONST 3 (3)
12 BUILD_SET 3
15 COMPARE_OP 6 (in)
18 RETURN_VALUE
Constructing a set of n
elements obviously has a cost of at least O(n). In other words, the test using a set, implemented as a literal constant, is not O(1) because the interpreter has to construct the set. This is what the commenter was trying to get across by referring to the cost of construction.
In fact, it gets stranger than that; due to nature of the Python VM, the compiler is allowed to construct tuples consisting only of numeric literals at compile time, which it does:
>>> import dis
>>> def is_valid(x):
... return x in (1, 2, 3)
...
>>> dis.dis(is_valid)
2 0 LOAD_FAST 0 (x)
3 LOAD_CONST 4 ((1, 2, 3))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
Note how the (1, 2, 3)
constant doesn't need to be built item by item - that is because it was already built by the compiler and inserted into the function's environment. As a result, this implementation of is_valid
might actually be faster than the one using a set! This is easy to test:
$ python -m timeit -s 'def is_valid(x): return x in {1, 2, 3}' 'is_valid(-1)'
10000000 loops, best of 3: 0.189 usec per loop
$ python -m timeit -s 'def is_valid(x): return x in (1, 2, 3)' 'is_valid(-1)'
10000000 loops, best of 3: 0.128 usec per loop
Again, the other commenter was right.
Increasing the size of the set/tuple doesn't tip the balance in favor of set - it is always more expensive to construct a set of n
items and then perform a blindingly-fast constant-time search, than to just iterate over a pre-created tuple looking for an item. This is because set creation has to allocate the set (possibly several times) and calculate hashes of all the items. Although both tuple search and set size are O(n), the set's one has a larger constant factor.
The correct way to implement an O(1) lookup would entail manually implementing the optimization that the compiler does automatically for tuples:
_valid = {1, 2, 3}
def is_valid(x):
return x in _valid
Comparing this code with the equivalent code using a tuple, the set is always faster, even with a small number of items. As the number of items grows, set becomes the clear winner with its O(1) lookup.
Upvotes: 3
Reputation: 280963
When you say something like O(n)
, you have to say what n
is. Here, n
is the length of the tuple... but the tuple is not an input. You're not taking the tuple as an argument or anything. n
is always 2
in the conversation you linked, or 3
for your example tuple, so for this particular n
, O(n)
is the same thing as O(2)
, or O(1)
.
As you may have noticed by now, it doesn't make much sense to talk about O(n)
when n
is a constant. If you had a function like
def in_(element, tup):
return element in tup
you could say that the runtime is O(n)
element comparisons, where n
is len(tup)
, but for something like
usr in ('Y', 'y')
talking about n
isn't very useful.
Upvotes: 6