Reputation: 53
I was reading up on some best practices for speed in python and found this which says:
Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n).
When testing "a in b", b should be a set or dictionary instead of a list or tuple.
But if say i have a list long_list
and I want to find out if item list_item
is in long_list
like:
list_item in long_list
Would it under any circumstance be faster to do:
list_item in Set(long_list)
Seeing as I think list to set or dict conversion on average should be O(n) in itself. (?)
Or is it always better to just go with whichever data-type I'm already working with?
Upvotes: 2
Views: 1377
Reputation: 27588
Would it under any circumstance be faster to do:
list_item in Set(long_list)
Yes, here's one where it's about 240 times faster:
from timeit import timeit
# Setup the circumstance
b = ['c' * 100000 + chr(i) for i in range(100)]
set(b)
a = b[-1]
# Measure
for _ in range(3):
print(timeit(lambda: a in b, number=1000))
print(timeit(lambda: a in set(b), number=1000))
print()
Output:
1.3284053
0.005440200000000006
1.3530345000000001
0.005345699999999898
1.3339443000000002
0.0056618999999997754
That first set(b)
during setup of the circumstance makes the strings compute and store their hashes. You could do for s in b: hash(s)
instead, I was just lazy.
CPython version used: Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32
Results on repl.it with its current Python version 3.8.1 64-bit (and I think it's CPython on Linux):
1.2134423210000023
0.0042260629998054355
1.268552630999693
0.005732013999931951
1.1268463759997758
0.003574737000235473
Update
Here's a case where the set
version is more than a million times faster for a single check, and you can increase the factor as much as you want by slowing down comparisons further:
from timeit import timeit
setup = '''
from time import sleep
class C:
def __init__(self, value):
self.value = value
def __hash__(self):
return hash(self.value)
def __eq__(self, other):
sleep(10)
return self.value == other.value
b = [C(1), C(2)]
a = b[-1]
'''
for _ in range(3):
for stmt in 'a in b', 'a in set(b)':
t = timeit(stmt, setup, number=1)
print('%10.7f' % t, 'seconds for: ', stmt)
print()
Output:
10.0001129 seconds for: a in b
0.0000056 seconds for: a in set(b)
10.0008811 seconds for: a in b
0.0000071 seconds for: a in set(b)
10.0005529 seconds for: a in b
0.0000062 seconds for: a in set(b)
Upvotes: 0
Reputation: 13589
If you are going to be doing multiple lookups on long_list
, it is worth it. Otherwise, it is not.
$ python3 -m timeit -s 'x = list(range(10000))' '1234 in x'
100000 loops, best of 3: 5.71 usec per loop
$ python3 -m timeit -s 'x = list(range(10000))' '1234 in set(x)'
10000 loops, best of 3: 61.4 usec per loop
$ python3 -m timeit -s 'x = set(list(range(10000)))' '1234 in x'
10000000 loops, best of 3: 0.0198 usec per loop
Upvotes: 3