Reputation: 34016
It seems that checking membership against the dict's keys as a set is a tad faster:
import random
import string
import timeit
repeat = 3
numbers = 1000
def time(statement, _setup=None):
    print min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers))
random.seed('slartibartfast')
# Integers
length = 100000
d = {}
for _ in range(length):
    d[random.randint(0, 10000000)] = 0
s = set(d)
setup = """from __main__ import s, d, length
"""
time('for i in xrange(length): check = i in d')
time('for i in xrange(length): check = i in s')
# Strings
d = {}
for _ in range(length):
    d[''.join(random.choice(string.ascii_uppercase) for __ in range(16))] = 0
s = set(d)
test_strings = []
for _ in range(length):
    test_strings.append(''.join(random.choice(string.ascii_uppercase) for __ in range(16)))
setup = """from __main__ import s, d, length, test_strings
"""
time('for i in test_strings: check = i in d')
time('for i in test_strings: check = i in s')
prints something like:
10.1242966769
9.73939713014
10.5156763102
10.2767765061
Is this to be expected, or is it a random artifact?
I am wondering whether it is worthwhile to create sets for dict keys in performance-intensive code.
Edit: my measurements really made me wonder about the underlying implementation. I am not trying to save microseconds; I am just curious. If it turns out that the underlying implementation really does favor sets, I could make a set of those dict keys - or not (I am actually patching legacy code).
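For anyone reproducing this on a current interpreter, a minimal Python 3 sketch of the same comparison (the sizes, seed, and repeat counts here are illustrative choices, not measurements from the question):

```python
import random
import timeit

random.seed('slartibartfast')

# Build a dict of random integer keys and a set of the same keys.
length = 100000
d = {random.randint(0, 10000000): 0 for _ in range(length)}
s = set(d)

# Time membership tests against each container; keep the best of 3 runs.
results = {}
for name, container in (('dict', d), ('set', s)):
    results[name] = min(timeit.repeat(
        'for i in range(length): check = i in c',
        globals={'length': length, 'c': container},
        repeat=3, number=10))
    print(name, results[name])
```

On Python 3 the two numbers are typically within noise of each other at this size, as the answers below also suggest.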
Upvotes: 4
Views: 3544
Reputation: 13
As mentioned in the comments of setobject.c in the CPython repo, sets are optimized for both the found and not-found cases, making them a better option for checking membership:
Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.
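That comment can be illustrated with a small timing sketch that exercises both the found and not-found cases separately (the sizes, seed, and probe construction are my own choices for illustration):

```python
import random
import timeit

random.seed(0)

# 100k distinct integer keys, stored in both a dict and a set.
keys = random.sample(range(10**8), 100000)
d = dict.fromkeys(keys, 0)
s = set(keys)

hits = keys[:10000]                  # keys known to be present
misses = [k + 10**8 for k in hits]   # shifted out of range, so guaranteed absent

# Time hit and miss lookups against each container.
timings = {}
for case, probes in (('found', hits), ('not-found', misses)):
    for name, c in (('dict', d), ('set', s)):
        timings[(name, case)] = min(timeit.repeat(
            'for k in probes: k in c',
            globals={'probes': probes, 'c': c},
            repeat=3, number=20))
print(timings)
```

Whether the not-found case actually favors the set on your machine will vary with hardware and Python version; the point of the sketch is only that both cases are worth measuring.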
Upvotes: 0
Reputation: 8596
Honestly, it depends heavily on hardware, OS, and data size/constraints. In general, performance will be almost identical until you get to really big data sizes. Note a few runs here where the dict does marginally better. At larger data structure sizes the internal implementation details start to dominate the differences, and on my machine set tends to perform significantly better.
The reality is that under most circumstances the delta doesn't matter. If you really want better lookup performance, consider moving to C-level operations with cython or ctypes, or make use of library implementations designed for larger data sizes. Python base types are not meant for performance when you reach above a few million elements.
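As one lightweight illustration of pushing the loop down to compiled code - using NumPy here as my own substitution, since the cython and ctypes routes the answer mentions require a build step - np.isin performs a bulk membership test in C for numeric data:

```python
import numpy as np

rng = np.random.default_rng(0)

# One million candidate values and ten thousand values to look up.
haystack = rng.integers(0, 10**9, size=1_000_000)
needles = rng.integers(0, 10**9, size=10_000)

# Vectorized membership test: one boolean per needle, computed in C.
mask = np.isin(needles, haystack)
print(mask.shape, mask.dtype)
```

This only helps when the lookups can be batched; for one-at-a-time probes inside Python code, a plain dict or set remains the right tool.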
>>> # With empty dict as setup in question
>>> time('for i in xrange(length): check = i in d')
2.83035111427
>>> time('for i in xrange(length): check = i in s')
2.87069892883
>>> d = { random.random(): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.84766697884
>>> time('for i in xrange(length): check = i in s')
3.97955989838
>>> d = { random.randint(0, 1000000000): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.96871709824
>>> time('for i in xrange(length): check = i in s')
3.62110710144
>>> d = { random.randint(0, 1000000000): None for _ in xrange(10000000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
10.6934559345
>>> time('for i in xrange(length): check = i in s')
5.7491569519
Upvotes: 4
Reputation: 14847
Probably depends on a variety of things. On my runs, dictionary lookups have been slightly faster, but not by enough to get excited about:
In [1]: import numpy as np
In [2]: d = {i: True for i in np.random.random(1000)}
In [3]: s = {i for i in np.random.random(1000)}
In [4]: checks = d.keys()[:500] + list(s)[:500]
In [5]: %timeit [k in d for k in checks]
10000 loops, best of 3: 83 µs per loop
In [6]: %timeit [k in s for k in checks]
10000 loops, best of 3: 88.4 µs per loop
In [7]: d = {i: True for i in np.random.random(100000)}
In [8]: s = {i for i in np.random.random(100000)}
In [9]: checks = d.keys()[:5000] + list(s)[:5000]
In [10]: %timeit [k in d for k in checks]
1000 loops, best of 3: 865 µs per loop
In [11]: %timeit [k in s for k in checks]
1000 loops, best of 3: 929 µs per loop
Upvotes: 1