Mr_and_Mrs_D

Reputation: 34016

Check for membership in a dict vs a set in Python

It seems that checking membership in the dict's keys as a set is a tad faster:

import random
import string
import timeit

repeat = 3
numbers = 1000

def time(statement, _setup=None):
    # Report the best of `repeat` timings, each running `numbers` iterations.
    print min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers))

random.seed('slartibartfast')

# Integers
length = 100000
d = {}
for _ in range(length):
    d[random.randint(0, 10000000)] = 0
s = set(d)

setup = """from __main__ import s, d, length
"""

time('for i in xrange(length): check = i in d')
time('for i in xrange(length): check = i in s')

# Strings
d = {}
for _ in range(length):
    d[''.join(random.choice(string.ascii_uppercase) for __ in range(16))] = 0
s = set(d)

test_strings = []
for _ in range(length):
    test_strings.append(
        ''.join(random.choice(string.ascii_uppercase) for __ in range(16)))

setup = """from __main__ import s, d, length, test_strings
"""

time('for i in test_strings: check = i in d')
time('for i in test_strings: check = i in s')

prints something like:

10.1242966769
9.73939713014
10.5156763102
10.2767765061

Is this to be expected, or is it a random artifact?

I am wondering if it's worthwhile to create sets for dict keys in performance-intensive code.

Edit: my measurements really made me wonder about the underlying implementation; I am not trying to save microseconds, I am just curious. If it turns out the underlying implementation really does favor sets, I could make a set of those dict keys - or not (I am actually patching legacy code).
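For concreteness, here is a minimal sketch of the two options I am weighing - copying the keys into a separate set, versus the set-like key view that Python 2.7 already provides (dict.viewkeys(); in Python 3 this is what dict.keys() returns):

# Minimal sketch of the two options (toy dict, names illustrative).
d = {'spam': 1, 'eggs': 2}

# Option 1: materialize a separate set - costs an O(n) copy and extra
# memory, and it goes stale if the dict is mutated afterwards.
key_set = set(d)
print('spam' in key_set)   # True

# Option 2: a live set-like view over the keys (Python 2.7+), no copy,
# always in sync with the dict. In Python 3, d.keys() behaves this way.
key_view = d.viewkeys()
print('spam' in key_view)  # True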

Upvotes: 4

Views: 3544

Answers (3)

Jim Mim

Reputation: 13

As mentioned in the comments of setobject.c in the CPython repo, sets are optimized for both the found and not-found cases, making them a better option for membership testing.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

setobject.c#L27
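A quick way to see the effect the comment describes (a minimal sketch; absolute numbers will vary by machine and Python version) is to time all-hit and all-miss lookups separately:

import timeit

setup = """
d = dict.fromkeys(range(100000))
s = set(d)
hits = list(range(100000))            # every lookup succeeds
misses = list(range(100000, 200000))  # every lookup fails
"""

for stmt in ['for i in hits: i in d',
             'for i in misses: i in d',
             'for i in hits: i in s',
             'for i in misses: i in s']:
    print(stmt)
    print(min(timeit.Timer(stmt, setup=setup).repeat(3, 100)))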

Upvotes: 0

Pyrce

Reputation: 8596

Honestly, it depends heavily on hardware, OS, and data size/constraints. In general, performance will be almost identical until you get to really big data sizes. Note a few runs below where the dict does marginally better. At larger data-structure sizes, internal implementation details start to dominate the differences, and on my machine set tends to perform significantly better.

The reality is that in most situations the delta doesn't matter. If you really want better lookup performance, consider moving to C-level operations with Cython or ctypes, or use library implementations designed for larger data sizes. Python's base types are not meant for performance once you reach more than a few million elements.

>>> # With empty dict as setup in question
>>> time('for i in xrange(length): check = i in d')
2.83035111427
>>> time('for i in xrange(length): check = i in s')
2.87069892883
>>> d = { random.random(): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.84766697884
>>> time('for i in xrange(length): check = i in s')
3.97955989838
>>> d = { random.randint(0, 1000000000): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.96871709824
>>> time('for i in xrange(length): check = i in s')
3.62110710144
>>> d = { random.randint(0, 1000000000): None for _ in xrange(10000000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
10.6934559345
>>> time('for i in xrange(length): check = i in s')
5.7491569519
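To illustrate the last point, a hedged sketch of batching the membership test through numpy instead of a Python-level loop (this assumes the keys are plain integers that fit in a numpy array and that the queries can be batched; np.in1d does one C-level pass rather than millions of interpreted hash lookups):

import numpy as np

keys = np.random.randint(0, 1000000000, size=10000000)   # the "dict keys"
queries = np.random.randint(0, 1000000000, size=100000)  # values to test

# Vectorized membership test: one boolean per query, computed in C.
mask = np.in1d(queries, keys)
print(mask.sum())  # how many queries were found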

Upvotes: 4

Randy

Reputation: 14847

It probably depends on a variety of things. In my runs, dictionary lookups have been slightly faster, but not by enough to get excited about:

In [1]: import numpy as np

In [2]: d = {i: True for i in np.random.random(1000)}

In [3]: s = {i for i in np.random.random(1000)}

In [4]: checks = d.keys()[:500] + list(s)[:500]

In [5]: %timeit [k in d for k in checks]
10000 loops, best of 3: 83 µs per loop

In [6]: %timeit [k in s for k in checks]
10000 loops, best of 3: 88.4 µs per loop

In [7]: d = {i: True for i in np.random.random(100000)}

In [8]: s = {i for i in np.random.random(100000)}

In [9]: checks = d.keys()[:5000] + list(s)[:5000]

In [10]: %timeit [k in d for k in checks]
1000 loops, best of 3: 865 µs per loop

In [11]: %timeit [k in s for k in checks]
1000 loops, best of 3: 929 µs per loop

Upvotes: 1
