Anders Solberg
Anders Solberg

Reputation: 53

Is it overall more efficient to convert list to set when checking if a value is in the set

I was reading up on some best practices for speed in python and found this which says:

Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n).

When testing "a in b", b should be a set or dictionary instead of a list or tuple.

But if say i have a list long_list and I want to find out if item list_item is in long_list like:

list_item in long_list

Would it under any circumstance be faster to do:

list_item in Set(long_list)

Seeing as I think list to set or dict conversion on average should be O(n) in itself. (?)

Or is it always better to just go with whichever data-type I'm already working with?

Upvotes: 2

Views: 1377

Answers (2)

Kelly Bundy
Kelly Bundy

Reputation: 27588

Would it under any circumstance be faster to do: list_item in Set(long_list)

Yes, here's one where it's about 240 times faster:

from timeit import timeit

# Setup the circumstance
b = ['c' * 100000 + chr(i) for i in range(100)]
set(b)
a = b[-1]

# Measure
for _ in range(3):
    print(timeit(lambda: a in b,      number=1000))
    print(timeit(lambda: a in set(b), number=1000))
    print()

Output:

1.3284053
0.005440200000000006

1.3530345000000001
0.005345699999999898

1.3339443000000002
0.0056618999999997754

That first set(b) during setup of the circumstance makes the strings compute and store their hashes. You could do for s in b: hash(s) instead, I was just lazy.

CPython version used: Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32

Results on repl.it with its current Python version 3.8.1 64-bit (and I think it's CPython on Linux):

1.2134423210000023
0.0042260629998054355

1.268552630999693
0.005732013999931951

1.1268463759997758
0.003574737000235473

Update

Here's a case where the set version is more than a million times faster for a single check, and you can increase the factor as much as you want by slowing down comparisons further:

from timeit import timeit

setup = '''
from time import sleep
class C:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)
    def __eq__(self, other):
        sleep(10)
        return self.value == other.value
b = [C(1), C(2)]
a = b[-1]
'''

for _ in range(3):
    for stmt in 'a in b', 'a in set(b)':
        t = timeit(stmt, setup, number=1)
        print('%10.7f' % t, 'seconds for: ', stmt)
    print()

Output:

10.0001129 seconds for:  a in b
 0.0000056 seconds for:  a in set(b)

10.0008811 seconds for:  a in b
 0.0000071 seconds for:  a in set(b)

10.0005529 seconds for:  a in b
 0.0000062 seconds for:  a in set(b)

Upvotes: 0

0x5453
0x5453

Reputation: 13589

If you are going to be doing multiple lookups on long_list, it is worth it. Otherwise, it is not.

$ python3 -m timeit -s 'x = list(range(10000))' '1234 in x'
100000 loops, best of 3: 5.71 usec per loop

$ python3 -m timeit -s 'x = list(range(10000))' '1234 in set(x)'
10000 loops, best of 3: 61.4 usec per loop

$ python3 -m timeit -s 'x = set(list(range(10000)))' '1234 in x'
10000000 loops, best of 3: 0.0198 usec per loop

Upvotes: 3

Related Questions