DJ Wolfson
DJ Wolfson

Reputation: 171

Python3 Does input order matter for the .intersection() function in terms of runtime?

Lets say you have two sets, set1 is very large (a couple million values), and set2 is relatively small (a couple hundred thousand values). If I wanted to get the intersection of values between these two sets using the .interstion() function, would there be a runtime improvement based on the order of the inputs?

For example would one of these run faster than the other?

set1.intersection(set2)
set2.intersection(set1)

Upvotes: 5

Views: 396

Answers (1)

xavc
xavc

Reputation: 1295

No, input order does not matter. In CPython (the standard Python implementation), the set_intersection function handles set intersection. In the case where the other argument is also a set, the function will swap the two sets such that the smaller one is iterated through while the larger set is used for (constant time) lookup, as Booboo described:

        if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
            tmp = (PyObject *)so;
            so = (PySetObject *)other;
            other = tmp;
        }

        while (set_next((PySetObject *)other, &pos, &entry)) {
            key = entry->key;
            hash = entry->hash;
            rv = set_contains_entry(so, key, hash);
            if (rv < 0) {
                Py_DECREF(result);
                return NULL;
            }
            if (rv) {
                if (set_add_entry(result, key, hash)) {
                    Py_DECREF(result);
                    return NULL;
                }
            }
        }

Thus, where set1 and set2 are sets, set1.intersect(set2) and set2.intersect(set1) will have the same performance. A small empirical test with timeit agrees:

import random
import string
import timeit

big_set = set()
while len(big_set) < 1000000:
    big_set.add(''.join(random.choices(string.ascii_letters, k=6)))


small_set = set()
while len(small_set) < 10000:
    small_set.add(''.join(random.choices(string.ascii_letters, k=6)))

print("Timing...")
print(f"big_set.intersection(small_set): {min(timeit.Timer(lambda: big_set.intersection(small_set)).repeat(31, 500))}")
print(f"small_set.intersection(big_set): {min(timeit.Timer(lambda: small_set.intersection(big_set)).repeat(31, 500))}")

Upvotes: 4

Related Questions