Reputation: 171
Lets say you have two sets, set1 is very large (a couple million values), and set2 is relatively small (a couple hundred thousand values). If I wanted to get the intersection of values between these two sets using the .interstion() function, would there be a runtime improvement based on the order of the inputs?
For example would one of these run faster than the other?
set1.intersection(set2)
set2.intersection(set1)
Upvotes: 5
Views: 396
Reputation: 1295
No, input order does not matter. In CPython (the standard Python implementation), the set_intersection
function handles set intersection. In the case where the other argument is also a set, the function will swap the two sets such that the smaller one is iterated through while the larger set is used for (constant time) lookup, as Booboo described:
if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
tmp = (PyObject *)so;
so = (PySetObject *)other;
other = tmp;
}
while (set_next((PySetObject *)other, &pos, &entry)) {
key = entry->key;
hash = entry->hash;
rv = set_contains_entry(so, key, hash);
if (rv < 0) {
Py_DECREF(result);
return NULL;
}
if (rv) {
if (set_add_entry(result, key, hash)) {
Py_DECREF(result);
return NULL;
}
}
}
Thus, where set1
and set2
are sets, set1.intersect(set2)
and set2.intersect(set1)
will have the same performance. A small empirical test with timeit
agrees:
import random
import string
import timeit
big_set = set()
while len(big_set) < 1000000:
big_set.add(''.join(random.choices(string.ascii_letters, k=6)))
small_set = set()
while len(small_set) < 10000:
small_set.add(''.join(random.choices(string.ascii_letters, k=6)))
print("Timing...")
print(f"big_set.intersection(small_set): {min(timeit.Timer(lambda: big_set.intersection(small_set)).repeat(31, 500))}")
print(f"small_set.intersection(big_set): {min(timeit.Timer(lambda: small_set.intersection(big_set)).repeat(31, 500))}")
Upvotes: 4