Reputation: 71
set_1 = {1, 2, 3, 4, 5, 6, 7}
set_2 = {2, 4, 6, 8, 10, 12}
intersec = set_1 & set_2
What happens under the hood when & is used between two sets? What is its time complexity?
Upvotes: 4
Views: 1846
Reputation: 51132
The set data structure, and its operations, are implemented in C. The source code can be found on GitHub, in the file Objects/setobject.c. For the benefit of those who don't write C, here is a rough translation of how it works in Python:
# helper function; only self needs to be a set
def set_intersection(self, other):
    if self is other:
        return self.copy()
    result = set()
    if isinstance(other, (set, frozenset)):
        # when both are sets, loop over the smaller one
        if len(other) > len(self):
            self, other = other, self
    for item in other:
        if item in self:
            result.add(item)
    return result

# when using the & operator; self and other both must be sets
def set_and(self, other):
    if not isinstance(self, (set, frozenset)) or not isinstance(other, (set, frozenset)):
        return NotImplemented
    return set_intersection(self, other)

# when using the .intersection method, which accepts multiple "other" arguments
def set_intersection_multi(self, *others):
    if not others:
        return self.copy()
    result = self
    for other in others:
        result = set_intersection(result, other)
    return result
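As a quick sanity check (using the question's sets and the helper names defined in the translation above, which are not real CPython functions), the translation agrees with the built-in behaviour:

a = {1, 2, 3, 4, 5, 6, 7}
b = {2, 4, 6, 8, 10, 12}
assert set_and(a, b) == (a & b) == {2, 4, 6}
assert set_intersection_multi(a, b, {4, 6, 7}) == a.intersection(b, {4, 6, 7}) == {4, 6}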
This isn't a literal translation; the C code has one loop for the case where other is a set, and another loop for when it isn't, because in the former case the item hashes are already known and don't need to be recomputed. That detail can't be expressed in Python, so both C loops correspond to the same Python loop above, and I simplified accordingly. Another untranslatable difference is that in C, if self is a frozenset then the result will be a frozenset too; but Python code can't use .add in that case.
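To make those differences concrete, this is what they look like at the interactive prompt (just illustrating the behaviour described above):

>>> {1, 2, 3} & [2, 3, 4]                # the operator requires both operands to be sets
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for &: 'set' and 'list'
>>> {1, 2, 3}.intersection([2, 3, 4])    # the method accepts any iterable
{2, 3}
>>> frozenset({1, 2, 3}) & {2, 3, 4}     # a frozenset on the left produces a frozenset
frozenset({2, 3})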
Because testing membership in a set (the in operator) takes O(1) expected time, the expected time complexity of the intersection is O(min(m, n)) when both operands are sets, where m and n are the lengths of self and other.
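One rough way to see that behaviour is to time the operator with operands of very different sizes; the loop runs over the smaller set either way, so operand order should make little difference (exact numbers depend on your machine):

import timeit

small = set(range(100))
big = set(range(1_000_000))

# both calls loop over the 100-element set, so they take roughly the same time
print(timeit.timeit(lambda: small & big, number=1000))
print(timeit.timeit(lambda: big & small, number=1000))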
When other is not a set, the expected time complexity of the intersection is equal to the time complexity of iterating over other, plus the time complexity of computing the hashes of all the items in other. This will typically be O(n), but theoretically it could be higher.
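As a sketch of that last point, you can count the hash computations with a small wrapper class (Noisy is invented just for this demonstration); on CPython, intersecting a set with a list hashes each element of the list exactly once:

# Noisy is a hypothetical class used only to count __hash__ calls
class Noisy:
    calls = 0
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        Noisy.calls += 1
        return hash(self.value)
    def __eq__(self, other):
        return isinstance(other, Noisy) and self.value == other.value

items = [Noisy(i) for i in range(10)]
s = set(items)           # hashing happens here, while the set is built
Noisy.calls = 0
s.intersection(items)    # items is a list, so every element gets hashed again
print(Noisy.calls)       # 10 on CPython: one __hash__ call per list element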
Upvotes: 10