Reputation: 1439
I need to subclass set
so I subclassed collections.abc.Set
, as suggested here: https://stackoverflow.com/a/6698723/211858.
Please find my simple implementation below.
It essentially wraps a set of integers.
I generate list of 10,000 MySet
instances consisting of 100 random integers.
I would like to take the union of these wrapped sets.
I have two implementations below.
For some reason, the first using update
is very fast, yet the second using |=
is slow.
The tqdm
wrapper is to conduct nonrigorous benchmarks.
Is there some way to correct the definition of the class to fix this performance issue?
Thanks!
I'm on Python 3.10.5.
from collections.abc import Iterable, Iterator, Set
from tqdm import tqdm
class MySet(Set):
def __init__(self, integers: Iterable[int]) -> None:
self.data: set[int] = set(integers)
def __len__(self) -> int:
return len(self.data)
def __iter__(self) -> Iterator[int]:
return iter(self.data)
def __contains__(self, x: object) -> bool:
if isinstance(x, int):
return x in self.data
else:
raise NotImplemented
def my_func(self):
...
def my_other_func(self):
...
# %%
import random
# Make some mock data
my_sets: list[MySet] = [
MySet(random.sample(range(1_000_000), 100)) for _ in range(10_000)
]
# %%
universe: set[int] = set()
universe2: set[int] = set()
# %%
# Nearly instant
for my_set in tqdm(my_sets):
universe.update(my_set)
# %%
# Takes well over 5 minutes on my laptop
for my_set in tqdm(my_sets):
universe2 |= my_set
Upvotes: 1
Views: 524
Reputation: 7736
Conclusion: The way to add the least code is to implement the __ior__
method.
What happens when there is no implementation:
universe2
is set
and my_set
is MySet
, set
cannot recognize the MySet
class, so the binary inplace or operation will degenerate into a binary or operation.set
will fail, so Python will try to call the __ror__
method of MySet
.MySet
has no __ror__
method, Python will fall back to the collections.abc.Set
. The __ror__
method of it is the same as the __or__
method and returns the result of type MySet
. You can find it in the _collections_abc.py file:class Set(Collection):
...
@classmethod
def _from_iterable(cls, it):
'''Construct an instance of the class from any iterable input.
Must override this method if the class constructor signature
does not accept an iterable for an input.
'''
return cls(it)
...
def __or__(self, other):
if not isinstance(other, Iterable):
return NotImplemented
chain = (e for s in (self, other) for e in s)
return self._from_iterable(chain)
__ror__ = __or__
...
__ror__
operation changes universe2
to MySet
type and neither MySet
nor collections.abc.Set
has the __ior__
method, so the collections.abc.Set.__or__
function will be called repeatly, and a copy will be made per loop. This is the root cause of the slow speed of the second loop. Therefore, as long as the __ior__
method is implemented to avoid copying of subsequent operations, the performance will be greatly improved.Suggestions for better implementation: The abstract class collections.abc.Set
represents an immutable set. For this reason, it does not implement the inplace operation method. If you need your subclass to support inplace operation, you should consider inheriting collections.abc.MutableSet
and implementing the add
and discard
abstract methods. Mutableset
implements the inplace operation methods such as __ior__
through these two abstract methods (of course, it is still not efficient compared with the built-in set
, so it is better to implement them by yourself):
class MutableSet(Set):
...
def __ior__(self, it):
for value in it:
self.add(value)
return self
...
Correction: There are some mistakes in the old answers. Here we want to correct them. I hope those who have read the old answers can see here:
Mistacke 1:
If necessary, you can also implement the
__ior__
method, but it is not recommended to implement it separately when neither__or__
nor__ror__
methods are implemented, because Python will try to call the__ior__
method when it cannot find their implementation, which will make the behavior of non inplace operations become inplace operations, and may lead to unexpected results.
Correction: the binary or operation does not call the __ior__
method when the __or__
and __ror__
are missing.
Mistacke 2:
Generally speaking, binary operations between instances of different types may expect to get the type results of left operands, such as
set
andfrozenset
:>>> {1} | frozenset({2}) {1, 2} >>> frozenset({2}) | {1} frozenset({1, 2})
Correction: This is not always true. For example, the __ror__
operation of collections.abc.Set
also returns its subtype instead of the type of the left operand.
Upvotes: 3