hwong557
hwong557

Reputation: 1439

|= vs update with a subclass of collections.abc.Set

I need to subclass set so I subclassed collections.abc.Set, as suggested here: https://stackoverflow.com/a/6698723/211858. Please find my simple implementation below. It essentially wraps a set of integers.

I generate list of 10,000 MySet instances consisting of 100 random integers. I would like to take the union of these wrapped sets. I have two implementations below. For some reason, the first using update is very fast, yet the second using |= is slow. The tqdm wrapper is to conduct nonrigorous benchmarks. Is there some way to correct the definition of the class to fix this performance issue? Thanks!

I'm on Python 3.10.5.

from collections.abc import Iterable, Iterator, Set

from tqdm import tqdm


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        if isinstance(x, int):
            return x in self.data
        else:
            raise NotImplemented

    def my_func(self):
        ...

    def my_other_func(self):
        ...


# %%
import random

# Make some mock data
my_sets: list[MySet] = [
    MySet(random.sample(range(1_000_000), 100)) for _ in range(10_000)
]

# %%
universe: set[int] = set()
universe2: set[int] = set()

# %%
# Nearly instant
for my_set in tqdm(my_sets):
    universe.update(my_set)

# %%
# Takes well over 5 minutes on my laptop
for my_set in tqdm(my_sets):
    universe2 |= my_set

Upvotes: 1

Views: 524

Answers (1)

Mechanic Pig
Mechanic Pig

Reputation: 7736

Conclusion: The way to add the least code is to implement the __ior__ method.

What happens when there is no implementation:

  1. When binary inplace or operation is performed for the first time, because universe2 is set and my_set is MySet, set cannot recognize the MySet class, so the binary inplace or operation will degenerate into a binary or operation.
  2. As in point 1, the binary or operation of set will fail, so Python will try to call the __ror__ method of MySet.
  3. Because MySet has no __ror__ method, Python will fall back to the collections.abc.Set. The __ror__ method of it is the same as the __or__ method and returns the result of type MySet. You can find it in the _collections_abc.py file:
class Set(Collection):
    ...
    @classmethod
    def _from_iterable(cls, it):
        '''Construct an instance of the class from any iterable input.

        Must override this method if the class constructor signature
        does not accept an iterable for an input.
        '''
        return cls(it)
    ...
    def __or__(self, other):
        if not isinstance(other, Iterable):
            return NotImplemented
        chain = (e for s in (self, other) for e in s)
        return self._from_iterable(chain)

    __ror__ = __or__
    ...
  1. For the subsequent binary inplace or operation, because the first __ror__ operation changes universe2 to MySet type and neither MySet nor collections.abc.Set has the __ior__ method, so the collections.abc.Set.__or__ function will be called repeatly, and a copy will be made per loop. This is the root cause of the slow speed of the second loop. Therefore, as long as the __ior__ method is implemented to avoid copying of subsequent operations, the performance will be greatly improved.

Suggestions for better implementation: The abstract class collections.abc.Set represents an immutable set. For this reason, it does not implement the inplace operation method. If you need your subclass to support inplace operation, you should consider inheriting collections.abc.MutableSet and implementing the add and discard abstract methods. Mutableset implements the inplace operation methods such as __ior__ through these two abstract methods (of course, it is still not efficient compared with the built-in set, so it is better to implement them by yourself):

class MutableSet(Set):
    ...
    def __ior__(self, it):
        for value in it:
            self.add(value)
        return self
    ...

Correction: There are some mistakes in the old answers. Here we want to correct them. I hope those who have read the old answers can see here:

Mistacke 1:

If necessary, you can also implement the __ior__ method, but it is not recommended to implement it separately when neither __or__ nor __ror__ methods are implemented, because Python will try to call the __ior__ method when it cannot find their implementation, which will make the behavior of non inplace operations become inplace operations, and may lead to unexpected results.

Correction: the binary or operation does not call the __ior__ method when the __or__ and __ror__ are missing.

Mistacke 2:

Generally speaking, binary operations between instances of different types may expect to get the type results of left operands, such as set and frozenset:

>>> {1} | frozenset({2})
{1, 2}
>>> frozenset({2}) | {1}
frozenset({1, 2})

Correction: This is not always true. For example, the __ror__ operation of collections.abc.Set also returns its subtype instead of the type of the left operand.

Upvotes: 3

Related Questions