Won Oh
Won Oh

Reputation: 336

How does Python determine if two sets are equal? Is there any possible optimization for this?

I understand implementing set equality is easy if we just do bare-bone iteration

def __eq__(self, b):
    if len(self) != len(b): # Cut fast optimization
        return False
    for x in self:
        if x not in b:
            return False
    for x in b:
        if x not in a:
            return False
    return True

Is there a more efficient way to implement this? Is there any optimization Python performs (or possible optimization for computing non-colliding Hash) for these equality?

Upvotes: 3

Views: 1552

Answers (1)

user7661619
user7661619

Reputation:

CPython's Implementation

Built-in optimizations for set equality depend on the implementation of Python that is being used, which is CPython in most cases. You can check out the internal definition of set objects and how set comparisons are implemented by looking at CPython's source code.

First of all, it is important to note that Python features the classes Set and FrozenSet. The main difference is that instances of FrozenSet are immutable, and therefore are hashable and can be used as hash keys; the hash member is only calculated for instances of FrozenSet (see struct definition). They largely share the same codebase in CPython. CPython uses the following steps to determine set equality:

  1. compare sizes of sets
  2. compare hash values of sets (FrozenSet only!)
  3. is set v a subset of w?

Hashes may collide, hence why the other steps are still necessary when comparing instances of FrozenSet.

CPython's Optimizations

A typical baseline implementation would check that both sets are subsets of each other. Your example also features a very intuitive optimization by comparing the sizes of the sets to determine set inequality.

CPython implements the optimization that Arya McCarthy and Tim Peters already pointed out: comparing sizes (step 1) and checking that one set is a subset of the other set (step 3) are enough to determine equality. This cuts down the time needed to determine set equality by half.

CPython also implements another optimization for instances of FrozenSet by comparing their hash values, but those values may collide. Therefore, we can only rely on mismatching values as an indicator that two sets are not equal, at a cost of O(1). However, matching hash values do not imply that two sets are equal, because collisions are possible.

I don't know if the extra overhead introduced by (re)calculating hash values for mutable sets is worth it, but perhaps CPython's implementation could benefit from a solution where the hash values are only calculated/used internally to determine equality, when needed. This is where mutable generic sets could potentially benefit from a hash-based optimization.

Non-Colliding Hashes

You mentioned an optimization strategy based on non-colliding hashes, but as far as I know, non-colliding hashes and generic sets cannot go together, regardless of implementation. (Here's a link to a post on crypto.stackexchange.com that explains the problem pretty well.)

If the use of non-colliding hashes was possible for sets, then the only step needed to determine the equality of two sets would be to compare the hash values of both sets. This would be a massive optimization, as it would reduce the cost of determining set equality to O(1) in all cases. Steps 1 and 3 of CPython's implementations would not be necessary anymore.

That being said, the use of non-colliding hash values could still be possible for specific problem domains, where the contents of sets are constrained in some way, such that a hash function exists that promises that hash values do not collide. In that case, you probably should not rely on Python's built-in sets, or implement your own specialized set in Python (to avoid all that extra overhead). Instead, your best option would be to implement your own specialized implementation of a set in C/C++, and use Python bindings to integrate your specialized implementation into Python.

Upvotes: 1

Related Questions