Davide_sd
Davide_sd

Reputation: 13135

Multiple elements in set with the same hash

I was reading through SymPy 1.13.0 release notes when an entry caught my attention (emphasis mine):

The hash function for Floats and expressions involving Float now respects the hash invariant that if a == b then hash(a) == hash(b). This ensures that it is possible to use such expressions in a guaranteed deterministic way in Python's core set and dict data structures. It is however not recommended to mix up sympy Basic objects with non-sympy number types such as core Python's float or int in the same set/dict.

I mixed up Python's numbers with SymPy's numbers in a set, and I can't explain what's going on (sympy 1.13.0).

I thought elements of a set all have different hashes. For example, in the following code block there are 4 strings, a, b, c1, c2. They are distinct objects in memory. However, c1 is equal to c2, so they have the same hash. Hence, set([a, b, c1, c2]) only contains three elements, as expected:

a, b = "a", "b"
c1 = "This is a test"
c2 = "This is a test"
t = [a, b, c1, c2]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)

# hashes: -8635426522778860779, 3774980000107278733, 3487163462126586929, 3487163462126586929
# ids: 11791568, 11792624, 135187170594160, 135187170598448
# {'This is a test', 'a', 'b'}

In the following exampe there are three different objects in memory, n1, n2, n3. They all generates the same hash. I expected the resulting set to only contain one element. Instead, it contains two elements, apparently sharing the same hash.

from sympy import *
n1 = 2.0
n2 = Float(2.0)
n3 = Float(2.0, precision=80)
t = [n1, n2, n3]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)

# ids: 135913432385776, 135912654307472, 135912654307632
# hashes: 2, 2, 2
# {2.0, 2.0000000000000000000000}

What is going on?

Upvotes: 0

Views: 114

Answers (3)

Oscar Benjamin
Oscar Benjamin

Reputation: 14470

The change in SymPy 1.13 that prompted this release note is that now SymPy's Integer type will compare unequal to Float and also Floats with different precision will compare unequal. Note that this is specifically about comparing objects with == which in SymPy represents structural equality. Here structural equality concerns whether or not two expressions are identical as expressions rather than representing mathematically equivalent objects. For the latter use Eq:

In [11]: Eq(Integer(2), Float(2))
Out[11]: True

In [12]: Integer(2) == Float(2)
Out[12]: False

SymPy has always had problems with == because this operator is implemented by the method __eq__ and then the same method is used both for an explicit comparison like A == B and also by Python's basic hash-based data structures like set and dict. Ideally it would be possible to handle comparisons like A == B with one method (e.g. __eq__) and then have sets and dicts use some other kind of comparison method like a hypothetical __strict_eq__ method. Unfortunately this is impossible in Python. SymPy's implementation of __eq__ was sort of a kludge that tried to satisfy what users might expect for A == B with number types while also keeping SymPy's internals consistent with respect to == meaning a structural comparison of expressions.

As of SymPy 1.13 this tension is resolved in favour of saying that == for SymPy expressions is strictly for structural equality which in turn means that Integer(2) == Float(2) needs to be False. This also made it possible to fix Float.__hash__ because previously when Integer(2) == Float(2) was True this would have caused sets and dicts to treat them interchangeably as is done for Python float and int:

In [13]: {1, 1.0}
Out[13]: {1}

In [14]: {1.0, 1}
Out[14]: {1.0}

This behaviour would have been unacceptable for SymPy expressions especially in light of the way that many operations are cached using things like dict. The fact that this did not happen with Integer and Float in SymPy 1.12 is because Float.__hash__ returned incorrect results:

In [1]: {Integer(2), Float(2)}
Out[1]: {2, 2.0}

In [2]: {Float(2), Integer(2)}
Out[2]: {2, 2.0}

In [3]: hash(Integer(2))
Out[3]: 2

In [4]: hash(Float(2))
Out[4]: -139014436445701027

In [5]: Integer(2) == Float(2)
Out[5]: True

So Float.__hash__ returning incorrect results was necessary to get the set/dict behaviour that SymPy needs but this also relied on hoping that there is never a hash collision. Fixing it so that Integer(2) == Float(2) evaluates to False now means that a hash collision does not cause the set to drop one of the expressions so Float.__hash__ can be made consistent with Float.__eq__ without in turn causing many other bugs.

The reason that the release note cautions against mixing SymPy expressions with ordinary numbers is that the results can be unpredictable much like {1, 1.0} and {1.0, 1} which can result in an int or float being in the set. If you care whether the object is an int or a float then it is not really acceptable that set treats them as interchangeable. Likewise if you care whether your object is a SymPy expression or not then you also need to care e.g.:

In [3]: type({Integer(1), 1}.pop())
Out[3]: sympy.core.numbers.One

In [4]: type({1, Integer(1)}.pop())
Out[4]: int

In [5]: {Integer(1), 1}.pop().evalf()
Out[5]: 1.00000000000000

In [6]: {1, Integer(1)}.pop().evalf()
...
AttributeError: 'int' object has no attribute 'evalf'

Python's int type does not have the same methods as a SymPy expression so it is problematic that the set can unpredictably return either an int or an Integer. This ambiguity does not arise (as of SymPy 1.13) if all of the elements of the set are SymPy expressions.

The behaviour of sets and dicts for SymPy expressions and for basic types is not equivalent which is another reason that they should not be mixed together:

In [15]: {2, 2.0}
Out[15]: {2}

In [16]: {Integer(2), Float(2)}
Out[16]: {2, 2.0}

If you have say a list of mixed type items that are possibly SymPy expressions or possibly int or float or other types then the behaviour of set(things) can be unpredictable. If all the items are SymPy expressions then it will always be predictable. The sympify function converts standard Python types into SymPy expressions so another way to express this is to say that all of the elements of the set should be sympified.

Upvotes: 2

jsbueno
jsbueno

Reputation: 110121

To put it in other words: it is possible to elements with the same hash to be in the same set, or in the same dictionary as keys - after hash comparison, Python also performs an equality check (==, using the __eq__ method) to ensure the the object found is indeed the one that was being searched for.

When two (or more) elements with the same hash are in the same set, it's called a "hash collision" - the inner workings for dicts and sets simply compare the repeated-hash elements linearly with eq when searching.

Here is a quick timing example with a custom class deliberately created to cause hash collisions:


In [12]: class Clash(int):
    ...:     def __hash__(self):
    ...:         return 0
    ...:

In [13]: class NoClash(int):
    ...:     pass
    ...:


# for inputs 14-15: I tried to fill a set with "Clash" elements, but I worried about it taking too much time, so I aborted it - with 10_000 I got sub 1second:

In [16]: %time clash_set = {Clash(i) for i in range(10_001)}
CPU times: user 826 ms, sys: 93 μs, total: 826 ms
Wall time: 826 ms

In [17]: %time noclash_set = {NoClash(i) for i in range(10_001)}
CPU times: user 3.45 ms, sys: 0 ns, total: 3.45 ms
Wall time: 3.46 ms


In [18]: %time Clash(10_000) in clash_set
CPU times: user 411 μs, sys: 7 μs, total: 418 μs
Wall time: 423 μs
Out[18]: True

In [19]: %time 10_000 in noclash_set
CPU times: user 12 μs, sys: 1 μs, total: 13 μs
Wall time: 18.4 μs
Out[19]: True


In [20]: %time NoClash(10_000) in noclash_set
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 4.29 μs
Out[20]: True

In time: actually, as sets and dicts won't have 2**64 slots, and usually not 2 ** 32 slots for data, being actually proportional to the data size, what is used is a remainder (or other "hash from the hash") of the hash value to compute the actual slot index a data item is placed upon. Say that ordinary sets are initially allocated to contain 128 elements (a wild guess) - in practice, when populating this, there will be many more collisions (at a certain threshold, the internal set heuristics will allocate more memory and re-arrange the slots in order to keep efficiency with the growing data content)

Upvotes: 1

no comment
no comment

Reputation: 10143

Those two elements simply aren't equal.

A simpler example is {-1, -2}. Both elements have hash -2, but you wouldn't expect one of them to be kicked out of the set because of that, would you?

Upvotes: 3

Related Questions