Reputation: 13135
I was reading through SymPy 1.13.0 release notes when an entry caught my attention (emphasis mine):
The hash function for
Floats
and expressions involving Float now respects the hash invariant that ifa == b
thenhash(a) == hash(b)
. This ensures that it is possible to use such expressions in a guaranteed deterministic way in Python's coreset
anddict
data structures. It is however not recommended to mix up sympyBasic
objects with non-sympy number types such as core Python'sfloat
orint
in the sameset/dict
.
I mixed up Python's numbers with SymPy's numbers in a set, and I can't explain what's going on (sympy 1.13.0).
I thought elements of a set all have different hashes. For example, in the following code block there are 4 strings, a, b, c1, c2
. They are distinct objects in memory. However, c1
is equal to c2
, so they have the same hash. Hence, set([a, b, c1, c2])
only contains three elements, as expected:
a, b = "a", "b"
c1 = "This is a test"
c2 = "This is a test"
t = [a, b, c1, c2]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)
# hashes: -8635426522778860779, 3774980000107278733, 3487163462126586929, 3487163462126586929
# ids: 11791568, 11792624, 135187170594160, 135187170598448
# {'This is a test', 'a', 'b'}
In the following exampe there are three different objects in memory, n1, n2, n3
. They all generates the same hash. I expected the resulting set to only contain one element. Instead, it contains two elements, apparently sharing the same hash.
from sympy import *
n1 = 2.0
n2 = Float(2.0)
n3 = Float(2.0, precision=80)
t = [n1, n2, n3]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)
# ids: 135913432385776, 135912654307472, 135912654307632
# hashes: 2, 2, 2
# {2.0, 2.0000000000000000000000}
What is going on?
Upvotes: 0
Views: 114
Reputation: 14470
The change in SymPy 1.13 that prompted this release note is that now SymPy's Integer
type will compare unequal to Float
and also Floats
with different precision will compare unequal. Note that this is specifically about comparing objects with ==
which in SymPy represents structural equality. Here structural equality concerns whether or not two expressions are identical as expressions rather than representing mathematically equivalent objects. For the latter use Eq
:
In [11]: Eq(Integer(2), Float(2))
Out[11]: True
In [12]: Integer(2) == Float(2)
Out[12]: False
SymPy has always had problems with ==
because this operator is implemented by the method __eq__
and then the same method is used both for an explicit comparison like A == B
and also by Python's basic hash-based data structures like set
and dict
. Ideally it would be possible to handle comparisons like A == B
with one method (e.g. __eq__
) and then have sets and dicts use some other kind of comparison method like a hypothetical __strict_eq__
method. Unfortunately this is impossible in Python. SymPy's implementation of __eq__
was sort of a kludge that tried to satisfy what users might expect for A == B
with number types while also keeping SymPy's internals consistent with respect to ==
meaning a structural comparison of expressions.
As of SymPy 1.13 this tension is resolved in favour of saying that ==
for SymPy expressions is strictly for structural equality which in turn means that Integer(2) == Float(2)
needs to be False
. This also made it possible to fix Float.__hash__
because previously when Integer(2) == Float(2)
was True
this would have caused sets and dicts to treat them interchangeably as is done for Python float and int:
In [13]: {1, 1.0}
Out[13]: {1}
In [14]: {1.0, 1}
Out[14]: {1.0}
This behaviour would have been unacceptable for SymPy expressions especially in light of the way that many operations are cached using things like dict
. The fact that this did not happen with Integer
and Float
in SymPy 1.12 is because Float.__hash__
returned incorrect results:
In [1]: {Integer(2), Float(2)}
Out[1]: {2, 2.0}
In [2]: {Float(2), Integer(2)}
Out[2]: {2, 2.0}
In [3]: hash(Integer(2))
Out[3]: 2
In [4]: hash(Float(2))
Out[4]: -139014436445701027
In [5]: Integer(2) == Float(2)
Out[5]: True
So Float.__hash__
returning incorrect results was necessary to get the set/dict behaviour that SymPy needs but this also relied on hoping that there is never a hash collision. Fixing it so that Integer(2) == Float(2)
evaluates to False
now means that a hash collision does not cause the set to drop one of the expressions so Float.__hash__
can be made consistent with Float.__eq__
without in turn causing many other bugs.
The reason that the release note cautions against mixing SymPy expressions with ordinary numbers is that the results can be unpredictable much like {1, 1.0}
and {1.0, 1}
which can result in an int
or float
being in the set. If you care whether the object is an int
or a float
then it is not really acceptable that set
treats them as interchangeable. Likewise if you care whether your object is a SymPy expression or not then you also need to care e.g.:
In [3]: type({Integer(1), 1}.pop())
Out[3]: sympy.core.numbers.One
In [4]: type({1, Integer(1)}.pop())
Out[4]: int
In [5]: {Integer(1), 1}.pop().evalf()
Out[5]: 1.00000000000000
In [6]: {1, Integer(1)}.pop().evalf()
...
AttributeError: 'int' object has no attribute 'evalf'
Python's int
type does not have the same methods as a SymPy expression so it is problematic that the set can unpredictably return either an int
or an Integer
. This ambiguity does not arise (as of SymPy 1.13) if all of the elements of the set are SymPy expressions.
The behaviour of sets and dicts for SymPy expressions and for basic types is not equivalent which is another reason that they should not be mixed together:
In [15]: {2, 2.0}
Out[15]: {2}
In [16]: {Integer(2), Float(2)}
Out[16]: {2, 2.0}
If you have say a list of mixed type items that are possibly SymPy expressions or possibly int or float or other types then the behaviour of set(things)
can be unpredictable. If all the items are SymPy expressions then it will always be predictable. The sympify
function converts standard Python types into SymPy expressions so another way to express this is to say that all of the elements of the set should be sympified.
Upvotes: 2
Reputation: 110121
To put it in other words: it is possible to elements with the same hash to be in the same set, or in the same dictionary as keys - after hash comparison, Python also performs an equality check (==
, using the __eq__
method) to ensure the the object found is indeed the one that was being searched for.
When two (or more) elements with the same hash are in the same set, it's called a "hash collision" - the inner workings for dicts and sets simply compare the repeated-hash elements linearly with eq
when searching.
Here is a quick timing example with a custom class deliberately created to cause hash collisions:
In [12]: class Clash(int):
...: def __hash__(self):
...: return 0
...:
In [13]: class NoClash(int):
...: pass
...:
# for inputs 14-15: I tried to fill a set with "Clash" elements, but I worried about it taking too much time, so I aborted it - with 10_000 I got sub 1second:
In [16]: %time clash_set = {Clash(i) for i in range(10_001)}
CPU times: user 826 ms, sys: 93 μs, total: 826 ms
Wall time: 826 ms
In [17]: %time noclash_set = {NoClash(i) for i in range(10_001)}
CPU times: user 3.45 ms, sys: 0 ns, total: 3.45 ms
Wall time: 3.46 ms
In [18]: %time Clash(10_000) in clash_set
CPU times: user 411 μs, sys: 7 μs, total: 418 μs
Wall time: 423 μs
Out[18]: True
In [19]: %time 10_000 in noclash_set
CPU times: user 12 μs, sys: 1 μs, total: 13 μs
Wall time: 18.4 μs
Out[19]: True
In [20]: %time NoClash(10_000) in noclash_set
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 4.29 μs
Out[20]: True
In time: actually, as sets and dicts won't have 2**64 slots, and usually not 2 ** 32 slots for data, being actually proportional to the data size, what is used is a remainder (or other "hash from the hash") of the hash value to compute the actual slot index a data item is placed upon. Say that ordinary sets are initially allocated to contain 128 elements (a wild guess) - in practice, when populating this, there will be many more collisions (at a certain threshold, the internal set
heuristics will allocate more memory and re-arrange the slots in order to keep efficiency with the growing data content)
Upvotes: 1
Reputation: 10143
Those two elements simply aren't equal.
A simpler example is {-1, -2}
. Both elements have hash -2
, but you wouldn't expect one of them to be kicked out of the set because of that, would you?
Upvotes: 3