wim
wim

Reputation: 362786

What is the default hash of user defined classes?

The docs incorrectly claim that

Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is their id()

Although I recall this being correct once, such objects hashing equal to their id is apparently not true in current versions of python (v2.7.10, v3.5.0).

>>> class A:
...     pass
... 
>>> a = A()
>>> hash(a)
-9223372036578022804
>>> id(a)
4428048072

In another part of the docs it's said that the hash is derived from the id. When/why did the implementation change, and how is the number returned by hash "derived from" the id now?

Upvotes: 4

Views: 1453

Answers (2)

Martijn Pieters
Martijn Pieters

Reputation: 1122222

This changed in 2009, as a result of issue #5186; the usual id() values caused too many collisions:

In the issue 5169 discussion, Antoine Pitrou suggested that for an object 
x without a `__hash__` method, `id()/8` might be a better hash value than 
`id()`, since dicts use the low order bits of the hash as initial key, and 
the 3 lowest bits of an `id()` will always be zero.

The current implementation takes the id and rotates it to produce a more varying value:

long
_Py_HashPointer(void *p)
{
    long x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (long)y;
    if (x == -1)
        x = -2;
    return x;
}

This resulted in a 14% to 34% speedup, depending on the test performed.

The glossary is simply out of date; I see you already opened an issue with the project.

Upvotes: 2

circular-ruin
circular-ruin

Reputation: 2884

The relevant function appears to be:

Py_hash_t
_Py_HashPointer(void *p)
{
    Py_hash_t x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (Py_hash_t)y;
    if (x == -1)
        x = -2;
    return x;
}

(that code comes from here, and is then used to be the tp_hash slot in type here.) The comment there seems to give a reason for not using the pointer (which is the same thing as the id) directly. Indeed, the commit that introduced that change to the function is here, and states that the reason for the change is:

Issue #5186: Reduce hash collisions for objects with no hash method by rotating the object pointer by 4 bits to the right.

which refers to this issue, which explains more why the change was made.

Upvotes: 6

Related Questions