Reputation: 31
I was playing with Python's built-in hash() function on strings when I noticed something odd. Typically, a normal hash function is supposed to be uncorrelated, in the sense that, given hash(A), hash(B) should be completely unrecognizable (for suitable definitions of uncorrelated/unrecognizable).
However, this quick little script shows otherwise:
In [1]: for i in range(15):
...: print hash('test{0}'.format(i))
...:
-5092793511388848639
-5092793511388848640
-5092793511388848637
-5092793511388848638
-5092793511388848635
-5092793511388848636
-5092793511388848633
-5092793511388848634
-5092793511388848631
-5092793511388848632
5207588497627702649
5207588497627702648
5207588497627702651
5207588497627702650
5207588497627702653
I understand Python's hash() function isn't supposed to be cryptographically secure by any stretch, and for that you would use the hashlib library, but why are the values of testX so regularly distributed? This seems to me like it could have poor collision behavior.
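For comparison, running the same strings through a cryptographic hash shows no visible pattern at all. A quick sketch using hashlib.sha256 (printing just the first 16 hex digits to keep the output short; runs on Python 2 or 3):

import hashlib

for i in range(5):
    s = 'test{0}'.format(i)
    # consecutive inputs, but no visible regularity in the digests
    print(hashlib.sha256(s.encode('utf-8')).hexdigest()[:16])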
Upvotes: 2
Views: 549
Reputation: 226694
The explanation can be found in the comments in the source code of Python 2.7's Objects/dictobject.c:
Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
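To make the comment's point concrete, here is a small sketch of that indexing scheme (the table size of 8 is just an illustrative power of two):

# Take the low-order i bits of the hash as the initial slot,
# as the dict comment describes: slot = hash(key) & (2**i - 1).
table_size = 8  # 2**3
for key in range(6):
    print('{0} -> slot {1}'.format(key, hash(key) & (table_size - 1)))
# Consecutive integer keys fill slots 0..5 with no collisions at all.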
Upvotes: 1
Reputation: 54303
The hash is calculated one character after the other. That's why the hashes are so similar.
During the computation, "test0" and "test1" have the exact same hash up to "test". There's only a one-bit difference, in the last character. In secure hashes, changing a single bit anywhere should completely change the whole hash (the avalanche effect, e.g. achieved through multiple passes).
You can check this behaviour by calculating the hash of "0test" and "1test":
>>> for i in range(15):
... print hash('{0}test'.format(i))
...
-2218321119694330423
-198347807511608008
-8430555520134600289
1589425791872121742
-6642709920510870371
-4622800608552147860
8038463826323963107
2058173137418684322
-8620450647505857711
-6600477335291135136
8795071937164440413
4111679291630235372
-765820399655801141
2550858955145994266
6363120682850473265
This is the kind of wide, unpredictable distribution you were expecting, right? By the way, Python 3 seems to use a different hash computation for strings.
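One concrete difference: since Python 3.3, string hashes are randomized per process (SipHash-2-4 since 3.4, per PEP 456), so the same string hashes differently on every run. A small sketch to observe this, meant to be run under Python 3 (under Python 2, where randomization is off by default, the two values would be identical):

import subprocess
import sys

# Hash the same string in two fresh interpreter processes; with
# per-process hash randomization the two results almost surely differ.
cmd = [sys.executable, '-c', "print(hash('test0'))"]
print(subprocess.check_output(cmd))
print(subprocess.check_output(cmd))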
For more information about the Python 2 string hash, take a look at "Python Hash Algorithms":
class string:
    def __hash__(self):
        if not self:
            return 0  # empty string hashes to 0
        value = ord(self[0]) << 7
        for char in self:
            # c_mul is C multiplication: it wraps at the native
            # word size, discarding overflow
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value
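Here is a runnable version of that pseudocode, as a sketch assuming a 64-bit Python 2 build without hash randomization (the default); c_mul and py2_string_hash are names made up for this example:

def c_mul(a, b):
    # C-style multiplication on a 64-bit word: keep the low
    # 64 bits and discard the overflow.
    return (a * b) & 0xFFFFFFFFFFFFFFFF

def py2_string_hash(s):
    if not s:
        return 0
    value = ord(s[0]) << 7
    for char in s:
        value = c_mul(1000003, value) ^ ord(char)
    value ^= len(s)
    # reinterpret the unsigned 64-bit result as a signed C long
    if value >= 2 ** 63:
        value -= 2 ** 64
    if value == -1:
        value = -2  # -1 is reserved as an error code
    return value

print(py2_string_hash('test0'))  # should print -5092793511388848639, as in the question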
By the way, this problem isn't specific to Python. In Java, "Aa" and "BB" share the same hash.
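You can check that Java claim without leaving Python: String.hashCode() is h = 31*h + c over the characters, truncated to a signed 32-bit int. A quick sketch (java_string_hash is a made-up name):

def java_string_hash(s):
    # Java's String.hashCode: h = 31*h + c, wrapping at 32 bits
    h = 0
    for char in s:
        h = (31 * h + ord(char)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, as Java does
    return h - 2 ** 32 if h >= 2 ** 31 else h

print(java_string_hash("Aa"))  # 2112
print(java_string_hash("BB"))  # 2112, same hash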
Upvotes: 2
Reputation: 46921
the python hash() function is not a cryptographic hash (i.e. it need not protect against collisions or show an avalanche effect etc.); it's just an identifier (e.g. to be used as a dictionary key) for objects.
read more about __hash__ and hash() in the documentation.
as stated there:
object.__hash__() should return an integer. The only required property is that objects which compare equal have the same hash value
and - as Jean-François Fabre pointed out in a comment - python hashes must be fast (e.g. for building dictionaries); cryptographic hashes are slow and therefore unusable for that.
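as a minimal sketch of that contract (the Point class is made up for this example), derive __hash__ from exactly the fields that __eq__ compares, so equal objects always hash equal:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # hash the same fields __eq__ compares, so equal points
        # are guaranteed to have equal hashes
        return hash((self.x, self.y))

print(Point(1, 2) == Point(1, 2))              # True
print(hash(Point(1, 2)) == hash(Point(1, 2)))  # True, as required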
by the way: in python 3 the distribution looks way more random.
Upvotes: 2