Reputation: 31
I was playing with Python's built-in hash() function on strings when I noticed something odd. Typically, a normal hash function is supposed to be uncorrelated, in the sense that, given hash(A), hash(B) should be completely unrecognizable (for suitable definitions of uncorrelated/unrecognizable).
However, this quick little script shows otherwise:
In [1]: for i in range(15):
...: print hash('test{0}'.format(i))
...:
-5092793511388848639
-5092793511388848640
-5092793511388848637
-5092793511388848638
-5092793511388848635
-5092793511388848636
-5092793511388848633
-5092793511388848634
-5092793511388848631
-5092793511388848632
5207588497627702649
5207588497627702648
5207588497627702651
5207588497627702650
5207588497627702653
I understand Python's hash() function isn't supposed to be cryptographically secure by any stretch, and for that you would use the hashlib library, but why are the values of testX so regularly distributed? This seems to me like it could have poor collision behavior.
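For comparison, running the same strings through a cryptographic hash shows no visible pattern at all. A quick sketch using hashlib.sha256 (printing just the first 16 hex digits to keep the output short; runs on Python 2 or 3):

import hashlib

for i in range(5):
    s = 'test{0}'.format(i)
    # consecutive inputs, but no visible regularity in the digests
    print(hashlib.sha256(s.encode('utf-8')).hexdigest()[:16])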
Upvotes: 2
Views: 549
Reputation: 226694
The explanation can be found in the comments in the source code of Python 2.7's Objects/dictobject.c:
Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
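To make the comment's point concrete, here is a small sketch of that indexing scheme (the table size of 8 is just an illustrative power of two):

# Take the low-order i bits of the hash as the initial slot,
# as the dict comment describes: slot = hash(key) & (2**i - 1).
table_size = 8  # 2**3
for key in range(6):
    print('{0} -> slot {1}'.format(key, hash(key) & (table_size - 1)))
# Consecutive integer keys fill slots 0..5 with no collisions at all.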
Upvotes: 1
Reputation: 54303
The hash is calculated one character after the other. That's why the hashes are so similar.
During the computation, "test0" and "test1" have the exact same hash up to "test". There's only a one-bit difference, in the last character. In secure hashes, changing a single bit anywhere should completely change the whole hash (the avalanche effect, e.g. achieved through multiple passes).
You can check this behaviour by calculating the hash of "0test" and "1test":
>>> for i in range(15):
... print hash('{0}test'.format(i))
...
-2218321119694330423
-198347807511608008
-8430555520134600289
1589425791872121742
-6642709920510870371
-4622800608552147860
8038463826323963107
2058173137418684322
-8620450647505857711
-6600477335291135136
8795071937164440413
4111679291630235372
-765820399655801141
2550858955145994266
6363120682850473265
This is the kind of wide, unpredictable distribution you were expecting, right? By the way, Python 3 seems to use a different hash computation for strings.
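One concrete difference: since Python 3.3, string hashes are randomized per process (SipHash-2-4 since 3.4, per PEP 456), so the same string hashes differently on every run. A small sketch to observe this, meant to be run under Python 3 (under Python 2, where randomization is off by default, the two values would be identical):

import subprocess
import sys

# Hash the same string in two fresh interpreter processes; with
# per-process hash randomization the two results almost surely differ.
cmd = [sys.executable, '-c', "print(hash('test0'))"]
print(subprocess.check_output(cmd))
print(subprocess.check_output(cmd))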
For more information about the Python 2 string hash, take a look at "Python Hash Algorithms":
class string:
    def __hash__(self):
        if not self:
            return 0  # empty string hashes to 0
        value = ord(self[0]) << 7
        for char in self:
            # c_mul is C multiplication: it wraps at the native
            # word size, discarding overflow
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value
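Here is a runnable version of that pseudocode, as a sketch assuming a 64-bit Python 2 build without hash randomization (the default); c_mul and py2_string_hash are names made up for this example:

def c_mul(a, b):
    # C-style multiplication on a 64-bit word: keep the low
    # 64 bits and discard the overflow.
    return (a * b) & 0xFFFFFFFFFFFFFFFF

def py2_string_hash(s):
    if not s:
        return 0
    value = ord(s[0]) << 7
    for char in s:
        value = c_mul(1000003, value) ^ ord(char)
    value ^= len(s)
    # reinterpret the unsigned 64-bit result as a signed C long
    if value >= 2 ** 63:
        value -= 2 ** 64
    if value == -1:
        value = -2  # -1 is reserved as an error code
    return value

print(py2_string_hash('test0'))  # should print -5092793511388848639, as in the question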
By the way, this problem isn't specific to Python. In Java, "Aa" and "BB" share the same hash.
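You can check that Java claim without leaving Python: String.hashCode() is h = 31*h + c over the characters, truncated to a signed 32-bit int. A quick sketch (java_string_hash is a made-up name):

def java_string_hash(s):
    # Java's String.hashCode: h = 31*h + c, wrapping at 32 bits
    h = 0
    for char in s:
        h = (31 * h + ord(char)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, as Java does
    return h - 2 ** 32 if h >= 2 ** 31 else h

print(java_string_hash("Aa"))  # 2112
print(java_string_hash("BB"))  # 2112, same hash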
Upvotes: 2
Reputation: 46921
the python hash() function is not a cryptographic hash (i.e. it need not protect against collisions or show an avalanche effect etc.); it's just an identifier (e.g. to be used as a dictionary key) for objects.
read more about __hash__ and hash() in the documentation.
as stated there:
object.__hash__() should return an integer. The only required property is that objects which compare equal have the same hash value
and - as Jean-François Fabre pointed out in a comment - python hashes must be fast (e.g. for building dictionaries); cryptographic hashes are slow and therefore unusable for that.
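as a minimal sketch of that contract (the Point class is made up for this example), derive __hash__ from exactly the fields that __eq__ compares, so equal objects always hash equal:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # hash the same fields __eq__ compares, so equal points
        # are guaranteed to have equal hashes
        return hash((self.x, self.y))

print(Point(1, 2) == Point(1, 2))              # True
print(hash(Point(1, 2)) == hash(Point(1, 2)))  # True, as required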
by the way: in python 3 the distribution looks way more random.
Upvotes: 2