Hashing function to distribute over n values (with a twist)

Question

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist. with small changes of n, few elements should get a new hash. Optimally it should split all k uniformly over n values and if n increases to n+1 only k/n-k/(n+1) values would have to move to uniformly distribute in the new hash. Obviously having a hash which simply creates uniform values and then mod it would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible falls into a new bucket.

Daniel Wagner · Accepted Answer

Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N.

Compute H(v).
Keep just the first n bits.
If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.

Some properties of this technique:

You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.

Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).

Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncracies of my code -- Python is a foreign language.

import math

def ilog2(m): return int(math.ceil(math.log(m,2)))

def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask

def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)

def bump_stat(objs, N):
    return len([obj for obj in objs if same_hash(obj, N, N+1)])

alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]

def main():
    print len(ascending)
    print bump_stat(ascending, 10)
    print float(bump_stat(ascending, 16))/len(ascending)
# prints:
# 702
# 639
# 0.555555555556

Hashing function to distribute over n values (with a twist)

Answers (2)

Related Questions