PMunch
PMunch

Reputation: 1734

Hashing function to distribute over n values (with a twist)

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist. with small changes of n, few elements should get a new hash. Optimally it should split all k uniformly over n values and if n increases to n+1 only k/n-k/(n+1) values would have to move to uniformly distribute in the new hash. Obviously having a hash which simply creates uniform values and then mod it would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible falls into a new bucket.

Upvotes: 2

Views: 650

Answers (2)

Daniel Wagner
Daniel Wagner

Reputation: 153182

Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N.

  1. Compute H(v).
  2. Keep just the first n bits.
  3. If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.

Some properties of this technique:

  • You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
  • If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
  • If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.

Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).

Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncracies of my code -- Python is a foreign language.

import math

def ilog2(m): return int(math.ceil(math.log(m,2)))

def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask

def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)

def bump_stat(objs, N):
    return len([obj for obj in objs if same_hash(obj, N, N+1)])

alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]

def main():
    print len(ascending)
    print bump_stat(ascending, 10)
    print float(bump_stat(ascending, 16))/len(ascending)
# prints:
# 702
# 639
# 0.555555555556

Upvotes: 3

Matt Timmermans
Matt Timmermans

Reputation: 59323

Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.

That is easily accomplished:

Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):

Let H(k) be the hash of k.

int hash = H(k);
for (int n=N-1;n>0;--n) {
   if ((mix(hash,n) % (i+1))==0) {
      break;
   }
}

//put it in node n

So, when you add node node 1, it steals half the items from node 0. When you add node 2, it steals 1/3 of the items from the previous 2 nodes. And so on...

EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.

Upvotes: 0

Related Questions