Reputation: 1734
I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist. with small changes of n, few elements should get a new hash. Optimally it should split all k uniformly over n values and if n increases to n+1 only k/n-k/(n+1) values would have to move to uniformly distribute in the new hash. Obviously having a hash which simply creates uniform values and then mod it would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible falls into a new bucket.
Upvotes: 2
Views: 650
Reputation: 153182
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N.
Some properties of this technique:
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main
that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncracies of my code -- Python is a foreign language.
import math
def ilog2(m): return int(math.ceil(math.log(m,2)))
def hash_into(obj, N):
cur_hash = hash(obj)
mask = pow(2, ilog2(N)) - 1
while (cur_hash & mask) >= N:
# seems Python uses the identity for its hash on integers, which
# doesn't iterate well; let's use literally any other hash at all
cur_hash = hash(str(cur_hash))
return cur_hash & mask
def same_hash(obj, N, N2):
return hash_into(obj, N) == hash_into(obj, N2)
def bump_stat(objs, N):
return len([obj for obj in objs if same_hash(obj, N, N+1)])
alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]
def main():
print len(ascending)
print bump_stat(ascending, 10)
print float(bump_stat(ascending, 16))/len(ascending)
# prints:
# 702
# 639
# 0.555555555556
Upvotes: 3
Reputation: 59323
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
for (int n=N-1;n>0;--n) {
if ((mix(hash,n) % (i+1))==0) {
break;
}
}
//put it in node n
So, when you add node node 1, it steals half the items from node 0. When you add node 2, it steals 1/3 of the items from the previous 2 nodes. And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.
Upvotes: 0