user173342
user173342

Reputation: 1840

Hash integers into N roughly equal sized buckets directly based on upper bits (maintaining order)

Preferably a quick way. The N = 2^b case is pretty easy. For it, first I'd figure out how many bits are in my chosen type:

typedef unsigned int type;
size_t size = sizeof(type) * 8;

Then I would perform a right shift by the proper amount of bits to produce a hash key of the upper b bits.

type input = 0x657;
unsigned char b = 4;
unsigned char hash = input >> (size - b);

But what if I wanted N = 3? Or any other non power of 2? Assuming my N will always fit inside an unsigned char (so it will be 256 at most), what would be the quickest way to hash some input? While keeping the buckets no more than +/- 1 difference in range, and also preserving the order of the upper bits of input, like the above function does.

Upvotes: 3

Views: 223

Answers (1)

rici
rici

Reputation: 241691

For 32-bit values, do a 64-bit multiply with N and keep the top 32 bits. (Similarly for other sizes, although if you have 64-bit values the multiplication gets trickier.)

Here's a basic proof outline.

It's obvious that this mapping is order-preserving; the only question is how many values fall into each bucket. Now, consider some bucket j and find the smallest i which maps to that bucket. For i to fall into bucket j means that Ni − j×232 = m where 0 ≤ m < 232, but if i is the smallest such value, 0 ≤ m < N. (Otherwise, i−1 would also fall into bucket j.)

Now, define w = ⌊232∕N⌋, which is equivalent to saying that Nw − 232 = −m' where 0 ≤ m' < N. By adding those two formulas up, we discover that Ni + Nw - j×232 - 232 = m−m'; simplifying, we get N(i+w)-(j+1)×232 = m−m' and −N < m−m' < N. That means that either i + w or i + w + 1 is the smallest value which maps to j + 1 (depending on whether m − m' is negative or not), which means that there are either w or w + 1 values which map to j. Since that's true for any j, we can definitely say that there are only two bucket sizes, one of which is ⌊232∕N⌋. It's not far from there to the assertion I made in my comment (now deleted) which is that the other possible bucket size is ⌈232∕N⌉.

There's nothing magic about 232 in the above proof; I could have used any value M. But that would have made the condensed proof even hard to read.

Upvotes: 3

Related Questions