I am given two hash functions that I should use for insertion and deletion into the table:
int hash1(int key)
{
    return key % TABLESIZE;
}

int hash2(int key)
{
    return (key % PRIME) + 1;
}
I'm confused about how to use them.
Do I:
TLDR: bucket_selected = hash1(hash2(key))
hash1() is an identity hash from key to bucket (it folds the keys into the table size). If the table size happens to be a power of two, it effectively masks out some number of low bits, discarding the high bits; for example, with table size 256, it's effectively returning key & 255: the least significant 8 bits. This will be somewhat collision prone if the low bits of the keys don't vary well.
If the table size is not a power of two, and ideally is a prime, the high-order bits help spread the keys around the buckets. Collisions are just as likely for e.g. random numbers, but on computer systems sometimes different bits in a key are more or less likely to vary (for example, doubles in memory consist of a sign, many mantissa, and many exponent bits: if your numbers are of similar magnitude, the exponents won't vary much between them), or there are patterns based on power-of-two boundaries. For example, if you have 4 ASCII characters packed into a uint32_t key, then a % table_size hash with table_size 256 extracts just one of the characters as the hash value. If table_size were instead 257, then varying any of the characters would change the bucket selected.
(key % PRIME) + 1 comes close to doing what hash1() would do if the table size were prime, but why add 1? Some languages do index their arrays from 1, which is the only good reason I can think of, but if dealing with such a language, you'd probably want hash1() to add 1 too. To explain the potential use for hash2(), let's take a step back first...
Real general-purpose hash table implementations need to be able to create tables of different sizes - whatever suits the program - and indeed, often applications want the table to "resize" or grow dynamically if more elements are inserted than it can handle well. Because of that, a hash function such as hash1 would be dependent on the hash table implementation or calling code to tell it the current table size. It's normally more convenient if the hash functions can be written independently of any given hash table implementation, only needing the key as input. What many hash functions do is hash the key to a number of a certain size, e.g. a uint32_t or uint64_t. Clearly that means there may be more hash values than there are buckets in the hash table, so a % operation (or faster bitwise & operation if the number of buckets is a power of two) is then used to "fold" the hash value back onto the buckets. So, a good hash table implementation usually accepts a hash function generating e.g. uint32_t or uint64_t output and internally does the % or &.
In the case of hash1, it can be used either directly on the key itself, or applied to the output of such a wider hash function to fold it onto the buckets.
In the second usage, that second hash function could be hash2. This might make sense if the keys given to hash2 were typically much larger than the PRIME used, and yet the PRIME was in turn much larger than the number of buckets. To explain why this is desirable, let's take another step back...
Say you have 8 buckets and a hash function that produces a number in the range [0..10] with uniform probability: if you % the hash values into the table size, hash values 0..7 will map to buckets 0..7, and hash values 8..10 will map to buckets 0..2: buckets 0..2 can be expected to have about twice as many keys collide there as the other buckets. When the range of hash values is vastly larger than the number of buckets, the significance of some buckets receiving one extra hash value than others is correspondingly tiny. Alternatively, if you have say a hash function outputting 32-bit numbers (so the number of distinct hash values is a power of two), then % by a smaller power of two will map exactly the same number of hash values to each bucket.
So, let's return to my earlier assertion: hash2()'s potential utility is actually to use it like this:

bucket_selected = hash1(hash2(key))

In the above formula, hash1 distributes across the buckets and prevents out-of-bounds bucket access; to work reasonably, hash2 should output a range of numbers much larger than the number of buckets, but it won't do anything at all unless the keys span a range larger than PRIME, and ideally they'd span a range vastly larger than PRIME, increasing the odds of hash values from hash2(key) forming a near-uniform distribution between 1 and PRIME.