ophilbinbriscoe

Reputation: 137

Mapping hash values to a range, with minimal collisions

Context

Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but a low collision rate and speed seem to be the two most sought-after qualities in all the reading that I've done.

Problem

I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).

In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32-bit integer (or in many cases 64-, even 128-bit). I'm not finding much about this here, on Wikipedia, or in any of the hash-related articles / discussions I've come across.

In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.

It ends up looking like:

int hash( String key ) {

    long h = 0;

    for ( int i = 0; i < key.length(); i++ ) {
        // do some stuff with each character in the key
    }

    // fold the top 32 bits into the bottom 32
    h = h ^ ( h >>> 32 );

    // mask off the sign bit so the cast down to int stays non-negative
    return (int) ( h & 2147483647 );
}

Where the inner loop depends on the hash function (I've implemented a few: polynomial hashing, FNV-1, SuperFastHash, and a custom one tailored to the input data).
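For illustration, the simplest of those, a polynomial inner loop similar to what java.lang.String.hashCode does, looks roughly like this (a sketch, not my exact code):

    long polynomialHash( String key ) {

        long h = 0;

        for ( int i = 0; i < key.length(); i++ ) {
            // Horner's rule: multiply by a small constant and add the next character
            h = h * 31 + key.charAt( i );
        }

        return h;
    }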

They basically all perform horribly. I have yet to see <20% of keys collide. Even before I compress the hash values down to array indices, none of my hash functions gets me fewer than 10k collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.

My lecture notes recommend the following, for compressing the hashed keys:

(hashed key) % P

Where P is the largest prime < the size of the inner array.
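In code, that compression step amounts to something like this sketch (compress and P are just illustrative names; P would be whatever prime you pick below the array length):

    // assumes hashedKey is already non-negative, as the masking above guarantees
    int compress( int hashedKey, int P ) {
        return hashedKey % P;
    }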

Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I doubt it's the primary culprit either.

Upvotes: 2

Views: 1925

Answers (1)

Francisco Hernandez

Reputation: 2468

I'm not sure I fully understand your exact problem, but I will try to help with hash performance and collisions.

Hash-based collections determine which bucket to store each key-value pair in based on its hash value. Inside each bucket there is a structure (in HashMap's case, a linked list) where the pairs are stored.

If the hash value is usually the same, the bucket will usually be the same, so performance degrades badly. Let's see an example:

Consider this class

package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main (String[] args) {

        Hashtable<MyKey, String> hm = new Hashtable<>();

        long ini = System.currentTimeMillis();

        for (int i=0; i<100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));

            hm.put(a, String.valueOf(i));
        }

        System.out.println(hm.size());

        long fin = System.currentTimeMillis();
        System.out.println("tiempo: " + (fin-ini) + " mls");
    }

    private class MyKey {

        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            // pathological hash: every key lands in the same bucket
            return 0;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}

Note that hashCode in class MyKey always returns '0' as the hash. That is allowed by the hashCode contract (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run that program, this is the result:

100000
time: 62866 ms

That is very poor performance. Now let's change MyKey's hashCode:

package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main (String[] args) {

        Hashtable<MyKey, String> hm = new Hashtable<>();

        long ini = System.currentTimeMillis();

        for (int i=0; i<100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));

            hm.put(a, String.valueOf(i));
        }

        System.out.println(hm.size());

        long fin = System.currentTimeMillis();
        System.out.println("tiempo: " + (fin-ini) + " mls");
    }

    private class MyKey {

        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            // derive the hash from the same field used in equals, scaled by the prime 31
            return str.hashCode() * 31;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}

Note that only hashCode in MyKey has changed. Now when we run the code, the result is:

100000
time: 47 ms

There is an incredible performance improvement now with just a minor change. It is a very common practice to build the hash code with a prime multiplier (in this case 31), using the same members in hashCode that you use inside the equals method to determine whether two objects are the same (in this case, only str).
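For a key with more than one field, the same pattern extends naturally. Here is a sketch with an invented two-field key (the class and field names are only examples):

    private class MyCompositeKey {

        private String str;
        private int num;

        public MyCompositeKey(String str, int num) {
            this.str = str;
            this.num = num;
        }

        @Override
        public int hashCode() {
            // combine the same fields used in equals, chaining with the prime 31
            int result = str.hashCode();
            result = 31 * result + num;
            return result;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyCompositeKey) {
                MyCompositeKey aux = (MyCompositeKey) o;
                return this.str.equals(aux.str) && this.num == aux.num;
            }
            return false;
        }
    }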

I hope this little example points you toward a solution to your problem.

Upvotes: 1
