ophilbinbriscoe

Reputation: 137

Mapping hash values to a range, with minimal collisions

Context

Hi, I'm working on an assignment for school that asks us to implement a hash table in Java. There are no requirements that collisions be kept to a minimum, but a low collision rate and speed seem to be the two most sought-after qualities in all the reading that I've done.

Problem

I'd like some guidance on how to map the output of a hash function to a smaller range, without having >20% of my keys collide (yikes).

In all of the algorithms that I've explored, keys are mapped to the entire range of an unsigned 32-bit integer (or in many cases 64-, even 128-bit). I'm not finding much about this here, on Wikipedia, or in any of the hash-related articles / discussions I've come across.

In terms of the specifics of my implementation, I'm working in Java (mandate of my school), which is problematic since there are no unsigned types to work with. To get around this, I've been using the 64-bit long integer type, then using a bit mask to map back down to 32 bits. Instead of simply truncating, I XOR the top 32 bits with the bottom 32, then perform a bitwise AND to mask out any upper bits that might result in a negative value when I cast it down to a 32 bit integer. After all that, a separate function compresses the resulting hash value down to fit into the bounds of the hash table's inner array.

It ends up looking like:

int hash( String key ) {

    long h = 0;

    for ( int i = 0; i < key.length(); i++ ) {
        // do some stuff with each character in the key
    }

    // fold the top 32 bits into the bottom 32
    h = h ^ ( h >>> 32 );

    // mask off the sign bit so the cast down to int stays non-negative
    return (int) ( h & 2147483647 );
}

Where the inner loop depends on the hash function (I've implemented a few: polynomial hashing, FNV-1, SuperFastHash, and a custom one tailored to the input data).
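For illustration, the simplest of those, a polynomial inner loop similar to what java.lang.String.hashCode does, looks roughly like this (a sketch, not my exact code):

    long polynomialHash( String key ) {

        long h = 0;

        for ( int i = 0; i < key.length(); i++ ) {
            // Horner's rule: multiply by a small constant and add the next character
            h = h * 31 + key.charAt( i );
        }

        return h;
    }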

They basically all perform horribly. I have yet to see <20% of keys collide. Even before I compress the hash values down to array indices, none of my hash functions gets me fewer than 10k collisions. My inputs are two text files, each ~220,000 lines. One is English words, the other is random strings of varying length.

My lecture notes recommend the following, for compressing the hashed keys:

(hashed key) % P

Where P is the largest prime < the size of the inner array.
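In code, that compression step amounts to something like this sketch (compress and P are just illustrative names; P would be whatever prime you pick below the array length):

    // assumes hashedKey is already non-negative, as the masking above guarantees
    int compress( int hashedKey, int P ) {
        return hashedKey % P;
    }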

Is this an accepted method of compressing hash values? I have a feeling it isn't, but since performance is so poor even before compression, I doubt it's the primary culprit either.

Upvotes: 2

Views: 1925

Answers (1)

Francisco Hernandez

Reputation: 2468

I'm not sure I fully understand your exact problem, but I will try to help with hash performance and collisions.

Hash-based collections determine which bucket to store each key-value pair in based on its hash value. Inside each bucket there is a structure (in HashMap's case, a linked list) where the pairs are stored.

If the hash value is usually the same, the bucket will usually be the same, so performance degrades badly. Let's see an example:

Consider this class

package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main (String[] args) {

        Hashtable<MyKey, String> hm = new Hashtable<>();

        long ini = System.currentTimeMillis();

        for (int i=0; i<100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));

            hm.put(a, String.valueOf(i));
        }

        System.out.println(hm.size());

        long fin = System.currentTimeMillis();
        System.out.println("tiempo: " + (fin-ini) + " mls");
    }

    private class MyKey {

        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            // pathological hash: every key lands in the same bucket
            return 0;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}

Note that hashCode in class MyKey always returns '0' as the hash. That is allowed by the hashCode contract (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()). If we run that program, this is the result:

100000
time: 62866 ms

That is very poor performance. Now let's change MyKey's hashCode:

package hashTest;

import java.util.Hashtable;

public class HashTest {

    public static void main (String[] args) {

        Hashtable<MyKey, String> hm = new Hashtable<>();

        long ini = System.currentTimeMillis();

        for (int i=0; i<100000; i++) {
            MyKey a = new HashTest().new MyKey(String.valueOf(i));

            hm.put(a, String.valueOf(i));
        }

        System.out.println(hm.size());

        long fin = System.currentTimeMillis();
        System.out.println("tiempo: " + (fin-ini) + " mls");
    }

    private class MyKey {

        private String str;

        public MyKey(String i) {
            str = i;
        }

        public String getStr() {
            return str;
        }

        @Override
        public int hashCode() {
            // derive the hash from the same field used in equals, scaled by the prime 31
            return str.hashCode() * 31;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyKey) {
                MyKey aux = (MyKey) o;
                if (this.str.equals(aux.getStr())) {
                    return true;
                }
            }
            return false;
        }
    }
}

Note that only hashCode in MyKey has changed. Now when we run the code, the result is:

100000
time: 47 ms

There is an incredible performance improvement now with just a minor change. It is a very common practice to build the hash code with a prime multiplier (in this case 31), using the same members in hashCode that you use inside the equals method to determine whether two objects are the same (in this case, only str).
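For a key with more than one field, the same pattern extends naturally. Here is a sketch with an invented two-field key (the class and field names are only examples):

    private class MyCompositeKey {

        private String str;
        private int num;

        public MyCompositeKey(String str, int num) {
            this.str = str;
            this.num = num;
        }

        @Override
        public int hashCode() {
            // combine the same fields used in equals, chaining with the prime 31
            int result = str.hashCode();
            result = 31 * result + num;
            return result;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof MyCompositeKey) {
                MyCompositeKey aux = (MyCompositeKey) o;
                return this.str.equals(aux.str) && this.num == aux.num;
            }
            return false;
        }
    }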

I hope this little example points you toward a solution to your problem.

Upvotes: 1
