kevinarpe

Reputation: 21319

Why is this hashCode() method considered poor?

This is a follow-up question to "Using Java 7 HashMap in Java 8". That question drew a number of interesting comments. Some I understand well; others less so.

Why is this hashCode() method considered poor?

At first glance, it looks reasonable to me. Maybe 17 could be increased to 31. Otherwise, it seems to follow the generally accepted formula from Arrays.hashCode(Object[]). One guess: it works for general cases where the number of items is relatively small (less than 10,000), but performs poorly for very large sets (1,000,000 or more).

Here is the original code: (All is included to provide some context.)

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class Test1 {

    static int max_k1 = 500;
    static int max_k2 = 500;

    static Map<Node, Node> map;
    static Random random = new Random();

    public static void main(String[] args) {
        for (int i = 0; i < 15; i++) {
            long start = System.nanoTime();
            run();
            long end = System.nanoTime();
            System.out.println((end - start) / 1_000_000);
        }
    }

    private static void run() {
        map = new HashMap<>();
        for (int i = 0; i < 10_000_000; i++) {
            Node key = new Node(random.nextInt(max_k1), random.nextInt(max_k2));
            Node val = getOrElseUpdate(key);
        }
    }

    private static Node getOrElseUpdate(Node key) {
        Node val;
        if ((val = map.get(key)) == null) {
            val = key;
            map.put(key, val);
        }
        return val;
    }

    private static class Node {

        private int k1;
        private int k2;

        public Node(int k1, int k2) {
            this.k1 = k1;
            this.k2 = k2;
        }

        @Override
        public int hashCode() {
            int result = 17;
            result = 31 * result + k1;
            result = 31 * result + k2;
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;

            if (!(obj instanceof Node))
                return false;

            Node other = (Node) obj;

            return k1 == other.k1 && k2 == other.k2;
        }
    }
}

Upvotes: 1

Views: 1218

Answers (3)

RealSkeptic

Reputation: 34638

I was one of the people who told you it was poor, and I gave you the reason: for the 250,000 possible Node values, it produces only 15,969 hash codes.

If your Node items are supposed to be more or less evenly distributed over the ranges 0 ≤ k1 < 500 and 0 ≤ k2 < 500, then you have 250,000 possible Node values.

A good hash function is supposed to give you hash codes that are as unique as possible for these 250,000 values. That is, ideally, a good hash function should give you a different value for each combination of k1 and k2.

Hash codes are not required to be unique, because in many cases that is impossible: if your objects have trillions and trillions of possible combinations, of course you cannot map all those combinations to different integers.

The standard hash function that you used is meant for that kind of object. If you have evenly distributed objects with a huge range of possibilities, that sort of hash function will end up using all possible integer values, and that's the best it can do.

But in your particular case, you have 250,000 combinations, which can easily be represented in a single integer, using the function 500 * k1 + k2. A completely unique hash function is ideal.

And the "standard" hash function that you used performs poorly because, over such a small range of integers, it maps many of them to the same values, and you end up with only 15,969 unique hash codes. This means many of your Node objects will map to the same hash code (about 250,000 / 15,969 ≈ 16 per code!), so you are going to have a lot of hash collisions.
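A quick way to check those numbers is to count the distinct hash codes each formula produces over the full 500 × 500 domain. This is only an illustrative sketch (the class name is made up, not part of the answer):

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: counts distinct hash codes over all (k1, k2)
// combinations with 0 <= k1 < 500 and 0 <= k2 < 500.
public class CountHashCodes {
    public static void main(String[] args) {
        Set<Integer> posted = new HashSet<>();   // the question's hashCode() formula
        Set<Integer> perfect = new HashSet<>();  // the 500 * k1 + k2 alternative
        for (int k1 = 0; k1 < 500; k1++) {
            for (int k2 = 0; k2 < 500; k2++) {
                posted.add(31 * (31 * 17 + k1) + k2);
                perfect.add(500 * k1 + k2);
            }
        }
        System.out.println(posted.size());   // 15969
        System.out.println(perfect.size());  // 250000
    }
}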

The more hash collisions you have, the worse a hash map performs, because much of a hash map's good performance relies on having as few keys as possible in the same hash bucket. And hash buckets are determined by the hash code.

Upvotes: 6

Peter Lamberg

Reputation: 8651

Your hash function can be written as 31 * 17 * 31 + 31 * k1 + k2.

You can see that adding 31 to k2 and subtracting 1 from k1 yields the same hash value.

So roughly every pair of numbers in the range 0 to 499 will have about sixteen (500 / 31) other pairs with the same hash.
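To make the collision concrete, here is a small hedged sketch (the helper name is made up, and it assumes it lives inside the question's Test1 class so the private Node class is in scope):

// Illustrative only: (k1, k2) and (k1 - 1, k2 + 31) share a hash code.
static void showCollision() {
    Node a = new Node(10, 20);
    Node b = new Node(9, 51);  // k1 - 1, k2 + 31
    System.out.println(a.hashCode() == b.hashCode()); // true: same hash code
    System.out.println(a.equals(b));                  // false: different keys
}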

A hash function that performs perfectly for your sample code would be 500 * k1 + k2. (A quick test shows about a 3x performance gain.)
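A minimal sketch of that replacement, assuming the same Node fields as in the question:

// Every (k1, k2) pair with 0 <= k1, k2 < 500 gets its own distinct hash code.
@Override
public int hashCode() {
    return 500 * k1 + k2;
}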

As Louis Wasserman pointed out, using a well-studied general-purpose hash function from a library is probably a safe bet.

As for why the standard array hash function performs poorly in this case (by the way, IntelliJ generates the same function by default):

Not to claim a complete analysis here, but clearly the larger the number of hashed variables (assuming they are independent in some sense) and the larger the set of possible values of each, the better the function performs. In your case the performance is poor because there are only two variables and both have a small range.

It seems that in Java 8 the HashMap implementation got more complex, presumably because it was optimized for better asymptotic performance in some scenarios. This small added complexity, together with the poorly performing hash function, causes the performance drop.

In that vein, a linear probing hash map could be a better data structure for you. Being a simpler structure that suffers fewer cache misses, it should offer better performance in your read-heavy workload. I am myself interested in a Java library that provides a good general-purpose linear probing hash map.
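To illustrate what linear probing means, here is a rough, hypothetical sketch (the class name is made up, it is not any particular library's API, and resizing and deletion are omitted for brevity; the capacity must be a power of two and larger than the number of entries):

// Toy linear-probing map, for illustration only: collisions are resolved by
// scanning forward to the next free slot, so lookups touch consecutive array
// cells, which tends to be cache-friendly.
class LinearProbingMap<K, V> {
    private final Object[] keys;
    private final Object[] vals;
    private final int mask;

    LinearProbingMap(int capacityPowerOfTwo) {
        keys = new Object[capacityPowerOfTwo];
        vals = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        int i = key.hashCode() & mask;
        while (keys[i] != null) {                 // probe until an empty slot
            if (keys[i].equals(key)) {
                return (V) vals[i];
            }
            i = (i + 1) & mask;
        }
        return null;
    }

    void put(K key, V val) {
        int i = key.hashCode() & mask;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & mask;                   // skip occupied slots
        }
        keys[i] = key;
        vals[i] = val;
    }
}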

Upvotes: 4

Louis Wasserman

Reputation: 198471

The issue is that it doesn't work well when the range of the inputs is small, frankly. It works fine when you have things like Strings, but not for small ints.

You might consider using a hashing algorithm like Murmur. If you can use third-party libraries like Guava, this might be:

return Hashing.murmur3_32().newHasher().putInt(k1).putInt(k2).hash().asInt();
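As a sketch of where that line would go, assuming Guava is on the classpath, it can simply replace the body of the question's Node.hashCode():

import com.google.common.hash.Hashing;

// inside Node:
@Override
public int hashCode() {
    return Hashing.murmur3_32().newHasher().putInt(k1).putInt(k2).hash().asInt();
}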

Upvotes: 2
