Cheng.T
Cheng.T

Reputation: 346

Why was the default hashCode generation in jvm switched to xor-shift in JDK 8?

I am learning JVM code to understand Java deeper. In synchronizer.cpp (in the get_next_hash method) there is a comment that says:

// Marsaglia's xor-shift scheme with thread-specific state
// This is probably the best overall implementation -- we'll
// likely make this the default in future releases.

This is found in the in else branch when variable hashcode is not any of (0,1,2,3,4). this variable can be set via the JVM option "-XX:hashcode=n".

I wrote some code to test those hash algorithms:

public static void main(String[] args) {
    long now = System.currentTimeMillis();
    RuntimeMXBean runtimeMxBean = ManagementFactory.getRuntimeMXBean();
    List<String> arguments = runtimeMxBean.getInputArguments();
    for(String s:arguments){
        System.out.println(s);
    }

    HashMap<Integer,Object> h = new HashMap<>();

    ArrayList<Object> arrayList = new ArrayList<>();

    for(int i=0;i<2000000;i++){
        Object o = new Object();
        if(h.containsKey(o.hashCode())){
            arrayList.add(o);
            continue;
        }
        h.put(o.hashCode(),o);
    }
    long currentTimeMillis = System.currentTimeMillis();
    System.err.println("hashcode collision:"+arrayList.size());
    System.err.println(" used time "+(currentTimeMillis - now));

}

Before I ran the test, I expected I will get the fewest collisions when I set "-XX:hashcode=5", but that's not the case:

| n | algorithm      |collisions|
|---|----------------|----------|
| 0 | rondom         |        0 |
| 1 | addr-XOR-SHIFT |        0 |
| 2 | invarible-one  |  1999999 |
| 3 | autoincrease   |        0 |
| 4 | addr           |    23511 |
| 5 | xor-shift      |      962 |

Then I set times to 20000000 and addr-XOR-SHIFT is still 0. My question is: is it better then xor-shift? Why does jdk-8 make "-XX:hashcode=5" the default?

Upvotes: 4

Views: 463

Answers (1)

apangin
apangin

Reputation: 98334

The properties of a good hash function include 1) randomness, 2) uniformity, 3) performance, 4) scalability. The small number of collisions does not mean the hash function is random enough, e.g. in your test sequential hashCodes also give no collisions, but obviously this is not a good hash function.

Also, you tested only a single thread case. With a single thread, -XX:hashCode=0 (Park-Miller RNG algorithm which was default before JDK 8) behaves quite good. However, it becomes awful in a highly concurrent applications: the performance gets poor because of a high contention on a global varible, and the chances of generating the same hashCode in different threads increase because of the race condition, see the comment in the source code:

  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;

-XX:hashCode=1 is also far from being perfect in terms of randomness. It just XORs the address of on object with a global variable updated only at stop-the-world JVM pauses:

  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;

You may find the discussion and the analysis of the different hashCode algorithms in this email thread.

In short, only -XX:hashCode=0 and -XX:hashCode=5 provide a good randomness, while the latter is far more scalable and performant, since it uses only simple bitwise operations and does not update global variables.

Upvotes: 6

Related Questions