Matthias
Matthias

Reputation: 110

Java hashcode() collision for objects containing different but similar Strings

While verifying output data of my program, I identified cases for which hash codes of two different objects were identical. To get these codes, I used the following function:

int getHash( long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate ) {

    int result = 17;
    result = 31 * result + (int) (lID ^ (lID >>> 32));
    long temp;
    temp = Double.doubleToLongBits(dGR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    temp = Double.doubleToLongBits(dSR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
    result = 31 * result + (sCO != null ? sCO.hashCode() : 0);
    result = 31 * result + (sSearchDate != null ? sSearchDate.hashCode() : 0);

    return result;
}

These are two example cases:

getHash( 50122,"03/25/2015","03/26/2015",4.0,8.0,"03/24/15 06:01" )
getHash( 51114,"03/24/2015","03/25/2015",4.0,8.0,"03/24/15 06:01" )

I suppose, this issue arises, as I have three very similar strings present in my data, and the difference in the hashcode between String A to B and B to C are of the same size, leading to an identical returned hashcode.

The proposed hashcode() implementation by IntelliJ is using 31 as a multiplier for each variable that contributes to the final hashcode. I was wondering why one is not using different values for each variable (like 33, 37, 41 (which I have seen mentioned in other posts dealing with hashcodes))? In my case, this would lead to a differentiation between my two objects.

But I'm wondering whether this could then lead to issues in other cases?

Any ideas or hints on this? Thank you very much!

Upvotes: 3

Views: 1282

Answers (3)

Anderson Vieira
Anderson Vieira

Reputation: 9049

The hashCode() contract allows different objects to have the same hash code. From the documentation:

It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

But, since you've got a bunch of parameters for your hash, you may consider using Objects.hash() instead of doing your own implementation:

@Override
int getHash(long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate) {
    return Objects.hash(lID, sCI, sCO, dSR, dGR, sSearchDate);
}

For example:

Objects.hash(50122, "03/25/2015", "03/26/2015", 4.0, 8.0, "03/24/15 06:01")
Objects.hash(51114, "03/24/2015", "03/25/2015", 4.0, 8.0, "03/24/15 06:01")

Results in:

-733895022
-394580334

Upvotes: 5

juhist
juhist

Reputation: 4314

Ash the other answers explain well, it is allowed for hashCode to return same values for different objects. This is not a cryptographic hash value so it's easy to find examples of hashCode collisions.

However, I point out a problem in your code: if you have made the hashCode method yourself, you should definitely be using a better hash algorithm. Take a look at MurmurHash: http://en.wikipedia.org/wiki/MurmurHash. You want to use the 32-bit version. There are also Java implementations.

Yes, hash collisions can lead to performance issues. Therefore it's important to use a good hash algorithm. Additionally, for security MurmurHash allows a seed value to make hash collision denial of service attacks harder. You should generate that seed value you use randomly on the start of the program. Your implementation of the hashCode method is vulnerable to these hash collision DoS attacks.

Upvotes: 1

Sebastian
Sebastian

Reputation: 415

The code shown by your may add zero for example by

result = 31 * result + (sCI != null ? sCI.hashCode() : 0);

When adding some zeros this may degenerate to a multiplation of

31 * 31 * 31 ...

which could destroy uniqueness.

However the hashCode method is not intended to return unique values. It simply should provide a uniform distribution of values and it should be easy to compute (or cache hashCode as the String class does).

From a more theoretical point of view a hashCode maps from a large set A into a smaller set B. Hence collisions (different elements from A map to the same value in B) are unavoidable. You could choose a set B which is bigger than A but this would violate the purpose of hashCode: performance optimization. Really, you could achieve anything with a linked list and some additional logic what you achieve with hashCode.

Prime numbers are chosen as they result in a better distribution. For example if using none primes 4*3 = 12 = 2*6 result in the same hashCode. The 31 is sometimes chosen as it is a Mersenne prime number 2^n-1 which is said to perform better on processors (I'm not sure about that).

As the hashCode method is specified not return unambiguously identify elements non-unique hashCodes are perfectly fine. Assuming uniqueness of hashCodes is a bug.

However a HashMap can be described as a set of buckets with each bucket holding a single linked list of elements. The buckets are indexed by the hashCode. Hence providing identical hashCodes leads to less buckets with longer lists. In the most extreme case (returning an arbitrary constant as hashCode) the map degenerates to a linked list.

When an object is searched in a hash data structure, the hashCode is used to get the bucket index. For each objetc in this bucket the equals method is invoked -> long lists means a large number of invocations of equals.

Conclusion: Assuming that the hashCode method is correctly used this can not cause a program to malfunction. However it may result in a severe performance penalty.

Upvotes: 1

Related Questions