Mary Ryllo
Mary Ryllo

Reputation: 2371

Why does the equals method in String not use hash?

The code of the equals method in class String is

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = offset;
            int j = anotherString.offset;
            while (n-- != 0) {
                if (v1[i++] != v2[j++])
                    return false;
            }
            return true;
        }
    }
    return false;
}

I have a question - why does this method not use hashCode() ?

As far as I know, hashCode() can compare two strings rapidly.

UPDATE: I know, that two unequal strings, can have same hashes. But two equal strings have equal hashes. So, by using hashCode(), we can immediately see that two strings are unequal.

I'm simply thinking that using hashCode() can be a good filter in equals.

UPDATE 2: Here some code, about we are talking here.

It is an example how String method equals can look like

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        if (hashCode() == anotherString.hashCode()){
            int n = count;
            if (n == anotherString.count) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = offset;
                int j = anotherString.offset;
                while (n-- != 0) {
                    if (v1[i++] != v2[j++])
                        return false;
                }
                return true;
            }
        }else{
            return false;
        }
    }
    return false;
}

Upvotes: 46

Views: 6221

Answers (8)

assylias
assylias

Reputation: 328649

This question has actually been considered by the developers of the JDK. I could not find in the various messages why it has not been included. The enhancement is also listed in the bug database.

Namely, one of the proposed change is:

public boolean equals(Object anObject) {
    if (this == anObject) // 1st check identitiy
        return true;
    if (anObject instanceof String) { // 2nd check type
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) { // 3rd check lengths
            if (n != 0) { // 4th avoid loading registers from members if length == 0
                int h1 = hash, h2 = anotherString.hash;
                if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
                    return false;

There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).

Upvotes: 16

gnasher729
gnasher729

Reputation: 52538

If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.

String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.

Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).

Upvotes: 0

tb-
tb-

Reputation: 1290

As I think, hashCode() can make comparison of two strings quicker.

Arguments?

Arguments against this proposal:

  • More Operations

hashcode() from String has to access every character in the String and has to do 2 calculations for every character.
So we need for a string with n characters 5*n operations (load, multiplication, lookup/load, multiplication, store). Two times, because we compare two Strings. (Ok, one store and one load does not really count in a reasonable implementation.)
For the best case, this makes a total of 10*x operations for two strings with length m and n and x=min(m,n). Worst case is 10*x with x=m=n. Average somewhere between, perhaps (m*n)/2.

The current equals implementation needs in the best case 3 operations. 2 loads, 1 compare. Worst is 3*x operations for two strings with length m and n and x=m=n. Average is somewhere between, perhaps 3*(m*n)/2.

  • Even if we cache the hash, it is not clear that we save something

We have to analyze usage patterns. It could be that most of the time, we will only ask one time for equals, not multiple times. Even if we ask multiple times, it could not be enough to have time savings from the caching.

Not direct against the speed, but still good counterarguments:

  • Counter intuitive

We do not expect a hashcode in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowledge about hashing) will wonder what is happening there.

  • Leads to bad examples/unexpected behavior

I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equal() against 'b'?" :)

Upvotes: 0

irreputable
irreputable

Reputation: 45443

This can be a good idea for many use cases.

However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.

I'm gonna guess that the majority of String.equals() are invoked in a Hashmap, after the hash codes has been known to be equal, so testing hash codes again is pointless.

If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and char-by-char comparison fails on 1st char. So it'll be a waste to check hashes.

Upvotes: 4

Peter Lawrey
Peter Lawrey

Reputation: 533530

AFAIK, The following check could be added to String. This check that if the hash codes are set and they are different, then the Strings cannot be equal.

if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
    return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
    return false;

Upvotes: 2

Audrius Meškauskas
Audrius Meškauskas

Reputation: 21748

The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.

While String appears as immutable for the usual programmer, it does have the private field to store its hashcode once it is computed. However this field is only computed when hashcode is first required. As you can see from the String source code here:

 private int hash;

 public int hashCode() {
      int h = hash;
      if (h == 0) {
         ...
         hash = h;  
      }
      return h;
 }

Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe same instances of really long strings are compared with each other a really many of times), it still may be: profile.

Upvotes: 0

parsifal
parsifal

Reputation: 1663

Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.

  1. String hashcodes are lazily calculated, although they do use a "guard" value. If you're comparing strings with long lifetimes (ie, they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or ignoring the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be ignoring the check more often than you'll be using it.
  2. In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.

Upvotes: 41

MrSmith42
MrSmith42

Reputation: 10151

1) Calculating hashCode may not be faster than comparing the Strings directly.

2) if the hashCode is equal, the Strings may still not be equal

Upvotes: 7

Related Questions