Jonathan Scialpi
Jonathan Scialpi

Reputation: 841

What is string lexicographically? Java

The compareTo() method in Java compares two strings "lexicographically". Can someone please simply explain how the lexicographic comparison works in java?

I found this post that explains the three cases of <0 , ==0, and >0 ; However, I am still confused...

Does this mean that the int returned is the number of places away the strings are form one another if they were to be sorted alphabetically like a dictionary?

Also, how does the method deal with case sensitivity? Are lower case letters first in line before uppercase? Is there a chart for this?

For example, the below code produces an output of -31. Does this mean that the string Dog is -31 places away from the string cat?

public static void main(String[] args) {
     Scanner keyboard = new Scanner(System.in);   

     String str1 = "Dog";

     String str2 = "cat";

     int result = str1.compareTo(str2);
     System.out.println(result);

Upvotes: 5

Views: 14777

Answers (2)

Kylar
Kylar

Reputation: 9334

I found Wikipedia's Definition of Lexicographical order very useful in answering your question.

In a simplistic manner, the comparison is a numeric result from doing an alphabetic comparison. In alphabetic comparison, we compare the ordered set of letters that make up a sequence (usually words or strings). The return value will be 0 if the two are equal, and < or > depending on which value is alphabetically before or after the other.

take a list of words:

  • cat
  • dog
  • animal
  • aardvark

If we compare these, we take the first character of each and look. When we compare 'cat' and 'dog', we take the first character 'c' and 'd' and compare them. Numerically in code, the easy(not necessarily best) way to do this is to convert them to a numeric value and subtract one value from the other. This will equal 0 if they are the same, and we'll go to comparing the next character in each. If they're different, then we know that one is lexicographically (alphabetically) after the other.

The return value is not required to have any insightful information. That's why the only values that mean anything are <0 , ==0, and >0.

With regards to casing, that is an implementation detail - There are comparators that will consider an upper case 'A' to be the same as a lower case 'a', and there are comparators that don't, since they have different numerical values. (See: How to sort alphabetically while ignoring case sensitive?).

Upvotes: 1

Jean-Fran&#231;ois Savard
Jean-Fran&#231;ois Savard

Reputation: 21004

The value returned does not really matter as the compareTo contract is to return negative, positive or 0 (as you already know).

However, if really you want to understand why -31 is returned when comparing Dog with cat (or any other string) then you could simply look at the method directly in String class :

public int compareTo(String anotherString) {
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;

    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}

Keep in mind that value is the char array backing the string.

private final char value[];

So how does this method proceed ?

  • You retrieve the minimum of both string length in a variable lim.
  • You create a copy of both string char array.
  • You loop over each characters (verifying if they are equals) until reaching the lowest limit.
  • If two characters at same index are not equals, you return the result of substracting the second one to the first. The char can be represented as int value (which take their ascii value) and are already ordered. Thus when substracting a negative number will be returned if the second char is "higher" then the first one. A positive will be returned if the second char is "lower" then the first one. 0 will be returned if both are equals.
  • If all characters were equals while looping for the lowest string length, you return a substraction of both length.

In your example, first letter of both words are not equals so you get to compare D with c which are respectively represented as 68 and 99. Substract 99 to 68 and you get -31.

So to answer this question :

Does this mean that the int returned is the number of places away the strings are form one another if they were to be sorted alphabetically like a dictionary?

No, it is actually either the difference between two non-matching char's ascii value or the difference of both length.

Also, how does the method deal with case sensitivity? Are lower case letters first in line before uppercase? Is there a chart for this?

If you want to ignore the case when comparing, you can use String#compareToIgnoreCase.

Also you can check this chart for ascii values (upper and lower case).

Upvotes: 5

Related Questions