mindas

Reputation: 26713

Case sensitive order using Java Collator

I am trying to understand how case-sensitive ordering is really supposed to work with the Java Collator.

In this example, the following strings are sorted in the French locale at each strength (I have added a few extra strings to the data set for illustrative purposes):

[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] - Original Data
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] Primary
[Abc, abc, ABC, Àbc, àbc, Äbc, äbc] Secondary
[abc, Abc, ABC, àbc, Àbc, äbc, Äbc] Tertiary

Case only comes into play at tertiary collation strength:
[CACHE, cache, Cache, da, DA, Da] - Original Data
[CACHE, cache, Cache, da, DA, Da] Primary
[CACHE, cache, Cache, da, DA, Da] Secondary
[cache, Cache, CACHE, da, Da, DA] Tertiary

But the result I was really expecting was this:

[abc, àbc, äbc, Abc, ABC, Àbc, Äbc] Tertiary
[cache, da, Cache, CACHE, Da, DA] Tertiary

In other words, I would like all lowercase strings to go first (sorted alphabetically), followed by the uppercase ones (or vice versa). Is this not a reasonable expectation?

Upvotes: 8

Views: 5110

Answers (4)

live-love

Reputation: 52366

Another option: if you need to customize the rules of a locale, you can try using RuleBasedCollator:

    RuleBasedCollator collTemp = (RuleBasedCollator) Collator.getInstance(Locale.US);

    String usRules = collTemp.getRules();

    // Remove the dash rule from the US locale (dashes come after letters)
    usRules = usRules.replace(",'-'", "");

    // Create a collator with the customized rules
    // (this constructor throws java.text.ParseException if the rules are invalid)
    RuleBasedCollator coll = new RuleBasedCollator(usRules);

    // Sort the collection (lines) using the collator
    Collections.sort(lines, coll);

Upvotes: 1

Holger

Reputation: 298143

You should not make assumptions about the resulting ordering of a locale-sensitive collator.

It’s not meant to reflect technical aspects like ASCII order but human language rules, e.g. how people would sort book titles in a library or names in a phone book. You usually won’t find shelves of uppercase books separated from shelves of lowercase ones.

To illustrate more surprising behavior, look at the following example:

    String s1 = "IDONTCARE", s2 = "idontcare";
    System.out.println("Comparing '" + s1 + "' and '" + s2 + "' locale sensitive");
    Locale[] all = { Locale.ENGLISH, new Locale("tr") };
    for (Locale l : all) {
        System.out.println();
        System.out.println(l);
        Collator c1 = Collator.getInstance(l);
        c1.setStrength(Collator.PRIMARY);
        System.out.println("primary:\t" + c1.compare(s1, s2));
        c1.setStrength(Collator.SECONDARY);
        System.out.println("secondary:\t" + c1.compare(s1, s2));
        c1.setStrength(Collator.TERTIARY);
        System.out.println("tertiary:\t" + c1.compare(s1, s2));
        c1.setStrength(Collator.IDENTICAL);
        System.out.println("identical:\t" + c1.compare(s1, s2));
    }

It will print:

    Comparing 'IDONTCARE' and 'idontcare' locale sensitive

    en
    primary:    0
    secondary:  0
    tertiary:   1
    identical:  1

    tr
    primary:    -1
    secondary:  -1
    tertiary:   -1
    identical:  -1

As said, don’t expect to know the result in advance, and forget about ASCII/Unicode lexicographic order when working with collators.

Upvotes: 0

krishnakumarp

Reputation: 9295

The sample code is working as intended. You can use custom collation rules to get the desired output.

RuleBasedCollator is the only subclass of Collator in the JDK. Your call to Collator.getInstance(Locale.FRANCE) returns an instance of RuleBasedCollator.

You could create your own instance using

    RuleBasedCollator myCollator = new RuleBasedCollator(rules);

The format for rules is given in the javadoc.
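As an illustration of such custom rules, here is a minimal sketch that puts every lowercase letter before every uppercase one. It is an assumption-laden example, not the definitive solution: the class and method names (LowercaseFirstDemo, buildRules, sortLowercaseFirst) are made up for this sketch, and the rules cover ASCII letters only, so accented characters such as à or ä would need to be added to the rule string separately.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LowercaseFirstDemo {

    // Builds rules of the form "< a < b ... < z < A < B ... < Z".
    // Each '<' declares a primary difference, so every lowercase ASCII
    // letter sorts before every uppercase one.
    // Assumption: only ASCII letters are covered by these rules.
    static String buildRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(' ');
        }
        for (char c = 'A'; c <= 'Z'; c++) {
            rules.append("< ").append(c).append(' ');
        }
        return rules.toString().trim();
    }

    // Returns a sorted copy of the input, leaving the input list untouched.
    static List<String> sortLowercaseFirst(List<String> input) {
        RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(buildRules());
        } catch (ParseException e) {
            // Would only happen if the generated rule string were malformed.
            throw new IllegalStateException(e);
        }
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("CACHE", "cache", "Cache", "da", "DA", "Da");
        System.out.println(sortLowercaseFirst(data));
    }
}
```

With the second data set from the question this prints [cache, da, Cache, CACHE, Da, DA]. Note that case is now a primary difference, so this is no longer a linguistic ordering.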

Hope it helps.

Upvotes: 1

assylias

Reputation: 328598

Interestingly, the Android Javadoc is somewhat more helpful than the Oracle one, in particular:

A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings.

Also worth noting: the order you get is what you would expect in the French locale. According to the Wikipedia article on "ordre alphabétique":

En première analyse, les caractères accentués, de même que les majuscules, ont le même rang alphabétique que le caractère fondamental.
Si plusieurs mots ont le même rang alphabétique, on tâche de les distinguer entre eux grâce aux majuscules et aux accents (pour le e, on a l'ordre e, é, è, ê, ë)

In English (the parenthetical example is my addition):

The first step consists in ranking letters regardless of their accentuation or case (i.e. a, A and à rank the same). If several words have the same rank after the first step, case and accentuation are taken into account.

In other words, c (lowercase) and D (uppercase) are already ordered at primary strength, and tertiary strength won't change that order.

So in your example, you will always have cache before da, regardless of case and accents. Case only makes a difference when the primary letters are the same (c vs. C, for example).
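The point can be checked with a small sketch. The class and method names (StrengthDemo, compareAt) are hypothetical helpers for this illustration; only Collator.getInstance, setStrength and compare are JDK calls.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {

    // Returns the sign (-1, 0 or 1) of comparing a with b in the French
    // locale at the given collator strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.FRANCE);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // Primary difference (c vs. d): decided the same way at every strength,
        // even though "Cache" starts with an uppercase letter.
        System.out.println(compareAt(Collator.PRIMARY, "Cache", "da"));
        System.out.println(compareAt(Collator.TERTIARY, "Cache", "da"));

        // Case-only difference: ignored at primary, used at tertiary.
        System.out.println(compareAt(Collator.PRIMARY, "cache", "Cache"));
        System.out.println(compareAt(Collator.TERTIARY, "cache", "Cache"));
    }
}
```

The first two comparisons are negative at both strengths, while the case-only pair compares as equal at primary strength and is only distinguished at tertiary strength, which matches the sorted output shown in the question.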

Upvotes: 4
