Reputation: 26713
I am trying to understand how case-sensitive ordering is really supposed to work with Java's Collator.
In this example, the following strings are sorted in the French locale at every strength (I have added a few extra strings to the data set for illustrative purposes):
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] - Original Data
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] Primary
[Abc, abc, ABC, Àbc, àbc, Äbc, äbc] Secondary
[abc, Abc, ABC, àbc, Àbc, äbc, Äbc] Tertiary
Case only comes into play at the tertiary collation strength:
[CACHE, cache, Cache, da, DA, Da] - Original Data
[CACHE, cache, Cache, da, DA, Da] Primary
[CACHE, cache, Cache, da, DA, Da] Secondary
[cache, Cache, CACHE, da, Da, DA] Tertiary
But the result I was really expecting was this:
[abc, àbc, äbc, Abc, ABC, Àbc, Äbc] Tertiary
[cache, da, Cache, CACHE, Da, DA] Tertiary
In other words, I would like all lowercase strings to go first (sorted alphabetically), followed by the uppercase ones (or vice versa). Is this not a reasonable expectation?
Upvotes: 8
Views: 5110
Reputation: 52366
Another option: if you need to customize the rules of a locale, you can try using RuleBasedCollator:
RuleBasedCollator collTemp = (RuleBasedCollator) Collator.getInstance(Locale.US);
String usRules = collTemp.getRules();
// Remove the dash rule from the US locale rules (dashes come after letters)
usRules = usRules.replace(",'-'", "");
// Create a collator with the customized rules (the constructor throws java.text.ParseException)
RuleBasedCollator coll = new RuleBasedCollator(usRules);
// Sort the collection with the customized collator
Collections.sort(lines, coll);
Upvotes: 1
Reputation: 298143
You should not make assumptions about the resulting ordering of a locale-sensitive collator.
It’s not meant to reflect technical aspects like ASCII order but human language rules, e.g. how people would sort book titles in a library or names in a phone book. You usually won’t find shelves of uppercase books separated from shelves of lowercase ones.
To illustrate more surprising behavior, look at the following example:
String s1="IDONTCARE", s2="idontcare";
System.out.println("Comparing '"+s1+"' and '"+s2+"' locale sensitive");
Locale[] all={ Locale.ENGLISH, new Locale("tr") };
for(Locale l:all)
{
    System.out.println();
    System.out.println(l);
    Collator c1=Collator.getInstance(l);
    c1.setStrength(Collator.PRIMARY);
    System.out.println("primary:\t"+c1.compare(s1, s2));
    c1.setStrength(Collator.SECONDARY);
    System.out.println("secondary:\t"+c1.compare(s1, s2));
    c1.setStrength(Collator.TERTIARY);
    System.out.println("tertiary:\t"+c1.compare(s1, s2));
    c1.setStrength(Collator.IDENTICAL);
    System.out.println("identical:\t"+c1.compare(s1, s2));
}
It will print:
Comparing 'IDONTCARE' and 'idontcare' locale sensitive
en
primary: 0
secondary: 0
tertiary: 1
identical: 1
tr
primary: -1
secondary: -1
tertiary: -1
identical: -1
As said, don’t expect to know the result and forget about the ASCII/Unicode lexicographic order with collators.
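If grouping by case really is required, one option is to layer it on top of the collator instead of expecting the collator to provide it. The following is only a sketch under my own assumptions: the class name CaseGroupedSort and the method sortLowercaseFirst are made up for illustration, and "lowercase" is decided purely by the first character of each string. Strings are split into two groups by the case of their first letter, and the locale's tertiary order is used inside each group.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class CaseGroupedSort {

    // Assumption: a string counts as "uppercase" when its first code point is uppercase.
    static boolean startsUpper(String s) {
        return !s.isEmpty() && Character.isUpperCase(s.codePointAt(0));
    }

    // Hypothetical helper: strings starting with a lowercase letter come first;
    // within each group the collator's tertiary order decides.
    public static List<String> sortLowercaseFirst(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY);

        Comparator<String> caseThenCollator = (a, b) -> {
            // false (lowercase start) sorts before true (uppercase start)
            int byCase = Boolean.compare(startsUpper(a), startsUpper(b));
            return byCase != 0 ? byCase : collator.compare(a, b);
        };

        List<String> copy = new ArrayList<>(input); // leave the caller's list untouched
        copy.sort(caseThenCollator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("CACHE", "cache", "Cache", "da", "DA", "Da");
        System.out.println(sortLowercaseFirst(data, Locale.FRENCH));
    }
}
```

Because each group keeps the collator's own tertiary order, this amounts to filtering the tertiary lists from the question by the case of the first letter, which is the arrangement the question asked for. It is a sort key stacked on the collator, not a change to the locale's rules.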
Upvotes: 0
Reputation: 9295
The sample code is working as intended. You can use custom collation rules to get the desired output.
RuleBasedCollator is the only subclass of Collator in the JDK, so your call to Collator.getInstance(Locale.FRANCE) returns an instance of RuleBasedCollator.
You could create your own instance using
RuleBasedCollator myCollator = new RuleBasedCollator(rules);
The format for rules is given in the javadoc.
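As a rough illustration of what such rules could look like, here is a minimal sketch that makes case a primary difference, so every lowercase letter ranks before every uppercase one. The class name CaseAsPrimaryRules and its methods are hypothetical, and the rule string only covers the unaccented letters a-z and A-Z; accented characters would need additional rules, and characters not mentioned in the rules are not handled here.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CaseAsPrimaryRules {

    // Builds "< a < b ... < z < A < B ... < Z": each letter gets its own
    // primary weight, with all lowercase letters ahead of all uppercase ones.
    public static String buildRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(' ');
        }
        for (char c = 'A'; c <= 'Z'; c++) {
            rules.append("< ").append(c).append(' ');
        }
        return rules.toString().trim();
    }

    // Sorts a copy of the input with a collator built from the rules above.
    public static List<String> sort(List<String> input) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(buildRules());
            List<String> copy = new ArrayList<>(input);
            copy.sort(collator);
            return copy;
        } catch (ParseException e) {
            // Only reachable if the generated rule string is malformed
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("CACHE", "cache", "Cache", "da", "DA", "Da")));
    }
}
```

With these rules "da" sorts before "Cache" because d outranks C at the primary level, which is exactly what a locale collator will not do.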
Hope it helps.
Upvotes: 1
Reputation: 328598
Interestingly, the Android javadoc is somewhat more helpful than the Oracle one, in particular:
A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings.
Also worth noting: the order you get is what you would expect in the French locale. According to the Wikipedia article on "ordre alphabétique":
En première analyse, les caractères accentués, de même que les majuscules, ont le même rang alphabétique que le caractère fondamental.
Si plusieurs mots ont le même rang alphabétique, on tâche de les distinguer entre eux grâce aux majuscules et aux accents (pour le e, on a l'ordre e, é, è, ê, ë)
In English (my addition in italic):
The first step consists in ranking letters, regardless of their accentuation or case (i.e.: a,A,à rank the same). If several words have the same rank following the first step, case and accentuation are taken into account.
In other words, c (lowercase) and D (uppercase) can always be ordered at Primary strength, and Tertiary strength won't change that order.
So in your example, you will always have cache before da, regardless of case and accents. Case will only make a difference if the primary letter is the same (c (lowercase) vs. C (uppercase), for example).
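The point can be checked with a small sketch. The class StrengthDemo and the helper compareAt are names I made up for illustration; the helper returns only the sign of the comparison for the French locale at a given strength.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthDemo {

    // Hypothetical helper: sign (-1, 0, 1) of comparing a and b in the
    // French locale at the given collation strength.
    public static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // Same primary letters: case is invisible at PRIMARY, decisive at TERTIARY
        System.out.println("cache vs Cache, primary:  " + compareAt(Collator.PRIMARY, "cache", "Cache"));
        System.out.println("cache vs Cache, tertiary: " + compareAt(Collator.TERTIARY, "cache", "Cache"));

        // Different primary letters (c vs d): case never gets a say
        System.out.println("Cache vs da, primary:  " + compareAt(Collator.PRIMARY, "Cache", "da"));
        System.out.println("Cache vs da, tertiary: " + compareAt(Collator.TERTIARY, "Cache", "da"));
    }
}
```

The first pair differs only by case, so it ties at Primary and is separated at Tertiary. The second pair already differs at the primary level, so the uppercase Cache stays ahead of the lowercase da at every strength.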
Upvotes: 4