Reputation: 10762
I've been testing alphabetical sorting in Chinese (if I may call it so). This is how Excel sorts some example words:
啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只
0<2<85<!<@<版本<标记<成员<错误<导出<导航<Excel 文件<访问<分类<更改<规则<HTML<基本<记录<可选<快捷方式<类别<历史记录<密码<目录<内联<内容<讨论<文件<页面<只读
and this is what came out of Collections.sort(list, simplified_chinese_collator_comparator)
(the first offending character in bold):
啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍
!<@<0<2<85<Excel 文件<HTML<版本<标记<成员<错误<导出<导航<访问<分类<更改<规则<基本<记录<可选<快捷方式<类别<历史记录<密码<目录<内联<内容<讨论<文件<页面<只读
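To pinpoint where two orderings first diverge, a small helper along these lines can be used. This is only a sketch: the class name `FirstMismatch` and the method `firstMismatch` are made up for illustration, and the two strings are the single-character samples above.

```java
import java.util.Arrays;
import java.util.List;

public class FirstMismatch {
    // Hypothetical helper: returns the index of the first position where the
    // two lists differ, or -1 if they are identical.
    static int firstMismatch(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        // Same prefix: they differ only if one list is longer than the other.
        return a.size() == b.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> excelOrder = Arrays.asList(
                "啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只".split("<"));
        List<String> javaOrder = Arrays.asList(
                "啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍".split("<"));

        int i = firstMismatch(excelOrder, javaOrder);
        // For these samples the first difference is at index 11: 馍 (Excel) vs 呢 (Java).
        System.out.println("First difference at index " + i + ": "
                + excelOrder.get(i) + " vs " + javaOrder.get(i));
    }
}
```

For the single-character sample, the only disagreement is where 馍 ends up: Excel places it after 了, the Java collator places it last.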
I don't know anything about Chinese. Does anyone know why the Collator
output is different, or what it is based on?
Are there any other libraries for language-based sorting?
Upvotes: 3
Views: 2898
Reputation: 18662
Why is it different? Because there are several different methods of sorting ideographic characters, or even entire words. The ones that stuck in my mind are:
There are other methods as well. For example, Unicode Technical Standard #35 (LDML, often cited as TR35) mentions some of them (more by coincidence, not necessarily on purpose), but you'd need plenty of time to go through it.
To answer your question about why these sort orders differ: Java ships its own collation rules and does not rely on the operating system's (as Excel does), so the rules may well differ. You might also want to try ICU, which is the source of classes and rules in Java (and is usually a step ahead of the JDK).
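To illustrate what "its own collation rules" means in practice, here is a minimal sketch using `java.text.RuleBasedCollator`, where the order comes entirely from a rule string. The rule string below is a tiny hand-written one covering five of the sample characters in the Excel order; it is an assumption for illustration only and is not the JDK's real Chinese rule set. The class name `TinyRuleCollator` and the method `sortWithRules` are likewise made up.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TinyRuleCollator {
    // Illustrative rules only: five characters, each a primary difference
    // from the previous one, in the order Excel used. NOT the real zh rules.
    static final String RULES = "< 啊 < 波 < 了 < 馍 < 呢";

    // Returns a sorted copy of the input, ordered by the rule string above.
    static List<String> sortWithRules(List<String> words) throws ParseException {
        RuleBasedCollator collator = new RuleBasedCollator(RULES);
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) throws ParseException {
        // Prints [啊, 波, 了, 馍, 呢]
        System.out.println(sortWithRules(Arrays.asList("呢", "馍", "了", "啊", "波")));
    }
}
```

Swapping 馍 and 呢 in the rule string swaps them in the output, which is the kind of difference you see between two rule sets that otherwise agree.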
Upvotes: 3
Reputation: 533510
There isn't a Collator in Java 6 or 7 that sorts the Chinese characters in the same order as the first sample.
    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class FindCollatorLocale {
        public static void main(String... args) {
            // Order produced by Excel
            String text1 = "啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只";
            findLocaleForSortedOrder(text1);
            // Order produced by the Java collator
            String text2 = "啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍";
            findLocaleForSortedOrder(text2);
        }

        // Prints every available Collator locale whose sort leaves the given order unchanged.
        private static void findLocaleForSortedOrder(String text) {
            System.out.println("For " + text + " found...");
            String[] preSorted = text.split("<");
            for (Locale locale : Collator.getAvailableLocales()) {
                String[] sorted = preSorted.clone();
                Arrays.sort(sorted, Collator.getInstance(locale));
                if (Arrays.equals(preSorted, sorted))
                    System.out.println("Locale " + locale + " has the same sorted order");
            }
            System.out.println();
        }
    }
prints
For 啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只 found...
For 啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍 found...
Locale zh_CN has the same sorted order
Locale zh has the same sorted order
Locale zh_SG has the same sorted order
Upvotes: 3