Reputation: 1002
For a Java program I'm writing, I have a particular need to sort strings lexicographically by Unicode code point. This is not the same as String.compareTo() when you start dealing with values outside the Basic Multilingual Plane. String.compareTo() compares strings lexicographically on 16-bit char values. To see that this is not equivalent, note that U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM is less than U+1D11E MUSICAL SYMBOL G CLEF, but the Java String object "\uFD00" for the Arabic character compares greater than the surrogate pair "\uD834\uDD1E" for the clef.
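To make the discrepancy concrete, here is a minimal sketch that compares the two strings both ways. The class name SurrogateOrderDemo and its helper methods are made up for illustration; only String.compareTo(), String.codePointAt() and Integer.compare() come from the JDK.

```java
final class SurrogateOrderDemo {
    // Hypothetical helper: order as seen by String.compareTo(), i.e. by 16-bit char values.
    static int charOrder(String a, String b) {
        return a.compareTo(b);
    }

    // Hypothetical helper: order of the first code point of each (non-empty) string.
    static int firstCodePointOrder(String a, String b) {
        return Integer.compare(a.codePointAt(0), b.codePointAt(0));
    }

    public static void main(String[] args) {
        String arabic = "\uFD00";       // U+FD00, a single char
        String clef = "\uD834\uDD1E";   // U+1D11E, stored as a surrogate pair

        // 0xFD00 > 0xD834, so the char-based comparison puts the Arabic character last.
        System.out.println(charOrder(arabic, clef) > 0);           // prints "true"
        // 0xFD00 < 0x1D11E, so the code point comparison puts it first.
        System.out.println(firstCodePointOrder(arabic, clef) < 0); // prints "true"
    }
}
```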
I can manually loop along the code points using String.codePointAt() and Character.charCount() and do the comparison myself if necessary. Is there an API function or other more "canonical" way of doing this?
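For reference, the manual loop I have in mind would look roughly like the sketch below. The class name CodePointOrder and its members are my own illustrative names, not an existing API; the JDK calls used are String.codePointAt(), Character.charCount() and Integer.compare().

```java
import java.util.Arrays;
import java.util.Comparator;

final class CodePointOrder {
    // Hypothetical comparator built on the manual loop below.
    static final Comparator<String> COMPARATOR = CodePointOrder::compare;

    // Compares two strings lexicographically by Unicode code point.
    static int compare(String a, String b) {
        int i = 0;
        // While the prefixes are equal, both strings advance by the same
        // number of chars, so a single index is enough.
        while (i < a.length() && i < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other (or they are equal):
        // the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        String[] values = { "\uD834\uDD1E", "\uFD00", "abc", "ab" };
        Arrays.sort(values, COMPARATOR);
        // Expected order: "ab", "abc", U+FD00, U+1D11E
        for (String s : values) {
            System.out.println(Integer.toHexString(s.codePointAt(0)) + " (" + s.length() + " chars)");
        }
    }
}
```

This works, but it is exactly the hand-rolled comparison I would prefer to replace with something standard.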
Upvotes: 16
Views: 1534
Reputation: 689
It's called collation. See https://docs.oracle.com/javase/tutorial/i18n/text/locale.html
Note that your database can also sort your query results using collations. See, for example, what MySQL supports: https://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html
Upvotes: 1