Aaron Rotenberg

Reputation: 1002

What is the preferred way to compare two Java Strings lexicographically on *Unicode code points*?

For a Java program I'm writing, I have a particular need to sort strings lexicographically by Unicode code point. This is not the same as String.compareTo() when you start dealing with values outside the Basic Multilingual Plane. String.compareTo() compares strings lexicographically on 16-bit char values. To see that this is not equivalent, note that U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM is less than U+1D11E MUSICAL SYMBOL G CLEF, but the Java String object "\uFD00" for the Arabic character compares greater than the surrogate pair "\uD834\uDD1E" for the clef.
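The discrepancy described above can be reproduced with a few lines. This is only a minimal sketch; the class and method names (SurrogateOrderDemo, byCharValue, byCodePoint) are made up for illustration, and only String.compareTo(), String.codePointAt() and Integer.compare() come from the JDK.

```java
final class SurrogateOrderDemo {
    // U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM (one char)
    static final String HAH_WITH_YEH = "\uFD00";
    // U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair, two chars)
    static final String G_CLEF = "\uD834\uDD1E";

    // Compares on 16-bit char values, as String.compareTo() does.
    static int byCharValue() {
        return HAH_WITH_YEH.compareTo(G_CLEF);
    }

    // Compares the first code point of each string.
    static int byCodePoint() {
        return Integer.compare(HAH_WITH_YEH.codePointAt(0), G_CLEF.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println("compareTo result:  " + byCharValue());
        System.out.println("code point result: " + byCodePoint());
    }
}
```

The first result is positive because 0xFD00 is greater than the high surrogate 0xD834, while the second is negative because 0xFD00 is less than 0x1D11E.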

I can manually loop along the code points using String.codePointAt() and Character.charCount() and do the comparison myself if necessary. Is there an API function or other more "canonical" way of doing this?
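For reference, the manual loop I have in mind looks roughly like this. It is a sketch, not a tested library; CodePointOrder, compare and COMPARATOR are names I invented, and it assumes unpaired surrogates should simply compare by their own value, which is what String.codePointAt() returns for them.

```java
import java.util.Comparator;

final class CodePointOrder {
    private CodePointOrder() {}

    // Lexicographic comparison on Unicode code points rather than 16-bit chars.
    static int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars in both strings.
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Convenience comparator, e.g. for List.sort or Arrays.sort.
    static final Comparator<String> COMPARATOR = CodePointOrder::compare;
}
```

With this, list.sort(CodePointOrder.COMPARATOR) would order the Arabic ligature before the G clef, unlike the natural String ordering.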

Upvotes: 16

Views: 1534

Answers (1)

jorgeu

Reputation: 689

It's called collation. See https://docs.oracle.com/javase/tutorial/i18n/text/locale.html

Note that your database can sort your query results using collations too. See, for example, what MySQL supports: https://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html
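A minimal sketch of collation-based sorting in Java, using java.text.Collator from the JDK. The class and method names (CollatorSketch, sortForLocale) are hypothetical. Keep in mind that a Collator applies locale-specific linguistic rules, so the resulting order generally differs from raw code point order; whether that fits your requirement is an assumption you would need to check.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollatorSketch {
    private CollatorSketch() {}

    // Returns a sorted copy of the input using the collation rules of the given locale.
    static String[] sortForLocale(String[] input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        String[] copy = input.clone();
        // Collator implements Comparator, so it can be passed to Arrays.sort directly.
        Arrays.sort(copy, collator);
        return copy;
    }
}
```

For example, sorting "apple", "Banana", "cherry" with Locale.US keeps "apple" first, whereas the natural String order would put "Banana" first because uppercase letters have lower char values.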

Upvotes: 1
