What are good Unicode codepoints to test with requiring UTF-16 surrogate pairs?

Question

Many programming systems, such as ICU, Java, COM, and the CLR, use UTF-16 to encode string data during processing. Surrogate-pair bugs in these systems are relatively difficult to expose, because characters in common use lie inside the Basic Multilingual Plane (BMP) and therefore need only one 16-bit code unit in UTF-16.
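To make the one-unit versus two-unit distinction concrete, here is a minimal sketch of the arithmetic that turns a code point above U+FFFF into a surrogate pair. The class and method names are made up for illustration; in real code `Character.toChars` already does this.

```java
final class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int football = 0x1F3C8; // the football emoji mentioned in the question
        char[] pair = toSurrogates(football);
        System.out.printf("U+%X -> %04X %04X%n", football, (int) pair[0], (int) pair[1]);
    }
}
```

Any code that indexes or slices a string by `char` can split such a pair in half, which is the class of bug these test characters are meant to expose.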

Previously I have used emoji characters, e.g. 🏈, to verify that things are working correctly; but I'm in a situation where the parser in question rejects non-alphabetic characters and as a result categorically rejects the emoji I tried to use.

What are some good, recognizable examples in the various Unicode categories that I can use to write good tests?

David Conrad · Accepted Answer

The Deseret alphabet, which the Mormons developed in the 19th century, is encoded outside the BMP, and its characters are considered alphabetic in Unicode. Unlike some other scripts outside the BMP, such as the ancient Ugaritic script or Egyptian hieroglyphs, Deseret is a cased script, meaning that there are uppercase and lowercase variants of each letter.

Deseret Unicode block, U+10400 - U+1044F (PDF)
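As a rough sketch of how a Deseret letter can serve as a test fixture, the snippet below builds a string from U+10400 (DESERET CAPITAL LETTER LONG I) and its lowercase partner U+10428, then queries the properties a letters-only parser would care about. The class name and the `of` helper are illustrative, not from any library.

```java
final class DeseretSample {
    static final int CAPITAL_LONG_I = 0x10400; // DESERET CAPITAL LETTER LONG I
    static final int SMALL_LONG_I = 0x10428;   // DESERET SMALL LETTER LONG I

    // Builds a String from code points, so no surrogate escapes are typed by hand.
    static String of(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String upper = of(CAPITAL_LONG_I);
        // Number of UTF-16 code units versus number of code points.
        System.out.println(upper.length());
        System.out.println(upper.codePointCount(0, upper.length()));
        // Letter and case properties, queried with the int (code point) overloads.
        System.out.println(Character.isLetter(CAPITAL_LONG_I));
        System.out.println(Character.isUpperCase(CAPITAL_LONG_I));
        System.out.println(Character.toLowerCase(CAPITAL_LONG_I) == SMALL_LONG_I);
    }
}
```

Note that the `char` overloads of these `Character` methods cannot see a supplementary character at all, so the `int` overloads are the ones to use here.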

Testing with Deseret reveals some flaws in Java's handling of Unicode. For example, s1.equalsIgnoreCase(s2), where s1 and s2 are strings containing upper- and lowercase versions of the same Deseret letters, returns false, because the equalsIgnoreCase method doesn't correctly handle surrogate pairs.
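For comparison, a case-insensitive equality check that walks the strings one code point at a time could look roughly like this. `equalsIgnoreCaseByCodePoint` is a hypothetical helper written for this answer, not a JDK method, and what the built-in `equalsIgnoreCase` returns for the same input may depend on the JDK version in use, so the sketch prints that result rather than assuming it.

```java
final class CodePointCompare {
    // Hypothetical helper: case-insensitive equality over code points
    // instead of individual UTF-16 code units.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                int ua = Character.toUpperCase(ca);
                int ub = Character.toUpperCase(cb);
                // Compare uppercased, then lowercased forms of the code points.
                if (ua != ub && Character.toLowerCase(ua) != Character.toLowerCase(ub)) {
                    return false;
                }
            }
            i += Character.charCount(ca); // advance by 1 or 2 code units
            j += Character.charCount(cb);
        }
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // Deseret capital long I
        String lower = new String(Character.toChars(0x10428)); // Deseret small long I
        System.out.println("built-in:      " + upper.equalsIgnoreCase(lower));
        System.out.println("by code point: " + equalsIgnoreCaseByCodePoint(upper, lower));
    }
}
```

This is simple per-character case mapping only; it does not attempt full Unicode case folding.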

Edited to add: I just discovered another one by looking over the Unicode code charts: "Warang Citi", or as Wikipedia spells it, "Varang Kshiti", the script of the Ho language. It's a cased script for a language spoken by about a million people in India.

Warang Citi Unicode block, U+118A0 - U+118FF (PDF)

Ancient scripts that don't distinguish case are also typically outside the BMP, such as Lydian, Phoenician, and Aramaic.
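If it helps, candidate test characters from these uncased scripts can be screened with a small predicate like the one below. The class, constants, and predicate are my own illustration; the sample code points are the first letters of the Lydian, Phoenician, and Imperial Aramaic blocks, and the result depends on the Unicode data shipped with your JDK.

```java
final class CaselessSamples {
    static final int LYDIAN_A = 0x10920;               // LYDIAN LETTER A
    static final int PHOENICIAN_ALF = 0x10900;         // PHOENICIAN LETTER ALF
    static final int IMPERIAL_ARAMAIC_ALEPH = 0x10840; // IMPERIAL ARAMAIC LETTER ALEPH

    // True when the code point needs a surrogate pair, is a letter,
    // and has no simple upper- or lowercase mapping.
    static boolean isCaselessSupplementaryLetter(int cp) {
        return Character.isSupplementaryCodePoint(cp)
                && Character.isLetter(cp)
                && Character.toUpperCase(cp) == cp
                && Character.toLowerCase(cp) == cp;
    }

    public static void main(String[] args) {
        int[] samples = { LYDIAN_A, PHOENICIAN_ALF, IMPERIAL_ARAMAIC_ALEPH };
        for (int cp : samples) {
            System.out.printf("U+%X %s caseless letter: %b%n",
                    cp, Character.getName(cp), isCaselessSupplementaryLetter(cp));
        }
    }
}
```

A cased letter such as Deseret U+10400 fails the predicate, which makes the two groups easy to keep apart in a test suite.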
