Reputation: 3

Handling strings with high Unicode codepoints (above U+FFFF)

In Kotlin, how can I iterate over a string that contains Unicode characters above U+FFFF?

Example code:

val s = "Hëllø! € 😀"
for (c in s) {
    println("$c ${c.code}")
}

Actual output:

Desired output:

Upvotes: 0

Answers (1)

k314159

Reputation: 11276

In Kotlin/JVM, strings are encoded in UTF-16. (To be precise, they may be stored internally as Latin-1, but they still behave externally as if they were encoded in UTF-16.) This means that they're made up of 16-bit Char values, so a code point above U+FFFF, such as 😀, is stored as a surrogate pair of two Chars, and iterating over the string yields each half separately. To get the actual Unicode code points, including those above U+FFFF, you can use Java's codePoints() method:

val s = "Hëllø! € 😀"
for (cp in s.codePoints()) {
    println("${buildString { appendCodePoint(cp) }} $cp")
}

Output:

However, be aware of combining characters, where multiple Unicode code points are used to make a single grapheme. If you want to support combining characters, then my answer will not help: you will need to look at Sweeper's answer to this question instead.
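
To illustrate the grapheme issue on Kotlin/JVM, here is a minimal sketch that uses the JDK's java.text.BreakIterator to split a string into user-perceived characters. This is not necessarily the approach taken in Sweeper's answer, and graphemes is a hypothetical helper name. How well complex emoji sequences are kept together depends on the JDK version, so treat it as a starting point rather than a complete solution.

```kotlin
import java.text.BreakIterator

// Hypothetical helper (not from the original answer): splits a string into
// user-perceived characters using the JDK's character BreakIterator.
fun graphemes(s: String): List<String> {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(s)
    val result = mutableListOf<String>()
    var start = boundaries.first()
    var end = boundaries.next()
    // Walk from boundary to boundary; each slice is one grapheme.
    while (end != BreakIterator.DONE) {
        result.add(s.substring(start, end))
        start = end
        end = boundaries.next()
    }
    return result
}
```

With this helper, a base letter followed by a combining accent (for example "e" followed by U+0301) comes back as a single element instead of two code points.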

Unfortunately, on other platforms, such as Kotlin/JS and Kotlin/Native, Java's codePoints() is not available, and Kotlin currently doesn't make it easy to handle Unicode. See this discussion for a list of currently open issues.
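
Where codePoints() is not available, one workaround is to combine surrogate pairs by hand. The following is a rough sketch that relies only on Char.isHighSurrogate(), Char.isLowSurrogate() and Char.code from the Kotlin standard library. codePointsOf is a hypothetical helper name, and passing unpaired surrogates through unchanged is an assumption of this sketch, not a requirement.

```kotlin
// Hypothetical helper (not from the original answer): decodes UTF-16 code
// units into Unicode code points without any Java APIs.
fun codePointsOf(s: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < s.length) {
        val c = s[i]
        if (c.isHighSurrogate() && i + 1 < s.length && s[i + 1].isLowSurrogate()) {
            // Each surrogate carries 10 bits of the offset above U+10000.
            val high = c.code - 0xD800
            val low = s[i + 1].code - 0xDC00
            result.add(0x10000 + (high shl 10) + low)
            i += 2
        } else {
            // BMP character, or an unpaired surrogate passed through as-is.
            result.add(c.code)
            i += 1
        }
    }
    return result
}
```

For the string in the question, the last element returned is 128512 (U+1F600) rather than the two surrogate values 55357 and 56832.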

Upvotes: 1
