Peter Kleiweg

Reputation: 3

Handling strings with high Unicode codepoints (above U+FFFF)

In Kotlin, how can I iterate over a string that contains Unicode characters above U+FFFF?

Example code:

val s = "Hëllø! € 😀"
for (c in s) {
    println("$c ${c.code}")
}

Actual output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
� 55357
� 56832

Desired output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
😀 128512

Upvotes: 0

Views: 60

Answers (1)

k314159

Reputation: 11276

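In Kotlin/JVM, strings are encoded in UTF-16. (To be precise, the JVM may store them internally as Latin-1, but they still behave externally as if they were encoded in UTF-16.) This means that they are made up of 16-bit code units (Char values), and a code point above U+FFFF is stored as a pair of surrogate Chars. That is why the loop in the question prints 55357 and 56832 (0xD83D and 0xDE00) instead of 128512. To get the actual Unicode code points, including those above U+FFFF, you can use Java's codePoints() method: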

val s = "Hëllø! € 😀"
for (cp in s.codePoints()) {
    println("${buildString { appendCodePoint(cp) }} $cp")
}

Output:

H 72
ë 235
l 108
l 108
ø 248
! 33
  32
€ 8364
  32
😀 128512
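
For illustration, the same iteration can also be written with an explicit index, which makes it visible how a surrogate pair is consumed as one code point. This is only a sketch for Kotlin/JVM; codePointList is a made-up helper name, not something from the standard library.

```kotlin
// Kotlin/JVM only: codePointAt and Character.charCount rely on the Java standard library.
// codePointList is a hypothetical helper name used for this example.
fun codePointList(s: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)      // reads a valid surrogate pair as a single code point
        result.add(cp)
        i += Character.charCount(cp)   // advance by 1 Char for the BMP, by 2 above U+FFFF
    }
    return result
}

fun main() {
    val s = "Hëllø! € 😀"
    println(s.length)                       // number of UTF-16 Chars
    println(s.codePointCount(0, s.length))  // number of code points
    for (cp in codePointList(s)) {
        // Character.toChars turns the code point back into one or two Chars
        println("${String(Character.toChars(cp))} $cp")
    }
}
```

The stream-based loop above is shorter; the index-based form is mainly useful when you also need the Char offset of each code point.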

However, be aware of combining characters, where multiple Unicode code points make up a single grapheme. If you need to support those, this answer will not help: look at Sweeper's answer to this question instead.
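
As a rough illustration of the grapheme problem (not necessarily the approach the linked answer takes), java.text.BreakIterator on the JVM can split a string at user-perceived character boundaries. This is a sketch assuming Kotlin/JVM; graphemes is a hypothetical helper name, and how well complex emoji sequences are handled depends on the JDK version.

```kotlin
import java.text.BreakIterator

// Hypothetical helper: splits a string into user-perceived characters (JVM only).
fun graphemes(s: String): List<String> {
    val boundary = BreakIterator.getCharacterInstance()
    boundary.setText(s)
    val result = mutableListOf<String>()
    var start = boundary.first()
    var end = boundary.next()
    while (end != BreakIterator.DONE) {
        result.add(s.substring(start, end))  // one chunk per boundary-to-boundary span
        start = end
        end = boundary.next()
    }
    return result
}

fun main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x"
    val s = "e\u0301x"
    println(s.codePointCount(0, s.length))  // counts code points
    println(graphemes(s).size)              // counts graphemes; the accent stays with its base letter
}
```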

Unfortunately, on other platforms, Kotlin currently doesn't make it easy to handle Unicode. See this discussion for a list of currently open issues.
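
Until that improves, one workaround is to decode surrogate pairs by hand. The sketch below uses only Char.isHighSurrogate(), Char.isLowSurrogate() and Char.code, which to my knowledge are available in the common standard library, so it should not depend on the JVM. forEachCodePoint is a hypothetical extension name, and passing unpaired surrogates through unchanged is a choice made for this example.

```kotlin
// Hypothetical extension, not part of the standard library.
inline fun String.forEachCodePoint(action: (Int) -> Unit) {
    var i = 0
    while (i < length) {
        val hi = this[i]
        if (hi.isHighSurrogate() && i + 1 < length && this[i + 1].isLowSurrogate()) {
            val lo = this[i + 1]
            // Standard UTF-16 decoding: 10 bits from each surrogate, offset by 0x10000
            action(0x10000 + ((hi.code - 0xD800) shl 10) + (lo.code - 0xDC00))
            i += 2
        } else {
            // BMP character, or an unpaired surrogate passed through as-is
            action(hi.code)
            i += 1
        }
    }
}

fun main() {
    "Hëllø! € 😀".forEachCodePoint { cp -> println(cp) }
}
```

This only yields the numeric code points; turning a code point above U+FFFF back into a string without the JVM helpers means rebuilding the two surrogate Chars yourself.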

Upvotes: 1
