Reputation: 3
In Kotlin, how can I iterate over a string that contains Unicode characters above U+FFFF?
Example code:
val s = "Hëllø! € 😀"
for (c in s) {
    println("$c ${c.code}")
}
Actual output:
H 72
ë 235
l 108
l 108
ø 248
! 33
32
€ 8364
32
� 55357
� 56832
Desired output:
H 72
ë 235
l 108
l 108
ø 248
! 33
32
€ 8364
32
😀 128512
Upvotes: 0
Views: 60
Reputation: 11276
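In Kotlin/JVM, strings are encoded in UTF-16. (To be precise, they may be stored internally as Latin-1, but they still behave externally as if they were encoded in UTF-16.) This means that a String is a sequence of 16-bit Char values (UTF-16 code units), and a code point above U+FFFF is stored as a surrogate pair of two Chars. That is why 😀 shows up as 55357 and 56832 when you iterate Char by Char. To get the actual Unicode code points, including those above U+FFFF, you can use Java's codePoints() method: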
val s = "Hëllø! € 😀"
for (cp in s.codePoints()) {
    println("${buildString { appendCodePoint(cp) }} $cp")
}
Output:
H 72
ë 235
l 108
l 108
ø 248
! 33
32
€ 8364
32
😀 128512
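If you would rather not go through a Java stream, the same result can be obtained by walking the string by index. The following is only a sketch for Kotlin/JVM: codePointsOf is a hypothetical helper name, not something from the standard library, and it relies on codePointAt and Character.charCount to skip over both halves of a surrogate pair.

```kotlin
// Hypothetical helper (not a standard library function): collects the
// Unicode code points of a string by stepping over surrogate pairs.
fun codePointsOf(s: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)        // JVM-only: reads one full code point
        result.add(cp)
        i += Character.charCount(cp)     // 1 Char for U+0000..U+FFFF, 2 above that
    }
    return result
}

fun main() {
    for (cp in codePointsOf("Hëllø! € 😀")) {
        // Character.toChars turns a code point back into one or two Chars
        println("${String(Character.toChars(cp))} $cp")
    }
}
```

For the example string this visits ten code points instead of eleven Chars, because the surrogate pair is read as a single value.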
However, be aware of combining characters, where several Unicode code points together form a single grapheme (for example, e followed by U+0301 COMBINING ACUTE ACCENT renders as é). codePoints() yields each of those code points separately, so if you need to handle combining characters, this answer will not help: see Sweeper's answer to this question instead.
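To illustrate the difference between code points and graphemes, here is a rough sketch for Kotlin/JVM using java.text.BreakIterator. This is not necessarily the approach taken in the other answer, graphemesOf is a hypothetical helper name, and how well BreakIterator copes with more complex sequences (such as multi-part emoji) depends on the JDK version, so treat it as a starting point only.

```kotlin
import java.text.BreakIterator

// Hypothetical helper: splits a string at the character boundaries
// reported by BreakIterator's character instance.
fun graphemesOf(s: String): List<String> {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(s)
    val result = mutableListOf<String>()
    var start = boundaries.first()
    var end = boundaries.next()
    while (end != BreakIterator.DONE) {
        result.add(s.substring(start, end))
        start = end
        end = boundaries.next()
    }
    return result
}

fun main() {
    // "e" + U+0301 COMBINING ACUTE ACCENT, then "x"
    val s = "e\u0301x"
    println(s.codePoints().count())   // 3 code points
    println(graphemesOf(s).size)      // 2 user-perceived characters
}
```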
Unfortunately, codePoints() is a Java API, so it is only available on Kotlin/JVM. On other platforms, Kotlin currently doesn't make it easy to handle Unicode. See this discussion for a list of currently open issues.
Upvotes: 1