digory doo

Reputation: 2311

Kotlin four-byte unicode literals?

How can I declare a Char range in Kotlin that encloses a four-byte range?

private val CJK_IDEOGRAPHS_EXT_A = '\u3400' .. '\u4DBF'    // OK
private val CJK_IDEOGRAPHS_EXT_B = '\u20000' .. '\u2A6DF'  // doesn't compile

I tried the following hack, but I get the warning, "this cast can never succeed":

private val CJK_IDEOGRAPHS_EXT_B: CharRange = 0x20000 as Char .. 0x2A6DF as Char

Basically I want to implement a function like this:

fun isCJK(c: Char): Boolean {
    return c in CJK_RADICALS ||
        c in CJK_SYMBOLS ||
        c in CJK_STROKES ||
        c in CJK_ENCLOSED ||
        c in CJK_IDEOGRAPHS ||
        c in CJK_COMPAT ||
        c in CJK_COMPAT_IDEOGRAPHS ||
        c in CJK_COMPAT_FORMS ||
        c in CJK_IDEOGRAPHS_EXT_A
        // EXT_B not working
        // EXT_C not working
        // EXT_D not working
        // EXT_E not working
        // EXT_F not working
}

I'm using Kotlin under Android.

Upvotes: 2

Views: 1468

Answers (1)

Alexey Romanov

Reputation: 170839

On the JVM, Char is a 16-bit UTF-16 code unit, so the largest code point a single Char can represent is 0xFFFF; the ranges you mention lie above that and are represented by surrogate pairs (two Chars each). So your function should take a String instead, e.g.

private val CJK_IDEOGRAPHS_EXT_B: IntRange = 0x20000 .. 0x2A6DF 
...

fun isCJK(s: String): Boolean {
    // Kotlin has no `new` keyword; also reject the empty string, where codePointAt(0) would throw.
    if (s.codePointCount(0, s.length) != 1)
        throw IllegalArgumentException("String \"$s\" must contain exactly 1 code point")
    val c = s.codePointAt(0)
    return c in CJK_RADICALS ||
        c in CJK_SYMBOLS ||
        c in CJK_STROKES ||
        c in CJK_ENCLOSED ||
        c in CJK_IDEOGRAPHS ||
        c in CJK_COMPAT ||
        c in CJK_COMPAT_IDEOGRAPHS ||
        c in CJK_COMPAT_FORMS ||
        c in CJK_IDEOGRAPHS_EXT_A ||
        c in CJK_IDEOGRAPHS_EXT_B || ...
}
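
To see why a String is needed here, it may help to look at what a code point above 0xFFFF looks like on the JVM. The following is a minimal sketch; codePointToString is a hypothetical helper name, not something from the question or from the standard library.

```kotlin
// Hypothetical helper: builds a String holding exactly one code point.
fun codePointToString(codePoint: Int): String = String(Character.toChars(codePoint))

fun main() {
    val s = codePointToString(0x20000)      // first code point of CJK Extension B
    println(s.length)                       // 2 (two UTF-16 code units)
    println(s.codePointCount(0, s.length))  // 1 (a single code point)
    // The two Chars are a high surrogate followed by a low surrogate.
    println(Character.isHighSurrogate(s[0]) && Character.isLowSurrogate(s[1]))  // true
    println(s.codePointAt(0) == 0x20000)    // true
}
```

A string built this way is what you would pass to a String-based isCJK for an Extension B character.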

Java 8 added a much more convenient IntStream codePoints() method on CharSequence, but on Android it is only available from API level 24.
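
If codePoints() is not available on your target, the same iteration can be written by hand with codePointAt and Character.charCount. This is only a sketch: codePointsOf and containsExtB are hypothetical names, and only the Extension B range from the question is checked.

```kotlin
// Range taken from the question (CJK Unified Ideographs Extension B).
val CJK_IDEOGRAPHS_EXT_B: IntRange = 0x20000..0x2A6DF

// Hypothetical helper: collects the code points of a string
// without relying on CharSequence.codePoints().
fun codePointsOf(s: String): List<Int> {
    val result = ArrayList<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)
        result.add(cp)
        // Advance by 1 Char for BMP code points, 2 for surrogate pairs.
        i += Character.charCount(cp)
    }
    return result
}

// Hypothetical helper: true if any code point of s lies in Extension B.
fun containsExtB(s: String): Boolean =
    codePointsOf(s).any { it in CJK_IDEOGRAPHS_EXT_B }
```

For example, containsExtB("a" + String(Character.toChars(0x20000))) returns true, while containsExtB("漢字") returns false because both of those characters lie in the Basic Multilingual Plane.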

Upvotes: 2
