Reputation: 1061
It all started as a pretty basic question: given a char (or rather, an integer code point; see the Character API), return the number of bytes required for its UTF-8 encoding. However, the more time I spent with this innocent little problem, the more confusing it became.
My first approach was:
int getUtf8ByteCount_stdlib(int codePoint) {
    int[] codePoints = { codePoint };
    String string = new String(codePoints, 0, 1);
    byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
    return bytes.length;
}
Or, for those who like it obfuscated:
int getUtf8ByteCount_obfuscated(int codePoint) {
    return new String(new int[] { codePoint }, 0, 1).getBytes(StandardCharsets.UTF_8).length;
}
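For the ordinary cases this behaves as expected; a quick illustrative check (a sketch, assuming the method above is in scope):
// Illustrative sketch: byte counts for a few ordinary code points.
int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 }; // 'A', U+00E9, U+20AC, U+1F600
for (int cp : samples) {
    System.out.printf("U+%04X -> %d byte(s)%n", cp, getUtf8ByteCount_stdlib(cp));
}
// Prints 1, 2, 3 and 4 bytes respectively.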
Then I created another version (based on the UTF-8 Wikipedia article), for simplicity and probably efficiency:
int getUtf8ByteCount_handRolled(int codePoint) {
    if (codePoint > 0x7FFFFFFF) {
        throw new IllegalArgumentException("invalid UTF-8 code point");
    }
    return codePoint <= 0x7F ? 1
         : codePoint <= 0x7FF ? 2
         : codePoint <= 0xFFFF ? 3
         : codePoint <= 0x1FFFFF ? 4
         : codePoint <= 0x3FFFFFF ? 5
         : 6;
}
After years of struggling with the many lovely subtleties of character encoding, I ran a test and lo! it failed: for all code points from '\uD800' to '\uDFFF', the "stdlib" version returns 1 byte versus 3 bytes for the "hand-rolled" version. Sure enough, it's the good ol' surrogate characters causing havoc again! Now, from my understanding of those pesky little buggers, I would say that the second version is correct.
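A minimal sketch of that comparison (not my exact test code; it assumes both methods above are in scope):
// Compare the two approaches over the surrogate range U+D800..U+DFFF.
for (int cp = 0xD800; cp <= 0xDFFF; cp++) {
    int viaStdlib = getUtf8ByteCount_stdlib(cp);
    int viaHandRolled = getUtf8ByteCount_handRolled(cp);
    if (viaStdlib != viaHandRolled) {
        System.out.printf("U+%04X: stdlib=%d, handRolled=%d%n", cp, viaStdlib, viaHandRolled);
    }
}
// Every code point in this range prints stdlib=1 versus handRolled=3.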
My questions: Is String.getBytes() (or Java's UTF-8 implementation) broken, or is it my understanding? (I'm using Oracle Java SE Runtime Environment 1.6.0_22-b04.)
Upvotes: 5
Views: 2011
Reputation: 100309
The problem is that a string consisting of a single "surrogate" code point is not a valid String at all from Java's point of view. The default behavior of the encoder used by String.getBytes() is described in the JavaDoc:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
The default replacement byte array is the single byte 0x3F (which is the '?' symbol in UTF-8), so that's what you got when encoding the 0xD800 code point.
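For illustration, a minimal sketch of what happens with a lone surrogate:
// Sketch: a lone surrogate is silently replaced by the default
// replacement byte array, i.e. the single byte 0x3F ('?').
byte[] bytes = new String(new int[] { 0xD800 }, 0, 1).getBytes(StandardCharsets.UTF_8);
System.out.println(bytes.length); // 1
System.out.println(bytes[0]);     // 63, i.e. '?'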
As suggested in the JavaDoc, you may do it at a lower level using CharsetEncoder:
static int getUtf8ByteCount(int codePoint) throws CharacterCodingException {
    // encode() returns a flipped ByteBuffer; remaining() gives the number of
    // encoded bytes (the backing array may be larger than the actual content).
    return StandardCharsets.UTF_8
            .newEncoder()
            .encode(CharBuffer.wrap(new String(new int[] { codePoint }, 0, 1)
                    .toCharArray()))
            .remaining();
}
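A hypothetical caller might look like this (my own sketch, not part of the original code):
try {
    System.out.println(getUtf8ByteCount(0x20AC)); // 3 (the euro sign)
    System.out.println(getUtf8ByteCount(0xD800)); // throws
} catch (CharacterCodingException e) {
    System.out.println("not encodable: " + e);
}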
This way, supplying 0xD800, you will get a MalformedInputException. Wikipedia says:
Isolated surrogate code points have no general interpretation
So basically you should decide how to deal with these code points. Returning 3 bytes is no more correct than returning 1 byte. It's simply incorrect input, so there's no corresponding correct output for it.
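For example, if you decide to reject such input outright, the hand-rolled version could be tightened along these lines (a sketch under that assumption, not your original method; note that RFC 3629 limits UTF-8 to code points up to U+10FFFF, so the 5- and 6-byte forms never occur in practice):
static int getUtf8ByteCount(int codePoint) {
    // Reject negative values, lone surrogates and anything beyond U+10FFFF.
    if (codePoint < 0
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            || codePoint > 0x10FFFF) {
        throw new IllegalArgumentException("not an encodable code point: " + codePoint);
    }
    return codePoint <= 0x7F ? 1
         : codePoint <= 0x7FF ? 2
         : codePoint <= 0xFFFF ? 3
         : 4;
}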
Note that your if (codePoint > 0x7FFFFFFF) condition is meaningless, as 0x7FFFFFFF is Integer.MAX_VALUE, so no int value can exceed it. Probably it's better to replace it with if (codePoint < 0).
Upvotes: 2