Reputation: 1061
It all started as a pretty basic question: given a char (or rather, an integer code point; see the Character API), return the number of bytes required for its UTF-8 encoding. However, the more time I spent with this innocent little problem, the more confusing it became.
My first approach was:
int getUtf8ByteCount_stdlib(int codePoint) {
    int[] codePoints = { codePoint };
    String string = new String(codePoints, 0, 1);
    byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
    return bytes.length;
}
Or, for those who like it obfuscated:
int getUtf8ByteCount_obfuscated(int codePoint) {
    return new String(new int[] { codePoint }, 0, 1).getBytes(StandardCharsets.UTF_8).length;
}
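For the ordinary cases this behaves as expected; a quick illustrative check (a sketch, assuming the method above is in scope):
// Illustrative sketch: byte counts for a few ordinary code points.
int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 }; // 'A', U+00E9, U+20AC, U+1F600
for (int cp : samples) {
    System.out.printf("U+%04X -> %d byte(s)%n", cp, getUtf8ByteCount_stdlib(cp));
}
// Prints 1, 2, 3 and 4 bytes respectively.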
Then I created another version (based on the UTF-8 Wikipedia article), for simplicity and probably efficiency:
int getUtf8ByteCount_handRolled(int codePoint) {
    if (codePoint > 0x7FFFFFFF) {
        throw new IllegalArgumentException("invalid UTF-8 code point");
    }
    return codePoint <= 0x7F ? 1
         : codePoint <= 0x7FF ? 2
         : codePoint <= 0xFFFF ? 3
         : codePoint <= 0x1FFFFF ? 4
         : codePoint <= 0x3FFFFFF ? 5
         : 6;
}
After years of struggling with the many lovely subtleties of character encoding, I ran a test and lo! it failed: for all code points from '\uD800' to '\uDFFF', the "stdlib" version returns 1 byte versus 3 bytes for the "hand-rolled" version. Sure enough, it's the good ol' surrogate characters causing havoc again! Now, from my understanding of those pesky little buggers, I would say that the second version is correct.
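A minimal sketch of that comparison (not my exact test code; it assumes both methods above are in scope):
// Compare the two approaches over the surrogate range U+D800..U+DFFF.
for (int cp = 0xD800; cp <= 0xDFFF; cp++) {
    int viaStdlib = getUtf8ByteCount_stdlib(cp);
    int viaHandRolled = getUtf8ByteCount_handRolled(cp);
    if (viaStdlib != viaHandRolled) {
        System.out.printf("U+%04X: stdlib=%d, handRolled=%d%n", cp, viaStdlib, viaHandRolled);
    }
}
// Every code point in this range prints stdlib=1 versus handRolled=3.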
My questions: Is String.getBytes() (or Java's UTF-8 implementation) broken, or is it my understanding? (I'm using Oracle Java SE Runtime Environment 1.6.0_22-b04.)
Upvotes: 5
Views: 2011
Reputation: 100309
The problem is that a string consisting of a single "surrogate" code point is not a valid String at all from Java's point of view. The default behavior of the encoder used by String.getBytes() is described in the JavaDoc:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
The default replacement byte array is the single byte 0x3F (which is the '?' symbol in UTF-8), so that's what you got when encoding the 0xD800 code point.
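For illustration, a minimal sketch of what happens with a lone surrogate:
// Sketch: a lone surrogate is silently replaced by the default
// replacement byte array, i.e. the single byte 0x3F ('?').
byte[] bytes = new String(new int[] { 0xD800 }, 0, 1).getBytes(StandardCharsets.UTF_8);
System.out.println(bytes.length); // 1
System.out.println(bytes[0]);     // 63, i.e. '?'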
As suggested in the JavaDoc, you may do it at a lower level using CharsetEncoder:
static int getUtf8ByteCount(int codePoint) throws CharacterCodingException {
    // encode() returns a flipped ByteBuffer; remaining() gives the number of
    // encoded bytes (the backing array may be larger than the actual content).
    return StandardCharsets.UTF_8
            .newEncoder()
            .encode(CharBuffer.wrap(new String(new int[] { codePoint }, 0, 1)
                    .toCharArray()))
            .remaining();
}
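A hypothetical caller might look like this (my own sketch, not part of the original code):
try {
    System.out.println(getUtf8ByteCount(0x20AC)); // 3 (the euro sign)
    System.out.println(getUtf8ByteCount(0xD800)); // throws
} catch (CharacterCodingException e) {
    System.out.println("not encodable: " + e);
}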
This way, supplying 0xD800, you will get a MalformedInputException. Wikipedia says:
Isolated surrogate code points have no general interpretation
So basically you should decide how to deal with these code points. Returning 3 bytes is no more correct than returning 1 byte. It's simply incorrect input, so there's no corresponding correct output for it.
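For example, if you decide to reject such input outright, the hand-rolled version could be tightened along these lines (a sketch under that assumption, not your original method; note that RFC 3629 limits UTF-8 to code points up to U+10FFFF, so the 5- and 6-byte forms never occur in practice):
static int getUtf8ByteCount(int codePoint) {
    // Reject negative values, lone surrogates and anything beyond U+10FFFF.
    if (codePoint < 0
            || (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            || codePoint > 0x10FFFF) {
        throw new IllegalArgumentException("not an encodable code point: " + codePoint);
    }
    return codePoint <= 0x7F ? 1
         : codePoint <= 0x7FF ? 2
         : codePoint <= 0xFFFF ? 3
         : 4;
}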
Note that your if (codePoint > 0x7FFFFFFF) condition is meaningless, as 0x7FFFFFFF is Integer.MAX_VALUE, so no int value can exceed it. Probably it's better to replace it with if (codePoint < 0).
Upvotes: 2