Reputation: 11266
I have a Unicode codepoint, which could be anything: possibly ASCII, possibly something in the BMP, and possibly an exotic emoji such as U+1F612.
I expected there would be an easy way to take a codepoint and encode it into a byte array, but I can't find a simple way. I can turn it into a String, and then encode it, but that is a round-about way involving first encoding it to UTF-16 and then re-encoding it to the required encoding. I'd like to encode it directly to bytes.
public static byte[] encodeCodePoint(int codePoint, Charset charset) {
// Surely there's got to be a better way than this:
return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}
Upvotes: 3
Views: 295
Reputation: 21
If you want to encode an emoji to four bytes in UTF-8 (this form applies only to code points from U+10000 upward):
out.write(0xf0 | ((codePoint >> 18)));
out.write(0x80 | ((codePoint >> 12) & 0x3f));
out.write(0x80 | ((codePoint >> 6) & 0x3f));
out.write(0x80 | (codePoint & 0x3f));
Here is the whole function for converting chars to bytes. I need to write the bytes to a stream, with the byte count written first. You can change it to create a byte array instead.
void writeStringBytes(DataOutput out, char[] chars,final int off,final int strlen) throws IOException {
int utflen = strlen; // optimized for ASCII
// count the number of bytes we need
for (int i = 0; i < strlen; i++) {
int c = chars[off+i];
if (c >= 0x80 || c == 0){
if((c>=Character.MIN_HIGH_SURROGATE && c<=Character.MAX_HIGH_SURROGATE) ||
(c>=Character.MIN_LOW_SURROGATE && c<=Character.MAX_LOW_SURROGATE)){
utflen += 1;
} else {
utflen += (c >= 0x800) ? 2 : 1;
}
}
}
out.writeInt(utflen); // I need the number of bytes first. You could create the array here instead: new byte[utflen]
if(utflen==strlen){// only ascii chars
for (int i = 0; i < strlen; i++) {
out.write(chars[off+i]);
}
return;
}
for (int i=0; i < strlen; i++) {
int c = chars[off+i];
if (c < 0x80 && c != 0) {
out.write(c);
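} else if((c>=Character.MIN_HIGH_SURROGATE && c<=Character.MAX_HIGH_SURROGATE)) {
int uc = Character.codePointAt(chars,off+i,off+strlen); // the limit keeps the lookup inside the requested range
if (uc < Character.MIN_SUPPLEMENTARY_CODE_POINT) {// unpaired high surrogate: codePointAt returns the surrogate itself, never a negative value
out.write('?');
out.write('?');
} else {
out.write(0xf0 | ((uc >> 18)));
out.write(0x80 | ((uc >> 12) & 0x3f));
out.write(0x80 | ((uc >> 6) & 0x3f));
out.write(0x80 | (uc & 0x3f));
i++;
}
} else if((c>=Character.MIN_LOW_SURROGATE && c<=Character.MAX_LOW_SURROGATE)) {// unpaired low surrogate, counted as 2 bytes above
out.write('?');
out.write('?');
} else if (c >= 0x800) {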
out.write(0xE0 | ((c >> 12) & 0x0F));
out.write(0x80 | ((c >> 6) & 0x3F));
out.write(0x80 | ((c >> 0) & 0x3F));
} else {
out.write(0xC0 | ((c >> 6) & 0x1F));
out.write(0x80 | ((c >> 0) & 0x3F));
}
}
}
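The same bit patterns can be applied to a single code point directly, without going through a char array. The following is only a minimal sketch under a few assumptions: the class name `CodePointUtf8` and method name `encodeUtf8` are made up for illustration, the output is standard UTF-8 only (so U+0000 becomes a single zero byte, unlike the two-byte form used by the function above), and surrogate values are rejected rather than replaced with `?`.

```java
final class CodePointUtf8 {
    /** Encodes one Unicode code point as standard UTF-8 (1 to 4 bytes). Hypothetical helper. */
    static byte[] encodeUtf8(int codePoint) {
        // Reject values outside U+0000..U+10FFFF and lone surrogates (U+D800..U+DFFF).
        if (!Character.isValidCodePoint(codePoint)
                || (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE)) {
            throw new IllegalArgumentException("Not a Unicode scalar value: " + codePoint);
        }
        if (codePoint < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) codePoint };
        } else if (codePoint < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (codePoint >> 6)),
                (byte) (0x80 | (codePoint & 0x3F)) };
        } else if (codePoint < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (codePoint >> 12)),
                (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
                (byte) (0x80 | (codePoint & 0x3F)) };
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (codePoint >> 18)),
                (byte) (0x80 | ((codePoint >> 12) & 0x3F)),
                (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
                (byte) (0x80 | (codePoint & 0x3F)) };
        }
    }
}
```

For example, `CodePointUtf8.encodeUtf8(0x1F612)` yields the four bytes F0 9F 98 92. This only covers UTF-8; for any other charset you still need the charset encoders.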
Upvotes: 0
Reputation: 598134
There is really no way to avoid using UTF-16, since Java uses UTF-16 for text data, and that is what the charset converters are designed for. But that doesn't mean you have to use a String for the UTF-16 data:
public static byte[] encodeCodePoint(int codePoint, Charset charset) {
char[] chars = Character.toChars(codePoint);
CharBuffer cb = CharBuffer.wrap(chars);
ByteBuffer buff = charset.encode(cb);
byte[] bytes = new byte[buff.remaining()];
buff.get(bytes);
return bytes;
}
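As a quick check, the function above can be run against the emoji from the question. This is just a sketch: the class name `EncodeDemo` and the `hex` helper are assumptions added for the demonstration, and the method body is copied unchanged from above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeDemo {
    // Same approach as above: code point -> char[] (UTF-16) -> charset encoder.
    static byte[] encodeCodePoint(int codePoint, Charset charset) {
        char[] chars = Character.toChars(codePoint);
        CharBuffer cb = CharBuffer.wrap(chars);
        ByteBuffer buff = charset.encode(cb);
        byte[] bytes = new byte[buff.remaining()];
        buff.get(bytes);
        return bytes;
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeCodePoint(0x1F612, StandardCharsets.UTF_8)));    // F0 9F 98 92
        System.out.println(hex(encodeCodePoint(0x1F612, StandardCharsets.UTF_16BE))); // D8 3D DE 12
        System.out.println(hex(encodeCodePoint('A', StandardCharsets.US_ASCII)));     // 41
    }
}
```

Note that `Charset.encode` substitutes the charset's replacement bytes for characters it cannot map rather than throwing, so a caller that needs to detect unmappable code points would have to use a `CharsetEncoder` configured to report errors.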
Upvotes: 1