k314159

Reputation: 11266

Encode a codepoint

I have a Unicode codepoint, which could be anything: possibly ASCII, possibly something in the BMP, and possibly an exotic emoji such as U+1F612.

I expected there would be an easy way to take a codepoint and encode it into a byte array, but I can't find one. I can turn it into a String and then encode that, but this is a roundabout route: the codepoint is first encoded to UTF-16 and then re-encoded to the required charset. I'd like to encode it directly to bytes.

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    // Surely there's got to be a better way than this:
    return new StringBuilder().appendCodePoint(codePoint).toString().getBytes(charset);
}

Upvotes: 3

Views: 295

Answers (2)

Arkadiusz

Reputation: 21

If you want to encode an emoji (any code point above U+FFFF) as four bytes in UTF-8:

out.write(0xf0 | ((codePoint >> 18)));
out.write(0x80 | ((codePoint >> 12) & 0x3f));
out.write(0x80 | ((codePoint >>  6) & 0x3f));
out.write(0x80 | (codePoint & 0x3f));
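
For UTF-8 specifically, the same bit-shifting idea extends to the one-, two- and three-byte forms, so a code point can be encoded without going through UTF-16 at all. Below is a minimal sketch under that assumption. `Utf8Sketch` and `encodeUtf8` are hypothetical names used for illustration, not JDK API. It produces standard UTF-8 (U+0000 becomes a single zero byte) and, as a design choice of this sketch, rejects surrogates and out-of-range values instead of substituting '?'.

```java
final class Utf8Sketch {

    // Hypothetical helper: encodes a single code point as standard UTF-8.
    static byte[] encodeUtf8(int codePoint) {
        // Assumption of this sketch: surrogates and values above U+10FFFF are errors.
        if (!Character.isValidCodePoint(codePoint)
                || (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE)) {
            throw new IllegalArgumentException(
                    "Not an encodable code point: 0x" + Integer.toHexString(codePoint));
        }
        if (codePoint < 0x80) {
            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) codePoint };
        } else if (codePoint < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (codePoint >> 6)),
                (byte) (0x80 | (codePoint & 0x3F))
            };
        } else if (codePoint < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (codePoint >> 12)),
                (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
                (byte) (0x80 | (codePoint & 0x3F))
            };
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (codePoint >> 18)),
                (byte) (0x80 | ((codePoint >> 12) & 0x3F)),
                (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
                (byte) (0x80 | (codePoint & 0x3F))
            };
        }
    }
}
```

For example, `Utf8Sketch.encodeUtf8(0x1F612)` yields the four bytes F0 9F 98 92. This only covers UTF-8; any other charset still needs a charset encoder.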

Here is the whole function that converts chars to bytes. I need to write the bytes to a stream, preceded by the number of bytes. You can change it to create a byte array instead.

void writeStringBytes(DataOutput out, char[] chars,final int off,final int strlen) throws IOException {
    
    int utflen = strlen; // optimized for ASCII

    // Count the bytes we need. U+0000 is counted (and later written) as two bytes, as in modified UTF-8.

    for (int i = 0; i < strlen; i++) {
        int c = chars[off+i];            
        if (c >= 0x80 || c == 0){
            if((c>=Character.MIN_HIGH_SURROGATE && c<=Character.MAX_HIGH_SURROGATE) ||
                    (c>=Character.MIN_LOW_SURROGATE && c<=Character.MAX_LOW_SURROGATE)){
                utflen += 1;
            } else {
                utflen += (c >= 0x800) ? 2 : 1;
            }
        }
    }
    
    
    out.writeInt(utflen); // I need the number of bytes first. To build an array instead, allocate new byte[utflen] here.
    
    
    if(utflen==strlen){// only ascii chars
        for (int i = 0; i < strlen; i++) {
            out.write(chars[off+i]);
        }
        return;
    }
    
    for (int i=0; i < strlen; i++) {
        int c = chars[off+i];
        if (c < 0x80 && c != 0) {
            out.write(c);
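        } else if (Character.isSurrogate((char) c)) {
            // codePointAt never returns a negative value; an unpaired surrogate comes back unchanged.
            // The limit argument keeps the lookup inside the requested range.
            int uc = Character.codePointAt(chars, off + i, off + strlen);
            if (!Character.isSupplementaryCodePoint(uc)) { // unpaired surrogate: two bytes, matching the count above
                out.write('?');
                out.write('?');
            } else {
                out.write(0xf0 | (uc >> 18));
                out.write(0x80 | ((uc >> 12) & 0x3f));
                out.write(0x80 | ((uc >>  6) & 0x3f));
                out.write(0x80 | (uc & 0x3f));
                i++; // skip the low surrogate that was just consumed
            }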
        } else if (c >= 0x800) {
            out.write(0xE0 | ((c >> 12) & 0x0F));
            out.write(0x80 | ((c >>  6) & 0x3F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        } else {
            out.write(0xC0 | ((c >>  6) & 0x1F));
            out.write(0x80 | ((c >>  0) & 0x3F));
        }
    }
}    

Upvotes: 0

Remy Lebeau

Reputation: 598134

There is really no way to avoid using UTF-16, since Java uses UTF-16 for text data, and that is what the charset converters are designed to work with. But that doesn't mean you have to use a String to hold the UTF-16 data:

public static byte[] encodeCodePoint(int codePoint, Charset charset) {
    char[] chars = Character.toChars(codePoint);
    CharBuffer cb = CharBuffer.wrap(chars);
    ByteBuffer buff = charset.encode(cb);
    byte[] bytes = new byte[buff.remaining()];
    buff.get(bytes);
    return bytes;
}
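
One caveat with `Charset.encode`: it always replaces malformed or unmappable input with the charset's replacement bytes (typically '?') rather than failing. If an error is preferable, a variation along the following lines should work. `StrictCodePointEncoder` and `encodeCodePointStrict` are hypothetical names used for illustration; the `CharsetEncoder` calls are standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictCodePointEncoder {

    // Hypothetical variant: fails instead of substituting a replacement byte.
    static byte[] encodeCodePointStrict(int codePoint, Charset charset)
            throws CharacterCodingException {
        // Character.toChars throws IllegalArgumentException for invalid code points.
        CharBuffer chars = CharBuffer.wrap(Character.toChars(codePoint));
        ByteBuffer encoded = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars); // throws CharacterCodingException on bad input
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }
}
```

With this sketch, encoding U+1F612 as UTF-8 gives F0 9F 98 92, while asking for US-ASCII throws a `CharacterCodingException` instead of returning '?'.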

Upvotes: 1
