overexchange

Reputation: 1

Query on reading bytes from "UTF-8" world to Java "char"

With the below code snippet given in this link,

byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                      0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21};  // "Hi,您好!"

Charset charset = Charset.forName("UTF-8");
// Encode from UCS-2 to UTF-8
// Create a ByteBuffer by wrapping a byte array
ByteBuffer bb = ByteBuffer.wrap(bytes);
// Create a CharBuffer from a view of this ByteBuffer
CharBuffer cb = bb.asCharBuffer();

The documentation for wrap() says, "The new buffer will be backed by the given byte array". So there is no conversion from bytes to any other format here; the byte array is simply placed in a buffer.

Can you please help me understand what exactly we are doing when we call bb.asCharBuffer() in the above code? cb is similar to an array of characters. Because char is UTF-16 in Java, does asCharBuffer() treat every 2 bytes in bb as a char? Is this the right approach? If not, please help me with the right approach.

Edit: I tried this program as recommended by Meisch below,

byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21};  // "Hi,您好!"

Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);

which throws this exception:

Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at java.nio.charset.CharsetDecoder.decode(Unknown Source)
    at TestCharSet.main(TestCharSet.java:16)

Please help me, I am stuck here!

Note: I am using Java 1.6.

Upvotes: 3

Views: 1787

Answers (2)

VGR

Reputation: 44318

You ask: “Because char is UTF-16 in Java, using asCharBuffer() method, are we considering every 2 bytes in bb as char?”

The answer to that question is yes. Your understanding is correct.
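To see that pairing in action, here is a minimal sketch using the byte array from the question. The class name AsCharBufferDemo and the helper viewAsChars are hypothetical names chosen for illustration; the sketch relies on the fact that a wrapped ByteBuffer is big-endian by default.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class AsCharBufferDemo {
    // Hypothetical helper: views each pair of bytes as one char.
    // No charset decoding happens; the bytes are only regrouped two at a time.
    public static String viewAsChars(byte[] bytes) {
        CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
        return cb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                        0x60, (byte) 0xA8, 0x59, 0x7D, 0x00, 0x21};
        String s = viewAsChars(bytes);
        System.out.println(s.length());                     // 6 (12 bytes / 2)
        System.out.println(s.equals("Hi,\u60a8\u597d!"));   // true
    }
}
```

This only works because the bytes happen to be laid out as big-endian 16-bit code units already.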

Your next question is: “Is this the right approach?”

If you are just trying to demonstrate how the ByteBuffer, CharBuffer and Charset classes work, it's acceptable.

However, when you are coding an application, you will never write code like that. To begin with, there is no need for a byte array; you can represent the characters as a literal String:

String s = "Hi,\u60a8\u597d!";

If you want to convert the string to UTF-8 bytes, you can simply do this:

byte[] encodedBytes = s.getBytes(StandardCharsets.UTF_8);

If you're still using Java 6, you would do this instead:

byte[] encodedBytes = s.getBytes("UTF-8");

Update: Your byte array represents chars in the UTF-16BE (big-endian) encoding. Specifically, your array has exactly two bytes per character. That is not a valid UTF-8 encoded byte sequence, which is why you're getting the MalformedInputException.
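As a rough sketch of what that means in practice, the same array decodes cleanly when you name UTF-16BE as the charset, and the resulting String can then be re-encoded as UTF-8. The class and method names below (Utf16BeDemo, decodeUtf16Be, toUtf8) are hypothetical; the String constructor and getBytes overloads that take a Charset are available from Java 6 onward.

```java
import java.nio.charset.Charset;

public class Utf16BeDemo {
    // Decode bytes that are known to be UTF-16BE (two bytes per char here).
    public static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, Charset.forName("UTF-16BE"));
    }

    // Re-encode a String as UTF-8 bytes.
    public static byte[] toUtf8(String s) {
        return s.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        byte[] utf16 = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                        0x60, (byte) 0xA8, 0x59, 0x7D, 0x00, 0x21};
        String s = decodeUtf16Be(utf16);
        byte[] utf8 = toUtf8(s);
        System.out.println(s.equals("Hi,\u60a8\u597d!"));  // true
        System.out.println(utf8.length);                    // 10
    }
}
```

The UTF-8 form is 10 bytes rather than 12: the four ASCII characters take one byte each and the two Chinese characters take three bytes each.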

When characters are encoded as UTF-8 bytes, each character will be represented with 1 to 4 bytes. For your second code fragment to work, the array must be:

byte[] bytes = {
    0x48, 0x69, 0x2c,                       // ASCII chars are 1 byte each
    (byte) 0xe6, (byte) 0x82, (byte) 0xa8,  // U+60A8
    (byte) 0xe5, (byte) 0xa5, (byte) 0xbd,  // U+597D
    0x21
};

When converting from bytes to chars, my earlier statement still applies: You don't need ByteBuffer or CharBuffer or Charset or CharsetDecoder. You can use those classes, but usually it's more succinct to just create a String:

String s = new String(bytes, "UTF-8");

If you want a CharBuffer, just wrap the String:

CharBuffer cb = CharBuffer.wrap(s);

You may be wondering when it is appropriate to use a CharsetDecoder directly. You would do that if the bytes are coming from a source which is not under your control, and you have good reason to believe it may not contain properly UTF-8 encoded bytes. Using an explicit CharsetDecoder allows you to customize how invalid bytes will be handled.
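For illustration, here is one way such customization might look. This is only a sketch: the class name LenientDecodeDemo and both helper methods are hypothetical, and replacing bad input with U+FFFD is just one possible policy (CodingErrorAction also offers IGNORE and REPORT).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class LenientDecodeDemo {
    // Decode as UTF-8, substituting the replacement character for invalid bytes.
    public static String decodeLenient(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the checked exception.
            throw new IllegalStateException(e);
        }
    }

    // A new decoder reports errors by default, so a failed decode means invalid UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            Charset.forName("UTF-8").newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x48, (byte) 0xA8, 0x69};  // 0xA8 is a stray continuation byte
        System.out.println(isValidUtf8(bad));                          // false
        System.out.println(decodeLenient(bad).equals("H\uFFFDi"));     // true
    }
}
```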

Upvotes: 3

Abacus

Reputation: 19421

I just had a look at the sources. It boils down to two bytes from the byte buffer being combined into one character. The order in which the two bytes are used depends on the buffer's endianness; the default is big-endian.
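A small sketch of the endianness effect, assuming the same first two bytes as in the question (the class name ByteOrderDemo and the helper firstChar are made up for this example):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    // Returns the first char of the char view, using the given byte order.
    public static char firstChar(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).asCharBuffer().get(0);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x48};
        // Big-endian: 0x00 is the high byte, so the char is U+0048 ('H').
        System.out.println(Integer.toHexString(firstChar(bytes, ByteOrder.BIG_ENDIAN)));    // 48
        // Little-endian: 0x48 becomes the high byte, so the char is U+4800.
        System.out.println(Integer.toHexString(firstChar(bytes, ByteOrder.LITTLE_ENDIAN))); // 4800
    }
}
```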

Another approach with the NIO classes, besides what I wrote in the comments, is to use the CharsetDecoder.decode() method. Note that the charset you pass must match the encoding the bytes are actually in.

Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);

Upvotes: 0
