Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

Question

When an I/O stream manages 8-bit bytes of raw binary data, it is called a byte stream. And, when the I/O stream manages 16-bit Unicode characters, it is called a character stream.

Byte stream is clear. It uses 8-bit bytes. So if I were to write a character that uses 3 bytes it would only write its last 8 bits! Thus making incorrect output.

So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store.

Are you seeing pattern? We can't know how much bytes it will take for writing any of these characters until we convert them. So my question is why is it said that character streams manage 16-bit Unicode characters? When in case where I wrote Ạ that takes 3 bytes it didn't cut it to last 16-bits like byte streams cut last 8-bits. What does that quote even mean then?

Louis Wasserman · Accepted Answer

In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.

A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.

A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.

In UTF-16, Ạ is represented as the 16-bit char 0x1EA1. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.

If you converted it to bytes with the UTF-8 encoding, as here:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();

Then you would get the 3 byte sequence 0xE1 0xBA 0xA1 as expected.

Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

Answers (2)

Related Questions