Reputation: 1059
When an I/O stream manages 8-bit bytes of raw binary data, it is called a byte stream. And, when the I/O stream manages 16-bit Unicode characters, it is called a character stream.
Byte stream
is clear. It uses 8-bit bytes. So if I were to write a character that uses 3 bytes it would only write its last 8 bits! Thus making incorrect output.
So that is why we use character streams
. Say I want to write Latin Capital Letter Ạ
. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A
. Now it would take 1 byte to store.
Are you seeing pattern? We can't know how much bytes it will take for writing any of these characters until we convert them. So my question is why is it said that character streams manage 16-bit Unicode characters
? When in case where I wrote Ạ
that takes 3 bytes it didn't cut it to last 16-bits like byte streams
cut last 8-bits. What does that quote even mean then?
Upvotes: 3
Views: 311
Reputation: 198211
In Java, a String
is composed of a sequence of 16-bit char
s, representing text stored in the UTF-16 encoding.
A Charset
is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer
, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream
-- uses a Charset
to convert String
s to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset
.) A Reader
, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char
0x1EA1
. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();
Then you would get the 3 byte sequence 0xE1 0xBA 0xA1 as expected.
Upvotes: 3
Reputation: 272525
In Java, a character (char
) is always 16 bits, as can be seen from its max value - 65535. This is why the quote is not wrong. 16 bit is indeed a character.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane is encoded in 16 bits. (Yes 16 bit is enough for that) Every code point outside of the BMP is encoded with a pair of 16 bit characters, called surrogate pairs.
"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it with Java code. When you call println
on a PrintStream
, you are giving it a String
, which is a bunch of char
s under the hood, which is a bunch of 16-bits. So it is really managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character" which would refer to the upper/lower surrogates of the surrogate pair that you are printing.
Upvotes: 2