Reputation:
I've been trying to understand how operator<< of std::cout works in C++. I've found that it prints UTF-8 symbols correctly. For instance, take this simple program:
#include <iostream>

unsigned char t[] = "ي";
unsigned char m0 = t[0];
unsigned char m1 = t[1];

int main()
{
    std::cout << t << std::endl;          // Prints ي
    std::cout << (int)t[0] << std::endl;  // Prints 217
    std::cout << (int)t[1] << std::endl;  // Prints 138
    std::cout << m0 << std::endl;         // Prints �
    std::cout << m1 << std::endl;         // Prints �
}
How does the terminal that produces the output determine that it must interpret t as a single symbol ي, and not as two symbols � �?
Upvotes: 0
Views: 133
Reputation: 153899
You are dealing with two different types, unsigned char[] and unsigned char. If you were to do sizeof on t, you'd find that it occupies three bytes, and strlen( t ) will return 2 (after a cast to char const*, since strlen does not accept an unsigned char*). On the other hand, m0 and m1 are single characters.
When you output an unsigned char[], it is converted to an unsigned char*, and the stream outputs all of the bytes until it encounters a '\0' (which is the third byte in t). When you output an unsigned char, the stream outputs just that byte. So in your first line, the output device receives 2 bytes, and then the end of line. In the last two, it receives 1 byte, and then the end of line. And that byte, followed by the end of line, is not a legal UTF-8 sequence, so the display device displays something to indicate that there was an error, or that it did not understand.
When working with UTF-8 (or any other multibyte encoding), you cannot extract single bytes from a string and expect them to have any real meaning.
Upvotes: 4
Reputation: 79441
The terminal is determining how to display the bytes you are feeding it. You are feeding it a newline (std::endl) between the two bytes of the 2-byte UTF-8-encoded Unicode character. Instead of this:
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
Try this:
std::cout << m0 << m1 << std::endl; // Prints ي
Why do m0 and m1 print as � in your original code? Because your code is sending the bytes [217, 10, 138, 10], which is not interpretable as UTF-8. (Assuming std::endl corresponds to the \n character, value 10.)
Upvotes: 0