user2953119

UTF-8 symbol written to the terminal output

I've been trying to understand how operator<< of std::cout works in C++. I've found that it prints UTF-8 symbols. For instance, take this simple program:

#include <iostream>

unsigned char t[] = "ي";
unsigned char m0 = t[0];
unsigned char m1 = t[1];

int main()
{
    std::cout << t << std::endl;           // Prints ي
    std::cout << (int)t[0] << std::endl;   // Prints 217
    std::cout << (int)t[1] << std::endl;   // Prints 138
    std::cout << m0 << std::endl;          // Prints �
    std::cout << m1 << std::endl;          // Prints �
}

How does the terminal that produces the output determine that it must interpret t as a single symbol ي, and not as two symbols � �?

Upvotes: 0

Views: 133

Answers (2)

James Kanze

Reputation: 153899

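You are dealing with two different types, unsigned char[] and unsigned char. If you were to take sizeof t, you'd find that it occupies three bytes (the two bytes of the character plus the terminating '\0'), and strlen (after casting t to char const*, since strlen does not accept an unsigned char*) will return 2. On the other hand, m0 and m1 are single characters.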

When you output an unsigned char[], it is converted to an unsigned char*, and the stream outputs all of the bytes until it encounters a '\0' (which is the third byte in t). When you output an unsigned char, the stream outputs just that byte. So in your first line, the output device receives 2 bytes, and then the end of line. In the last two, it receives 1 byte, and then the end of line. And that byte, followed by the end of line, is not a legal UTF-8 sequence, so the display device displays something to indicate that there was an error, or that it did not understand.

When working with UTF-8 (or any other multibyte encoding), you cannot extract single bytes from a string and expect them to have any real meaning.

Upvotes: 4

Timothy Shields

Reputation: 79441

The terminal is determining how to display the bytes you are feeding it. You are feeding it a newline (std::endl) between the two bytes of the 2-byte UTF-8-encoded Unicode character. Instead of this:

std::cout << m0 << std::endl;       // Prints �
std::cout << m1 << std::endl;       // Prints �

Try this:

std::cout << m0 << m1 << std::endl; // Prints ي

Why do m0 and m1 print as � in your original code?
Because your code is sending the bytes [217, 10, 138, 10], which is not interpretable as UTF-8. (Assuming std::endl corresponds to the \n character, value 10.)

Upvotes: 0
