Why is UTF-8 coded on 2 bytes for a U+1xxxx character?

Question

I'm trying to figure out how C handles character codes in Unicode. I set my locale with LC_ALL to "fr_CA.utf8", then read input with wscanf() into an array of wchar_t. I then examined each byte and found something strange. I entered a treble clef ("𝄞", copied from a web page), which is U+1D11E. That code point needs 3 bytes, so I expected it to take 2 wchar_t. Instead I got 0x1e, 0xd1, 0x00, 0x00, the last two being the null terminator. Here is my code:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main ( int argc, char* argv[] )
{  
    setlocale( LC_ALL, "fr_CA.utf8" ); 

    wchar_t input[256];

    wscanf( L"%ls", input);
    wprintf( L"%ls\n", input );

    wprintf( L"Length = %zu\n", wcslen( input ) );
    wprintf( L"%d\n", (int)(input[0]&0x00ff) );
    wprintf( L"%d\n", (int)((input[0]&0xff00)>>8) );
    wprintf( L"%d\n", (int)(input[1]&0x00ff) );
    wprintf( L"%d\n", (int)((input[1]&0xff00)>>8) );
    wprintf( L"%d\n", (int)(input[2]&0x00ff) );
    wprintf( L"%d\n", (int)((input[2]&0xff00)>>8) );

    return 0;
}

My expectation was to have 0x1e, 0xd1, 0x01, 0x00, 0x00, 0x00...

But I have 0x1e, 0xd1, 0x00, 0x00...

What puzzles me is that wprintf( L"%ls\n", input ); actually prints the treble clef correctly. So what distinguishes the characters U+1D11E and U+D11E?

Also, I'm running my program in Konsole in Kubuntu 16.04 LTS and I compiled it with gcc 6.5.0... if it matters.

R.. GitHub STOP HELPING ICE · Accepted Answer

You would see what you expected if you printed the bytes making up the wchar_t values correctly, or if you just skipped that and printed their values without trying to break them up into bytes:

wprintf(L"%x\n", (int)input[0]);
wprintf(L"%x\n", (int)input[1]);

And the output would be:

1d11e
0
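
To see where the leading 0x01 went, it can help to split a wchar_t into all of its bytes rather than only the low two. The following is a minimal sketch, not code from the question: byte_of and print_wchar_bytes are hypothetical helper names, and the comments assume a 32-bit wchar_t, as on Linux with glibc.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return byte `index` of `value`,
   where index 0 is the least significant byte. */
unsigned byte_of(unsigned long value, unsigned index)
{
    return (unsigned)((value >> (8u * index)) & 0xffu);
}

/* Hypothetical helper: print every byte of one wchar_t, low byte first.
   Assumption: sizeof(wchar_t) is 4 (Linux/glibc), so four bytes appear. */
void print_wchar_bytes(wchar_t wc)
{
    for (unsigned i = 0; i < sizeof wc; i++)
        wprintf(L"0x%02x ", byte_of((unsigned long)wc, i));
    wprintf(L"\n");
}
```

Under that 32-bit assumption, calling print_wchar_bytes(input[0]) for the treble clef yields the bytes 0x1e, 0xd1, 0x01, 0x00. The masks 0x00ff and 0xff00 in the question only ever look at the first two of these, which is why the 0x01 never showed up.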

The way you tried to do this suggests that you're under the mistaken impression that wchar_t values are 16-bit and that there's such a thing as a "multi-wchar_t-character". The C language is very explicit that there's no such thing. Implementations with 16-bit wchar_t are wrong (or at least can't support Unicode outside the BMP). Of course one rather popular one is badly wrong...
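
If you want to check what wchar_t can hold on a given implementation, WCHAR_MAX and the standard macro __STDC_ISO_10646__ are reasonable things to look at. This is only a sketch; the function names wchar_covers_unicode and wchar_is_iso10646 are made up for illustration.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical helper: nonzero if a single wchar_t can represent every
   Unicode code point up to U+10FFFF (false for a 16-bit wchar_t). */
int wchar_covers_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Hypothetical helper: nonzero if the implementation defines
   __STDC_ISO_10646__, i.e. documents that wchar_t values are
   ISO/IEC 10646 (Unicode) code points. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

On a glibc-based system such as the one in the question, both functions would be expected to return 1. An implementation with a 16-bit wchar_t would return 0 from the first.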

I just noticed you've also mentioned UTF-8 in the title of your question, but the content has nothing to do with UTF-8 representation. wchar_t is (normally; not entirely required) a Unicode codepoint number, equivalent to UCS-4 (or UCS-2 on implementations that only support the BMP). While the locale's multibyte encoding almost certainly has to be UTF-8 in order for you to have access to that character (although GB18030 would also work), UTF-8 is not going to appear if you're working with all your streams as wide character streams.
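
To make the difference between a code point and its UTF-8 encoding concrete, here is a hand-written sketch of the UTF-8 encoding rule. It is an illustration only: in a real program the C library does this conversion for you (for example wcrtomb under a UTF-8 locale), and utf8_encode is a hypothetical name, not a library function.

```c
#include <stddef.h>

/* Hypothetical helper: encode code point `cp` as UTF-8 into `out`
   (room for 4 bytes). Returns the number of bytes written, or 0 if
   `cp` is a surrogate (U+D800..U+DFFF) or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For U+1D11E this produces the four bytes F0 9D 84 9E, while U+D11E produces the three bytes ED 84 9E. The two characters differ both as wchar_t values (0x1d11e versus 0xd11e) and as UTF-8 byte sequences, so nothing needs to disambiguate them.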
