Reputation: 2497
I'm trying to figure out how C handles Unicode character codes. I set my locale with LC_ALL to "fr_CA.UTF8", then read a character with wscanf() (into an array of wchar_t). I then examined each byte and found something strange. I entered a treble clef ("𝄞", copied from a web page), which is U+1D11E. That value needs 3 bytes, so I expected to get 2 wchar_t. Instead I got 0x1e, 0xd1, 0x00, 0x00, the last two being the terminating null character. Here is my code:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main ( int argc, char* argv[] )
{
    setlocale( LC_ALL, "fr_CA.utf8" );
    wchar_t input[256] = { 0 };
    wscanf( L"%255ls", input );   /* field width keeps the read inside input[] */
    wprintf( L"%ls\n", input );
    wprintf( L"Length = %zu\n", wcslen( input ) );   /* wcslen() returns size_t */
    /* low byte, then high byte, of the low 16 bits of each element */
    wprintf( L"%d\n", (int)(input[0]&0x00ff) );
    wprintf( L"%d\n", (int)((input[0]&0xff00)>>8) );
    wprintf( L"%d\n", (int)(input[1]&0x00ff) );
    wprintf( L"%d\n", (int)((input[1]&0xff00)>>8) );
    wprintf( L"%d\n", (int)(input[2]&0x00ff) );
    wprintf( L"%d\n", (int)((input[2]&0xff00)>>8) );
    return 0;
}
My expectation was to have 0x1e, 0xd1, 0x01, 0x00, 0x00, 0x00...
But I have 0x1e, 0xd1, 0x00, 0x00...
What puzzles me is that wprintf( L"%ls\n", input ); actually prints the treble clef correctly. So what distinguishes the characters U+1D11E and U+D11E?
Also, I'm running my program in Konsole in Kubuntu 16.04 LTS and I compiled it with gcc 6.5.0... if it matters.
Upvotes: 1
Views: 204
Reputation: 215193
You would see what you expected if you printed the bytes making up the wchar_t values correctly, or if you just skipped that and printed their values without trying to break them up into bytes:
wprintf(L"%x\n", (int)input[0]);
wprintf(L"%x\n", (int)input[1]);
And the output would be:
1d11e
0
The way you tried to do this suggests that you're under the mistaken impression that wchar_t values are 16-bit and that there's such a thing as a "multi-wchar_t character". The C language is very explicit that there's no such thing. Your masks 0x00ff and 0xff00 only look at the low 16 bits of each element, so the 0x01 sitting in the next byte of input[0] never gets printed, and input[1] is already the terminating null. Implementations with 16-bit wchar_t are wrong (or at least can't support Unicode outside the BMP). Of course one rather popular one is badly wrong...
I just noticed you've also mentioned UTF-8 in the title of your question, but the content has nothing to do with UTF-8 representation. wchar_t is (normally; not entirely required) a Unicode codepoint number, equivalent to UCS-4 (or UCS-2 on implementations that only support the BMP). While the locale's multibyte encoding almost certainly has to be UTF-8 in order for you to have access to that character (although GB18030 would also work), UTF-8 is not going to appear if you're working with all your streams as wide character streams.
Upvotes: 6