Reputation: 668
I'm using Ubuntu 12.04.
I want to know how I can read Chinese text in C.
setlocale(LC_ALL, "zh_CN.UTF-8");
scanf("%s", st1);
for (b = 0; b < max_w; b++)
{
    printf("%d ", st1[b]);
    if (st1[b] == 0)
        break;
}
With this code, English input prints fine, but if I enter a Chinese character such as "的", it outputs:
Enter word or sentence (EXIT to break): 的
target char seq :
-25 -102 -124 0
I'm wondering why there are negative values in the array.
Further, I found that the bytes of "的" read from a file using fscanf are different from the bytes read from the console.
Upvotes: 0
Views: 153
Reputation: 74028
UTF-8 encodes characters with a variable number of bytes. This is why you see three bytes for the 的 character.
At graphemica - 的, you can see that 的 has the value U+7684, which translates to E7 9A 84 when you encode it in UTF-8.
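To make the mapping from U+7684 to E7 9A 84 concrete, here is a minimal sketch of the UTF-8 encoding rule. The function name utf8_encode is hypothetical (it is not part of the C standard library), and the sketch only illustrates the bit layout; it is not how scanf produces the bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns how many were written,
   or 0 if cp is not a valid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x7684, buf) falls into the three-byte branch and fills buf with E7, 9A, 84.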
You print every byte separately as an integer value. A char type might be signed, and when it is converted to an integer, you can get negative numbers too. In your case, E7 becomes -25, 9A becomes -102 and 84 becomes -124.
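You can print the bytes as hex values with %x or as unsigned integers with %u. Cast each byte to unsigned char first; otherwise the negative value is sign-extended when it is promoted to int, and with a 32-bit int you would see something like ffffffe7 instead of e7.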
You can also change your print statement to
printf("%d ", (unsigned char) st1[b]);
which will interpret the bytes as unsigned values and show your output as
231 154 132 0
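As a rough illustration of the cast, the sketch below formats the bytes of a string as unsigned decimals. The helper name format_bytes and the choice to write into a caller-supplied buffer are assumptions made for the example, not part of the original code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of s, including the terminating 0,
   into out as space-separated unsigned decimal values.
   Returns the number of characters written, or -1 if out is too small. */
int format_bytes(const char *s, char *out, size_t outsz)
{
    size_t used = 0;

    for (size_t i = 0; ; i++) {
        /* The cast keeps bytes >= 0x80 in the range 128..255
           even when plain char is signed. */
        unsigned char byte = (unsigned char)s[i];
        const char *sep = used ? " " : "";
        int n = snprintf(out + used, outsz - used, "%s%d", sep, byte);

        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
        if (byte == 0)
            break;
    }
    return (int)used;
}
```

For the three bytes of 的 (written in C as "\xE7\x9A\x84"), the buffer ends up holding the same sequence as above: 231 154 132 0.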
Upvotes: 3
Reputation: 215221
There's no need (and in fact it's harmful) to hard-code a specific locale name. Which characters you can read is independent of the locale's language (which is used for messages), and any locale with UTF-8 encoding should work fine.
The easiest (but ugly once you try to go too far with it) way to make this work is to use the wide character stdio functions (e.g. getwc) instead of the byte-oriented ones. Otherwise you can read bytes then process them with mbrtowc.
Upvotes: 0