How can C read Chinese from the console and from a file?

Question

I'm using Ubuntu 12.04.
I want to know how I can read Chinese text using C.

  setlocale(LC_ALL, "zh_CN.UTF-8");
  scanf("%s", st1);
  for (b = 0; b < max_w;b++)
  {
    printf("%d ", st1[b]);
    if (st1[b] == 0)
        break;
  }

With this code, English input prints fine, but if I enter a Chinese character such as "的", it outputs:

Enter word or sentence (EXIT to break): 的
target char seq :
-25 -102 -124 0

I'm wondering why there are negative values in the array.
Further, I found that the bytes of "的" read from a file with fscanf are different from the bytes read from the console.

Olaf Dietsche · Accepted Answer

UTF-8 encodes characters with a variable number of bytes. This is why you see three bytes for the character 的.

On Graphemica's page for 的, you can see that 的 has the code point U+7684, which becomes the bytes E7 9A 84 when you encode it in UTF-8.

You print every byte separately as an integer value. The char type may be signed, and when a signed char is converted to int, every byte of 0x80 or above comes out as a negative number. In your case this is

-25 = E7
-102 = 9A
-124 = 84

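You can print the bytes as hex values with %hhx or as unsigned integers with %hhu, then you will see small positive numbers only. Note that plain %x or %u is not enough on its own: a negative char is promoted to int before it reaches printf, so -25 would print sign-extended as ffffffe7 (with a 32-bit int) unless you cast it to unsigned char first.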

You can also change your print statement to

printf("%d ", (unsigned char) st1[b]);

which will interpret the bytes as unsigned values and show your output as

231 154 132 0
