Alex Hansen

Reputation: 301

Reading and printing Chinese characters using fread() and printf()?

I am trying to read Chinese characters from an input file. I have found a few questions on the subject here, but nothing that works for me or suits my needs. I am using the fread() implementation from this question, but it is not working. I am running Linux.

  #define UNICODE
  #ifdef UNICODE
  #define _UNICODE
  #else
  #define _MBCS
  #endif

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>
  #include <string.h>
  #include <stdlib.h>
  int main(int argc, char * argv[]) {
         FILE *infile = fopen(argv[1], "r");
         wchar_t test[2] = L"\u4E2A";
         setlocale(LC_ALL, "");
         printf("%ls\n", test); //test
         wcscpy(test, L"\u4F60"); //test
         printf("%ls\n", test); //test
         for (int i = 0; i < 5; i++){
                 fread(test, 2, 2, infile);
                 printf("%ls\n", test);
         }
 return 0;
  }

I use the following text file to test it:

 一个人
 两本书
 三张桌子
 我喜欢一个猫                  

and the program outputs:

个 
你
������ 

Anyone have any wisdom on the subject?

Edit: Also, that's all of my code because I'm not sure where it fails. There's some stuff in there where I test to make sure I can print unicode wchars that isn't entirely relevant to the question.

Upvotes: 1

Views: 1910

Answers (2)

user3710044

Reputation: 2334

If you really need to read a UTF-8 file (or rather, a file in the locale's charmap) one codepoint at a time, you can use fscanf as below. But do note that this reads codepoints, not characters: a character may consist of multiple codepoints because of combining codes, and some codepoints are most definitely not printable.

#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>
#include <stdlib.h>
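int
main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    setlocale(LC_ALL, "");  // pick up the user's locale before any I/O
    FILE   *infile = fopen(argv[1], "r");
    if (infile == NULL) {
        perror(argv[1]);
        return 1;
    }
    wchar_t test[2] = L"\u4E2A";
    printf("%ls\n", test);  //test
    wcscpy(test, L"\u4F60");        //test
    printf("%ls\n", test);  //test
    for (int i = 0; i < 5; i++) {
        // %1ls skips whitespace, then stores one wide character plus L'\0'
        if (fscanf(infile, "%1ls", test) != 1)
            break;          // end of file or conversion error
        printf("%ls\n", test);
    }
    fclose(infile);
    return 0;
}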

Most of the time you probably won't need to use the locale functionality, because UTF-8 generally just works if you treat it as an opaque encoding. Part of this is because every byte of a non-ASCII character is in the 128..253 range (not a typo, 254 and 255 are unused; under the current RFC 3629 definition, 192, 193 and 245..253 never appear either). Another part is that the bytes 128..191 are always continuation bytes and all the start bytes for multi-byte characters are 192..253, which means an error will just break one character, not the rest of the stream. (Okay, codepoints vs characters is only really there to try to convince you that dividing UTF-8 up into "characters" probably won't do what you want.)
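To illustrate the "opaque bytes" point, here is a minimal sketch that counts codepoints without any locale support, just by skipping continuation bytes (bit pattern 10xxxxxx). The function names utf8_is_continuation and utf8_count_codepoints are made up for this example, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if b is a UTF-8 continuation byte,
   i.e. its top two bits are 10 (binary 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: counts codepoints in a NUL-terminated string by
   counting every byte that is NOT a continuation byte.
   Assumption: the input is valid UTF-8; no validation is done here. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    }
    return n;
}
```

Calling utf8_count_codepoints on the nine bytes of 一个人 gives 3. Calling it on "e" followed by the combining acute accent U+0301 (bytes 0xCC 0x81) gives 2, even though that displays as a single character, which is the codepoints vs characters distinction mentioned above.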

Upvotes: 1

chepner

Reputation: 531075

You are telling fread to read two 2-byte values in each call; however, the characters you want to read have 3-byte UTF-8 encodings, and fread copies the raw bytes into the buffer without converting them to wchar_t values. In general, you need to decode the UTF-8 stream as a whole, not in fixed-size byte chunks.
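As a rough sketch of what decoding involves (not necessarily how you would do it in production, where mbrtowc or fgetwc with a UTF-8 locale may be the better choice), the function below decodes one codepoint from a byte buffer. The name utf8_decode and its signature are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one codepoint from the first `len` bytes
   of `s` into *cp. Returns the number of bytes consumed (1..4), or 0 if
   the sequence is truncated or not valid UTF-8. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;         /* total bytes in this sequence */
    uint32_t value, min; /* decoded value; smallest value allowed for this length */

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        need = 2; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        need = 3; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        need = 4; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or invalid byte */
    }
    if (len < need)
        return 0;                         /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* expected 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;
    *cp = value;
    return need;
}
```

Given the three bytes 0xE4 0xB8 0xAA, this returns 3 and stores U+4E2A (个). Given only the first two of those bytes it returns 0, which is the situation a fixed 2-byte read runs into with 3-byte characters.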

Upvotes: 0
