Reading and outputting unicode in C

Question

FILE * f = fopen("filename", "r");
int c;

while((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}

Hello, I have searched for a whole hour, found many wise dissertations on Unicode, but no answer to this simple question:

What would be the shortest equivalent to these four lines that can handle UTF-8, on Linux using gcc and bash?

Thank you

teppic · Accepted Answer

Something like this should work, given your system:

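#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
   /* Select a UTF-8 locale so the wide-character functions decode UTF-8. */
   setlocale(LC_CTYPE, "en_GB.UTF-8");
   FILE * f = fopen("filename", "r");
   wint_t c;

   if (f == NULL) {
      return 1;
   }

   /* fgetwc reads one whole character, however many bytes it occupies. */
   while((c = fgetwc(f)) != WEOF) {
      wprintf(L"%lc\n", c);
   }
   fclose(f);
}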

The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, so your multibyte characters are corrupted by the \n printed between each of their bytes. With this version, the input is decoded as UTF-8, so %lc may now write several bytes for a single character (up to 4 in standard UTF-8; the original design allowed up to 6), and those bytes are kept together on output. If the input has any ASCII, it'll simply use one byte per character as previously (since ASCII is compatible with UTF-8).
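To make the "multibyte" point concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence determines how many bytes belong to the character. The helper name `utf8_seq_len` is made up for illustration; it is not part of the C library, and the locale-aware functions do this work for you.

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): returns the number of bytes
   in the UTF-8 sequence that starts with `lead`, or 0 if `lead` cannot
   start a sequence (a continuation byte or an invalid lead byte). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                 return 1;  /* 0xxxxxxx: plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx, e.g. \302 for £ */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx */
    return 0;  /* 10xxxxxx continuation byte, or a byte never valid as a lead */
}
```

For £ (bytes \302\243), the lead byte \302 reports a length of 2, while \243 on its own is only a continuation byte. Printing a newline after each byte therefore splits the pair and leaves two invalid fragments.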

strace is always useful for debugging things like this. As an example, suppose the file contains just ££ (£ has the UTF-8 sequence \302\243). Your version produces:

write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10

And mine,

write(1, "\302\243\n\302\243\n", 6) = 6

Note that once you read from or write to a stream (including stdout), it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change it. So, for example, if you wanted to read the UTF-8 file but leave stdout byte-oriented, you could replace the wprintf with:

  printf("%lc\n", c);

This involves extra code in the background (to convert the formats), but provides better compatibility with other code that expects a byte stream.
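If you are unsure which orientation a stream has ended up with, the standard fwide function can report it. Here is a small sketch; `stream_orientation` is a hypothetical wrapper name, not a library function. Calling fwide with a mode of 0 only queries the stream and does not set anything.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical wrapper: reports a stream's orientation without changing it.
   Returns 1 for wide, -1 for byte, 0 if no orientation has been set yet. */
int stream_orientation(FILE *fp)
{
    int r = fwide(fp, 0);          /* mode 0: query only */
    return (r > 0) - (r < 0);      /* normalise to 1, -1 or 0 */
}
```

A freshly opened stream should report 0. After a first call such as printf or fputc it should report -1, and after wprintf or fputwc it should report 1.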
