pouzzler

Reputation: 1834

Reading and outputting Unicode in C

FILE * f = fopen("filename", "r");
int c;

while((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}

Hello, I have searched for a whole hour, found many wise dissertations on Unicode, but no answer to this simple question:

What would be the shortest equivalent to these four lines that can handle UTF-8, on Linux using gcc and bash?

Thank you

Upvotes: 4

Views: 964

Answers (1)

teppic

Reputation: 8205

Something like this should work, given your system:

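#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
   /* Select a UTF-8 locale so the wide I/O functions decode and encode UTF-8.
      Passing "" instead takes the locale from the environment. */
   setlocale(LC_CTYPE, "en_GB.UTF-8");
   FILE * f = fopen("filename", "r");
   wint_t c;

   if (f == NULL) {
      perror("filename");
      return 1;
   }

   /* fgetwc returns one whole character per call, however many bytes it uses. */
   while((c = fgetwc(f)) != WEOF) {
      wprintf(L"%lc\n", c);
   }
   fclose(f);
   return 0;
}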

The problem with your original code is that C neither knows nor cares that the characters are multibyte, so the \n printed after every byte splits each multibyte character apart. In this version the input is decoded as UTF-8, so each %lc may expand to several bytes (up to 4 in standard UTF-8 as defined by RFC 3629; the original design allowed up to 6), and those bytes are written out together. If the input has any ASCII, it still uses one byte per character as before, since ASCII is a subset of UTF-8.

strace is always useful for debugging things like this. For example, if the file contains just ££ (£ is the UTF-8 sequence \302\243), your version produces:

write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10

And mine:

write(1, "\302\243\n\302\243\n", 6)     = 6
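To see the difference without strace or a UTF-8 locale, here is a rough sketch that reproduces both outputs in memory. The helper names (utf8_seq_len, newline_per_byte, newline_per_code_point) are made up for illustration and are not part of any library; the sketch assumes reasonably well-formed UTF-8 and that out has room for 2 * strlen(in) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes (1 to 4) of the UTF-8 sequence introduced by `lead`.
   Stray continuation bytes and invalid lead bytes count as 1 so that
   callers always make progress. */
size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Mimics the fgetc/printf("%c\n") loop: '\n' after every byte.
   Returns the number of bytes written, not counting the final '\0'. */
size_t newline_per_byte(const char *in, char *out) {
    char *start = out;
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        *out++ = *in++;
        *out++ = '\n';
    }
    *out = '\0';
    return (size_t)(out - start);
}

/* Mimics the fgetwc/wprintf loop: '\n' after each complete sequence.
   Returns the number of bytes written, not counting the final '\0'. */
size_t newline_per_code_point(const char *in, char *out) {
    char *start = out;
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        size_t n = utf8_seq_len((unsigned char)*in);
        /* Copy the whole sequence, stopping early if the input is truncated. */
        while (n-- > 0 && *in != '\0') {
            *out++ = *in++;
        }
        *out++ = '\n';
    }
    *out = '\0';
    return (size_t)(out - start);
}
```

Feeding the four bytes "\302\243\302\243" (££ with no trailing newline) to newline_per_byte gives the broken "\302\n\243\n\302\n\243\n", while newline_per_code_point gives "\302\243\n\302\243\n", the same 6 bytes as the second strace line.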

Note that once you read or write to a stream (including stdout) it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change it. So for example, if you wanted to read the UTF-8 file, but leave stdout as byte orientated, you could replace the wprintf with:

  printf("%lc\n", c);

This involves extra code in the background (to convert the formats), but provides better compatibility with other code that expects a byte stream.
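If you want to check the orientation rule for yourself, fwide(stream, 0) reports the current orientation without changing it: positive for wide, negative for byte, zero if not yet set. A small sketch, using tmpfile() as a scratch stream; the two function names are hypothetical helpers, not library calls.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Does one write on a fresh temporary stream and reports the result:
   > 0 wide-oriented, < 0 byte-oriented, 0 if tmpfile() failed. */
int orientation_after_first_write(int use_wide) {
    FILE *f = tmpfile();
    int result;
    if (f == NULL) return 0;
    assert(fwide(f, 0) == 0);   /* a fresh stream has no orientation yet */
    if (use_wide) {
        fputwc(L'x', f);        /* first operation is wide */
    } else {
        fputc('x', f);          /* first operation is byte */
    }
    result = fwide(f, 0);       /* query only, does not change anything */
    fclose(f);
    return result;
}

/* Writes a byte first, then asks for wide orientation. The request is
   ignored because the orientation is already fixed.
   Returns < 0 (still byte-oriented), or 0 if tmpfile() failed. */
int try_wide_after_byte_write(void) {
    FILE *f = tmpfile();
    int result;
    if (f == NULL) return 0;
    fputc('x', f);
    result = fwide(f, 1);       /* asks for wide, reports what it really is */
    fclose(f);
    return result;
}
```

orientation_after_first_write(1) should come back positive and orientation_after_first_write(0) negative, while try_wide_after_byte_write() stays negative, which is why the stream has to be re-opened to switch.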

Upvotes: 6
