Reputation: 1834
FILE * f = fopen("filename", "r");
int c;
while ((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}
Hello, I have searched for a whole hour and found many wise dissertations on Unicode, but no answer to this simple question:
what is the shortest equivalent to these four lines that can handle UTF-8, on Linux using gcc and bash?
Thank you
Upvotes: 4
Views: 964
Reputation: 8205
Something like this should work, given your system:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    /* Select a UTF-8 locale so the wide-character functions decode UTF-8. */
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    FILE * f = fopen("filename", "r");
    wint_t c;
    /* fgetwc() reads one whole (possibly multibyte) character at a time. */
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
}
The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, so each multibyte character gets corrupted by the \n inserted between its bytes. With this version, a character is treated as UTF-8, and so %lc may represent as many as 4 actual bytes, which are guaranteed to be output correctly. If the input contains any ASCII, it'll simply use one byte per character as previously (since ASCII is compatible with UTF-8).
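Purely as an illustration of that point (this is not part of the answer's approach), here is a byte-oriented sketch that keeps the bytes of each UTF-8 character together: it only starts a new line before a byte that begins a new character, relying on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. It assumes the input is valid UTF-8 and uses the same hard-coded "filename" as the question.
#include <stdio.h>

int main() {
    FILE * f = fopen("filename", "r");
    if (f == NULL)
        return 1;
    int c;
    int first = 1;
    while ((c = fgetc(f)) != EOF) {
        /* A UTF-8 continuation byte looks like 10xxxxxx, so only insert a
           newline before a byte that starts a new character. */
        if (!first && (c & 0xC0) != 0x80)
            putchar('\n');
        putchar(c);
        first = 0;
    }
    putchar('\n');
    fclose(f);
}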
strace is always useful for debugging things like this. As an example, suppose the file contains just ££ (£ has the UTF-8 sequence \302\243). Your version produces:
write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10
And mine,
write(1, "\302\243\n\302\243\n", 6) = 6
Note that once you read from or write to a stream (including stdout), it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change that. So, for example, if you wanted to read the UTF-8 file but leave stdout byte oriented, you could replace the wprintf with:
printf("%lc\n", c);
This involves extra work in the background (to convert between the wide and multibyte formats), but provides better compatibility with other code that expects a byte stream.
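For completeness, a minimal sketch of that byte-oriented-stdout variant, under the same assumptions as above (the en_GB.UTF-8 locale and a file named "filename"), might look like:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "en_GB.UTF-8");
    FILE * f = fopen("filename", "r");
    if (f == NULL)
        return 1;
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        /* %lc converts the wide character back to multibyte (UTF-8) before
           writing, so stdout keeps its byte orientation. */
        printf("%lc\n", c);
    }
    fclose(f);
}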
Upvotes: 6