Reputation: 1834
On my system, a pretty normal Ubuntu 13.10, the French accented characters "éèàçù..." are always handled correctly by whatever tools I use, even though the LC_ environment variables are set to en_US.UTF-8. In particular, command-line utilities like grep, cat, ... always read and print these characters without a hitch.
Yet a program as small as
#include <stdio.h>
int main() {
    printf("%c", getchar());
    return 0;
}
fails when the user enters "é".
From the man pages, and a lot of googling, there is no standard way to close stdout and then reopen it. According to man fwide(), if stdout is in byte mode, I can't switch it to wide-character mode short of closing it and reopening it, so I can't use getwchar() and wprintf().
I can't believe that every single utility like cat, grep, etc... reimplements a way to manage wide characters, yet from my research, I see no other way.
Is it my system that has a problem? I can't see how since every utility works flawlessly. What am I missing, please?
Upvotes: 1
Views: 870
Reputation: 239051
When a C program starts, stdout, stdin and stderr are neither byte- nor wide-character oriented. fwide(stdin, 0) should return 0 at this point.
If you expand your minimal program to:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main()
{
    setlocale(LC_ALL, "");
    printf("%lc\n", getwchar());
    return 0;
}
then it should work as you expect. (There is no need to explicitly set the orientation of stdin here: since the first operation on it is a wide-character operation, it will have wide-character orientation.)
You do need to use getwchar() instead of getchar() if you want to read a wide character with it, though.
Upvotes: 3
Reputation: 399863
The utilities you mention are generally line-oriented. If you were to try to read a whole line with e.g. fgets() rather than a single character, I think it'll work for you, too.
When you start reading single characters (which may be just bytes, and often are), you are of course very much susceptible to encoding issues.
Reading full lines will work just fine, as long as the line-termination encoding is not misunderstood (and for UTF-8 it won't be).
Upvotes: 0
Reputation: 6879
UTF-8 characters are read as bytes, not as characters, and non-ASCII characters take more than one byte. Check this question for more info.
Upvotes: 0