Brian Schlenker

Reputation: 5446

How do C++ and g++ deal with Unicode?

I'm trying to figure out the proper way to deal with Unicode in C++. I want to understand how g++ handles wide-character string literals and regular C strings containing Unicode characters. I have set up some basic tests and don't really understand what is happening.

wstring ws1(L"«¬.txt"); // these first 2 characters correspond to 0xAB, 0xAC
string s1("«¬.txt");

ifstream in_file( s1.c_str() );
// wifstream in_file( s1.c_str() ); // this throws an exception when I 
                                    // call in_file >> s;
string s;
in_file >> s; // s now contains «¬

wstring ws = textToWide(s);

wcout << ws << endl; // these two lines work independently of each other,
                     // but combining them makes the second one print incorrectly
cout << s << endl;
printf( "%s", s.c_str() ); // same case here, these work independently of one another,
                           // but calling one after the other makes the second call
                           // print incorrectly
wprintf( L"%ls", ws.c_str() ); // %ls takes a wchar_t*; in wprintf, %s expects a narrow char*

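// requires <cwchar>, <string> and <vector>
wstring textToWide(const string &s)
{
    mbstate_t mbstate = mbstate_t();   // conversion must start in the initial shift state
    const char *src = s.c_str();       // mbsrtowcs updates this pointer, so convert from a local copy of it
    size_t numchars = mbsrtowcs(0, &src, 0, &mbstate); // null destination: only count wide characters
    if (numchars == (size_t)-1)
        return wstring();              // invalid multibyte sequence for the current locale
    vector<wchar_t> buff(numchars + 1); // +1 for the terminating L'\0'
    mbstate = mbstate_t();             // restart from the initial state for the real conversion
    mbsrtowcs(&buff[0], &src, buff.size(), &mbstate);
    return wstring(&buff[0], numchars);
}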

It seems like calls to wcout and wprintf corrupt the stream somehow, and that it is always safe to call cout and printf as long as strings are encoded as utf-8.

Would the best way to handle Unicode be to convert all input to wide before processing, and convert all output to UTF-8 before sending it to output?

Upvotes: 1

Views: 2376

Answers (1)

n. m. could be an AI

Reputation: 120059

The most comprehensive way to handle Unicode is to use a Unicode library such as ICU. Unicode has many more aspects than a bunch of encodings. C++ does not offer APIs to work with any of these extra aspects. ICU does.

If you only want to handle encodings, then a somewhat working way is to use built-in C++ methods correctly. This includes calling

std::setlocale(LC_ALL, 
               /*some system-specific locale name, probably */ "en_US.UTF-8")

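at the beginning of the program. It also means not mixing cout/printf with wcout/wprintf in the same program: the first output operation on stdout fixes its orientation as either byte or wide, and output of the other kind then misbehaves. (You can use regular and wide stream objects other than the standard handles in the same program.)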
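The reason the two families of output functions clash can be observed with std::fwide, which reports a C stream's orientation. The following is only a small sketch: the helper name orientation_after_first_write is made up for illustration, and it writes to a temporary file rather than stdout so that the program's own output stream is left alone.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>

// Hypothetical helper (not a standard API): open a temporary stream, perform
// one write of the requested kind, and report the resulting orientation.
// Returns < 0 for byte-oriented, > 0 for wide-oriented,
// and 0 if the temporary file could not be created.
int orientation_after_first_write(bool wide_first) {
    std::FILE* f = std::tmpfile();
    if (!f) return 0;
    assert(std::fwide(f, 0) == 0);   // a freshly opened stream has no orientation yet
    if (wide_first)
        std::fputws(L"wide\n", f);   // first operation is wide: stream becomes wide-oriented
    else
        std::fputs("narrow\n", f);   // first operation is narrow: stream becomes byte-oriented
    int o = std::fwide(f, 0);        // mode 0 only queries, it does not change anything
    std::fclose(f);
    return o;
}
```

Once a stream has an orientation, only closing and reopening it (or freopen) resets it, which is why picking one family of functions for stdout and sticking to it is the practical rule.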

Converting all input to wide and converting all output to UTF-8 is a reasonable strategy. Working with UTF-8 throughout is reasonable too. A lot depends on your application. C++11 has built-in UTF-8, UTF-16 and UTF-32 string literals (u8"...", u"..." and U"..."), along with the std::u16string and std::u32string types, which simplify the task somewhat.
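As an illustration of the "decode on input" half of that strategy, here is a hedged sketch of a locale-independent UTF-8 to UTF-32 decoder that fills a C++11 std::u32string. The function name utf8_to_utf32 is hypothetical and the code is not taken from any library; it assumes the input really is UTF-8 and throws on malformed data instead of attempting recovery.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws std::invalid_argument on malformed, truncated, overlong,
// surrogate, or out-of-range sequences.
std::u32string utf8_to_utf32(const std::string& in) {
    // Smallest code point that legitimately needs 0, 1, 2 or 3 continuation bytes.
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;                       // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);         // append the 6 payload bits
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Unlike mbsrtowcs, this does not consult the current locale, so it behaves the same whether or not setlocale has been called. The trade-off is that it only understands UTF-8; for anything beyond encoding conversion, a library such as ICU remains the more complete option.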

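Whatever you do, don't use elements of the extended character set (such as « or ¬) in ordinary or wide string literals; how they end up encoded depends on the compiler's source and execution character sets. (In C++11 it's OK to use them in UTF-8/16/32 string literals.)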
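One way to follow this advice, sketched below, is to spell the characters as universal character names inside C++11 UTF-16 or UTF-32 literals, so the source file itself stays pure ASCII. The function names are made up for illustration; the file name is the one from the question.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers: the question's file name written with universal
// character names (\u00AB is «, \u00AC is ¬) in UTF-32 and UTF-16 literals.
std::u32string file_name_utf32() { return U"\u00AB\u00AC.txt"; }
std::u16string file_name_utf16() { return u"\u00AB\u00AC.txt"; }

// Checks that both literals hold one code unit per character with the expected values.
void check_literals() {
    std::u32string a = file_name_utf32();
    std::u16string b = file_name_utf16();
    assert(a.size() == 6 && a[0] == 0xAB && a[1] == 0xAC);
    assert(b.size() == 6 && b[0] == 0xAB && b[1] == 0xAC);
}
```

Both characters lie in the Basic Multilingual Plane, so the UTF-16 string also has exactly one code unit per character; that would not hold for code points above U+FFFF, which take two UTF-16 code units.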

Upvotes: 1
