Reputation: 871
I want strings with Unicode characters to be correctly handled in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode char has this form: "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of char? (Why is there a 'u' after the '\'? What's the effect?)
On the internet I see examples using "wide strings" or wchar_t. So what's the suitable object to handle Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store a JSON that could have "wide characters", but those characters take more than one byte to be represented... I don't understand.
This is the kind of temporary arrangement I found for the moment (unicode -> utf8? ascii?; listFolder is a std::string):
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
Upvotes: 0
Views: 3985
Reputation: 70213
The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which turned out to be a failure because Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instantiations of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that have more than one replacement character (e.g. the German ß would need to be expanded to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number representing a UCS-2 code point, which is equivalent to a UTF-16 code point within the Basic Multilingual Plane. The XXXXXXXX is a 32-bit hex number, representing a UTF-32 code point, which may be in any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding.
u"..." is always a const char16_t[] in UTF-16 encoding.
U"..." is always a const char32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string, .substr() can lead to partial and invalid sequences, .find_first_of() will not work as expected, and so on.
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).
Upvotes: 5
Reputation: 238291
In a Unicode string, I can see that a Unicode char has this form: "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of char? (Why is there a 'u' after the '\'? What's the effect?)
That is a Unicode character escape sequence. It will be interpreted as a Unicode character. The u after the escape character is part of the syntax, and it's what differentiates it from other escape sequences. Read the documentation for more information.
So what's the suitable object to handle Unicode characters?
char for UTF-8
char16_t for UTF-16
char32_t for UTF-32
wchar_t is platform dependent, so you cannot make portable assumptions about which encoding it suits.
we can use const char* to store a JSON that could have "wide characters", but those characters take more than one byte to be represented...
If you mean that you can store multi-byte UTF-8 characters in a char string, then you're correct.
This is the kind of temporary arrangement I found for the moment (unicode -> utf8? ascii?; listFolder is a std::string)
What you're attempting to do there is replace some Unicode escape sequences with characters that have a platform-defined encoding. If you have other Unicode characters besides those, then you end up with a string that has mixed encoding. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert encodings or to do any other manipulation on encoded strings.
Upvotes: 2