Reputation: 357
I need to write an app for an embedded device using C++. I may need to support Unicode too (though I am not an expert on it). I also had a look at Joel Spolsky's article on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
My question: given the above, what is the way to go with Unicode in such an application in C++? Should I use wchar_t everywhere, or std::wstring?
What problems might I encounter if I use wchar_t all the time? (This post mentions some problems one might encounter with Unicode strings: Switching from std::string to std::wstring for embedded applications? - but I am still confused about what exactly to do.)
Upvotes: 1
Views: 547
Reputation: 70391
"Supporting" Unicode goes well beyond using wchar_t or std::wstring (which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform"). Think of things like isalpha(), tokenizing, converting to / from different encodings etc., and you get the idea.
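To illustrate the isalpha() point, here is a minimal sketch, assuming the program runs in the default "C" locale and that the input is UTF-8. The helper name all_bytes_alpha is made up for this example. It shows that a byte-wise classification works for plain ASCII but rejects a word containing "é", because the two UTF-8 bytes of that letter are not alphabetic characters on their own.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if every *byte* of s is alphabetic according
// to std::isalpha in the current C locale. It knows nothing about UTF-8.
inline bool all_bytes_alpha(const std::string& s) {
    for (unsigned char c : s) {
        if (!std::isalpha(c)) {
            return false;
        }
    }
    return true;
}
```

With the default locale, all_bytes_alpha("cafe") holds, while all_bytes_alpha("caf\xC3\xA9") (the UTF-8 bytes of "café") does not. Getting this right for arbitrary Unicode text needs character property data, which is what a library like ICU provides.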
Unless you know you can get away with built-in stuff like wchar_t / std::wstring (and you wouldn't ask in that case), you are better off using the ICU library, which is the state-of-the-art implementation for Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)
The C way of doing Unicode in ICU is arrays of type UChar[] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar[] for claims of performance (shared references, memory pooling, copy-on-write etc.), but still fails to outperform icu::UnicodeString, so you might feel safe in using the latter even in an embedded environment. They did a good job there.
Post scriptum: Take note that wchar_t is of implementation-defined width: 32 bits on the Unixes I know of, and 16 bits on Windows. That gives additional trouble, since wchar_t should be "wide", but UTF-16 is still "multibyte" when it comes to Unicode (characters outside the Basic Multilingual Plane take two 16-bit units). If you can rely on the environment supporting C++11, char16_t and char32_t would be better choices, yet they are still agnostic of finer print like combining characters.
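The "UTF-16 is still multibyte" remark can be made concrete with a short sketch, assuming a C++11 compiler and well-formed UTF-16 input. The helper names below are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// First half of a surrogate pair (U+D800..U+DBFF).
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Second half of a surrogate pair (U+DC00..U+DFFF).
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the code point it encodes.
inline char32_t combine_surrogates(char16_t hi, char16_t lo) {
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}

// Count code points in a well-formed UTF-16 string: every code unit
// except a low surrogate starts a new code point.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (!is_low_surrogate(u)) {
            ++n;
        }
    }
    return n;
}
```

For example, std::u16string(u"\U0001F600").size() is 2 while count_code_points reports 1. Note that even a code point count does not match what a user perceives as a character: u"e\u0301" ("e" plus a combining acute accent) is two code points for one visible letter, which is the "finer print" mentioned above.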
Upvotes: 6
Reputation: 10988
You've read Joel's article, but it seems you have not understood it. std::wstring and strings of wchar_t are not Unicode as such; they are wide-character strings that may contain UCS-2 or UTF-16 Unicode text, or something else. Likewise, std::string may contain plain ASCII, ANSI text in some code page, UTF-8 Unicode text, or something else.
Both of these occur often: the std::wstring tends to be UTF-16 on Windows, std::string tends to be UTF-8 on POSIX.
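As a small sketch of what "std::string containing UTF-8" means in practice (assuming the input is well-formed UTF-8; the function name is made up for this example): size() reports bytes, so counting code points means skipping continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Count code points in a well-formed UTF-8 string by counting every byte
// that is not a continuation byte (continuation bytes look like 10xxxxxx).
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For std::string("caf\xC3\xA9") (the UTF-8 bytes of "café"), size() is 5 but utf8_code_points gives 4. The container itself has no idea which encoding it holds; that knowledge lives in the code that fills and reads it.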
DevSolar's advice is sound: have a look at ICU instead, it'll save you from an awful lot of headaches and misunderstandings.
Upvotes: 0