pseudonym_127

Reputation: 357

Unicode strings in embedded software

I need to write an app for an embedded device in C++. I may need to support Unicode too (though I am not an expert on it). I have also read Joel Spolsky's article about Unicode: http://www.joelonsoftware.com/articles/Unicode.html

My question: given the above, what is the way to go with Unicode in such an application in C++? Should I use wchar_t everywhere, or std::wstring?

What problems might I encounter if I use wchar_t all the time? (This post mentions some problems one might run into with Unicode strings: Switching from std::string to std::wstring for embedded applications? But I am still confused and don't know exactly what to do.)

Upvotes: 1

Views: 547

Answers (2)

DevSolar

Reputation: 70391

"Supporting" Unicode goes well beyond using wchar_t or std::wstring (which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform").

Think of things like isalpha(), tokenizing, converting to / from different encodings etc., and you get the idea.
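To make the isalpha() point concrete, here is a minimal sketch (not part of the original answer) of what happens when the byte-oriented std::isalpha() is applied to UTF-8 text. The helper name count_alpha_bytes is made up for illustration, and the sketch assumes the program is still in the default "C" locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that std::isalpha() accepts.
// Assumes the default "C" locale, where only A-Z and a-z are alphabetic.
inline std::size_t count_alpha_bytes(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        // The cast avoids undefined behaviour for negative char values.
        if (std::isalpha(static_cast<unsigned char>(c))) {
            ++n;
        }
    }
    return n;
}
```

With this, count_alpha_bytes("abc") gives 3, but the UTF-8 encoding of "é" (the two bytes 0xC3 0xA9) gives 0: the classification works on single bytes and knows nothing about the letter those bytes encode.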

Unless you know you can get away with built-in stuff like wchar_t / std::wstring (and you wouldn't ask in that case), you are better off using the ICU library, which is the state-of-the-art implementation for Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)

The C way of doing Unicode in ICU is arrays of type UChar [] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar [] for claims of performance (shared references, memory pooling, copy-on-write etc.), but it still fails to outperform icu::UnicodeString, so you might feel safe in using the latter even in an embedded environment. They did a good job there.

Post scriptum: Take note that wchar_t is of implementation-defined width: 32 bits on the Unixes I know of, and 16 bits on Windows. That gives additional trouble, since wchar_t is supposed to be "wide", but UTF-16 is still a variable-length ("multibyte") encoding when it comes to Unicode. If you can rely on the environment supporting C++11, char16_t or char32_t would be better choices, yet they are still agnostic of finer print like combining characters.
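A small sketch of the last point, assuming a C++11 compiler. The function names are made up for illustration. It shows that one code point can need two UTF-16 code units, and that even UTF-32 does not give "one element per visible character" once combining characters are involved.

```cpp
#include <cassert>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 has to
// encode it as a surrogate pair (two char16_t code units).
inline std::u16string emoji_utf16() { return u"\U0001F600"; }

// The same code point fits into a single char32_t code unit.
inline std::u32string emoji_utf32() { return U"\U0001F600"; }

// "e" followed by U+0301 COMBINING ACUTE ACCENT: rendered as one
// character, but stored as two code points even in UTF-32.
inline std::u32string combining_e_acute() { return U"e\u0301"; }
```

Here emoji_utf16().size() is 2 and emoji_utf32().size() is 1, while combining_e_acute().size() is 2 although a reader sees a single "é". Treating that as one character (and handling normalization in general) is the kind of logic a library like ICU provides and the plain string types do not.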

Upvotes: 6

Joris Timmermans

Reputation: 10988

You've read Joel's article, but it seems you have not understood it. std::wstring and strings of wchar_t are not Unicode as such; they are wide-character strings that may contain UCS-2 or UTF-16 Unicode text, or something else. Likewise, std::string may contain plain ASCII, ANSI text in some code page, UTF-8 Unicode text, or something else.

Both of these occur often: the std::wstring tends to be UTF-16 on Windows, std::string tends to be UTF-8 on POSIX.
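As an illustration of a std::string that merely carries UTF-8, here is a minimal sketch (the helper name count_code_points_utf8 is made up, and the input is assumed to be well-formed UTF-8). The string class only ever sees bytes, so size() and the number of code points can differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// well-formed UTF-8, by skipping continuation bytes (pattern 10xxxxxx).
// It does no validation and knows nothing about combining characters.
inline std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}
```

For the UTF-8 bytes of "naïve", written as "na\xC3\xAFve", size() reports 6 while count_code_points_utf8() reports 5. Nothing in std::string records which encoding the bytes are in, so that knowledge has to live in your code or in a library.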

DevSolar's advice is sound: have a look at ICU instead; it'll save you from an awful lot of headaches and misunderstandings.

Upvotes: 0
