Reputation: 40319
Suppose we have an arbitrary string, s.
s has the property of being from just about anywhere in the world. People from USA, Japan, Korea, Russia, China and Greece all write into s from time to time. Fortunately, we don't have any time travellers writing in Linear A.
For the sake of discussion, let's presume we want to do string operations such as:
and, just because this is for the sake of discussion, let's presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.
There are 3 Unicode encoding forms: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let's say I'm sorta dumb, and I want one Unicode to rule them all (because rolling a dynamically adapting library for 3 different kinds of string encodings that hides the difference from the API user sounds hard).
--
The point of this question is to educate myself and others in useful and usable information for Unicode: reading the RFCs is fine, but there's a 'stack' of information related to compilers, languages, and operating systems that the RFCs do not cover, but is vital to know to actually use Unicode in a real app.
Upvotes: 16
Views: 1270
Reputation: 106609
Which encoding is most general
Probably UTF-32, though all three encodings can represent any Unicode character. UTF-32 has the property that every code point is encoded in exactly one code unit.
Which encoding is supported by wchar_t
None. That's implementation-defined. On most Windows platforms it's UTF-16; on most Unix platforms it's UTF-32.
Which encoding is supported by the STL
None, really. The STL can store any character type you want: just use the std::basic_string<T>
template with a type large enough to hold your code units. Most operations (e.g. std::reverse
) do not know about any sort of Unicode encoding, though.
Are these encodings all(or not at all) null-terminated?
No. NUL (0x00) is a legal value in all of these encodings. Technically, NUL is a legal character in plain ASCII too. NUL termination is a C convention -- not an encoding feature.
Choosing how to do this has a lot to do with your platform. If you're on Windows, use UTF-16 and wchar_t strings, because that's what the Windows API uses to support unicode. I'm not entirely sure what the best choice is for UNIX platforms but I do know that most of them use UTF-8.
Upvotes: 9
Reputation: 494
In response to your final bullet, UTF-8 is guaranteed not to have NULL bytes in its encoding of any character (except NULL itself, of course). As a result, many functions that work with NULL-terminated strings also work with UTF-8 encoded strings.
Upvotes: 2
Reputation: 24561
Define "real app" :)
Seriously, the decision really depends a lot on the kind of software you are developing. If your target platform is the Win32 API (with or without wrappers such as MFC, WTL, etc.) you would probably want to use wstring
types with the text encoded as UTF-16. That's simply because the Win32 API uses that encoding internally anyway.
On the other hand, if your output is something like XML/HTML and/or needs to be delivered over the internet, UTF-8 is pretty much the standard -- it passes cleanly through protocols and software that assume 8-bit characters.
As for UTF-32, I can't think of a single reason to use it, unless you need 1:1 mapping between code units and code points (that still does not mean 1:1 mapping between code units and characters!).
For more information, be sure to look at Unicode.org. This FAQ may be a good starting point.
Upvotes: 1
Reputation: 5637
Have a look at the open source library ICU, especially at the Docs & Papers section. It's an extensive library dealing with all sorts of Unicode oddities.
Upvotes: 5