Carl Seleborg

Reputation: 13305

C++ strings: UTF-8 or 16-bit encoding?

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).

There are a few wishes/constraints:

Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.
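
For concreteness, this is the kind of UTF-8-specific helper I mean; a minimal sketch (the function name is just for illustration) that counts code points in a UTF-8 std::string by skipping continuation bytes:

    #include <cstddef>
    #include <string>

    // Count Unicode code points in a UTF-8 encoded std::string by skipping
    // continuation bytes (those of the form 10xxxxxx). Assumes the input is
    // valid UTF-8; a real helper would also validate.
    std::size_t utf8_length(const std::string& s)
    {
        std::size_t count = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)   // not a continuation byte => new code point
                ++count;
        return count;
    }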

If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?

Upvotes: 12

Views: 18939

Answers (8)

basszero

Reputation: 30014

MicroATX is pretty much a standard PC motherboard format, and most such boards can take 4-8 GB of RAM. If you're talking picoATX, maybe you're limited to 1-2 GB of RAM. Even then, that's plenty for a development environment. I'd still stick with UTF-8 for the reasons mentioned above, but memory shouldn't be your concern.

Upvotes: 2

user19050

Reputation: 160

I would recommend UTF-16 for any kind of data manipulation and UI. The Mac OS X and Win32 APIs use UTF-16, as do wxWidgets, Qt, ICU, Xerces, and others. UTF-8 might be better for data interchange and storage. See http://unicode.org/notes/tn12/.

But whatever you choose, I would definitely recommend against std::string with UTF-8 "only when necessary".

Go all the way with either UTF-16 or UTF-8, but do not mix and match; that is asking for trouble.
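
If you do have to cross between the two, convert explicitly at the boundary so the mixing stays contained. A minimal sketch using std::wstring_convert with std::codecvt_utf8_utf16 from <codecvt> (C++11; the facility was deprecated in C++17, so treat this as illustrative rather than a recommendation):

    #include <codecvt>
    #include <locale>
    #include <string>

    // Convert between UTF-8 (std::string) and UTF-16 (std::u16string) at the
    // program's boundary, so the internal representation stays uniform.
    std::u16string utf8_to_utf16(const std::string& utf8)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        return conv.from_bytes(utf8);
    }

    std::string utf16_to_utf8(const std::u16string& utf16)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        return conv.to_bytes(utf16);
    }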

Upvotes: 2

MSalters

Reputation: 179991

I've actually written a widely used application (5 million+ users), so every kilobyte used adds up, literally. Despite that, I just stuck to wxString. I've configured it to be derived from std::wstring, so I can pass them to functions expecting a std::wstring const&.

Please note that std::wstring is native UTF-32 on the Mac (wchar_t is 4 bytes, so no surrogate pairs are needed for characters at or above U+10000). The big advantage of this is that i++ always gets you the next character. On Win32 that is true in only 99.9% of cases. As a fellow programmer, you'll understand how little 99.9% is.

But if you're not convinced, write a function to uppercase a UTF-8 std::string and another for a std::wstring. Those two functions will tell you which way is insanity.
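
To make that concrete, here is roughly what the std::wstring half looks like; a sketch that assumes a 4-byte wchar_t and ignores locale and special-casing rules (ß, Turkish dotless i, and so on), which is exactly where the UTF-8 version gets painful, since it has to decode multi-byte sequences before it can even ask the question:

    #include <cwctype>
    #include <string>

    // Uppercase a std::wstring. With a 4-byte wchar_t each element is a full
    // code point, so a plain loop is enough (locale and special cases aside).
    std::wstring to_upper(std::wstring s)
    {
        for (wchar_t& c : s)
            c = static_cast<wchar_t>(std::towupper(static_cast<wint_t>(c)));
        return s;
    }

    // The UTF-8 equivalent over std::string has to decode each multi-byte
    // sequence, map the code point, and re-encode it -- the exercise above.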

Your on-disk format is another matter. For portability, that should be UTF-8. There's no endianness concern in UTF-8, nor a discussion over the width (2/4). This may be why many programs appear to use UTF-8.

On a slightly unrelated note, please read up on Unicode string comparisons and normalization. Or you'll end up with the same bug as .NET, where you can have two variables föö and föö differing only in (invisible) normalization.
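
For example, the precomposed and decomposed spellings of the same text are different byte sequences, so a naive comparison calls them different strings:

    #include <iostream>
    #include <string>

    int main()
    {
        // Same visible text "föö" as raw UTF-8 bytes:
        // precomposed  U+00F6 (0xC3 0xB6) vs.
        // decomposed   'o' + combining diaeresis U+0308 (0xCC 0x88).
        std::string precomposed = "f\xC3\xB6\xC3\xB6";
        std::string decomposed  = "fo\xCC\x88o\xCC\x88";

        // Byte-wise comparison says they differ; only a normalization-aware
        // comparison (e.g. via ICU) treats them as the same text.
        std::cout << std::boolalpha << (precomposed == decomposed) << '\n';  // false
    }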

Upvotes: 4

Branan

Reputation: 1819

From what I've read, it's better to use a 16-bit encoding internally unless you're short on memory: a single 16-bit code unit covers almost every character used in living languages.

I'd also look at ICU. If you're not going to be using certain STL features of strings, using the ICU string types might be better for you.
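
If you go that route, this is roughly what icu::UnicodeString buys you; a sketch assuming ICU's C++ API (<unicode/unistr.h>, link against icuuc), so double-check the names against the ICU version you use:

    #include <iostream>
    #include <string>
    #include <unicode/unistr.h>   // icu::UnicodeString

    int main()
    {
        // ICU strings are UTF-16 internally but convert to/from UTF-8 easily.
        // The literal below is the UTF-8 byte sequence for "grüße".
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("gr\xC3\xBC\xC3\x9F" "e");

        std::cout << "code points: " << text.countChar32() << '\n';

        text.toUpper();            // locale-sensitive overloads also exist

        std::string utf8;
        text.toUTF8String(utf8);   // back to UTF-8 for storage/interchange
        std::cout << utf8 << '\n';
    }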

Upvotes: 1

Nemanja Trifunovic

Reputation: 24561

If you decide to go with UTF-8 encoding, check out this library: http://utfcpp.sourceforge.net/

It may make your life much easier.
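
A taste of the API, assuming the function names documented by utfcpp (utf8::distance and utf8::next); verify against the version you download:

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include "utf8.h"   // from utfcpp

    int main()
    {
        std::string text = "na\xC3\xAFve";   // UTF-8 bytes for "naïve"

        // Number of code points rather than bytes.
        std::cout << utf8::distance(text.begin(), text.end()) << '\n';  // 5

        // Walk the string code point by code point.
        for (auto it = text.begin(); it != text.end(); )
        {
            std::uint32_t cp = utf8::next(it, text.end());
            std::cout << std::hex << cp << ' ';
        }
        std::cout << '\n';
    }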

Upvotes: 5

Ferruccio

Reputation: 100718

Have you considered using wxStrings? If I remember correctly, they can do UTF-8 <-> Unicode conversions, and that will make it a bit easier when you have to pass strings to and from the UI.
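
Something along these lines, if I remember the API correctly (wxString::FromUTF8 and utf8_str() are the conversion points; check against your wxWidgets version):

    #include <string>
    #include <wx/string.h>

    // Convert at the UI boundary: keep UTF-8 in the core, hand wxString to wx.
    wxString to_ui(const std::string& utf8)
    {
        return wxString::FromUTF8(utf8.c_str());
    }

    std::string from_ui(const wxString& s)
    {
        return std::string(s.utf8_str());   // ToUTF8() works as well
    }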

Upvotes: 0

Nick Johnson

Reputation: 101149

UTF-16 is still a variable-length encoding (there are more than 2^16 Unicode code points), so you can't do O(1) string indexing operations. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text consists mostly of code points in the U+0800-U+FFFF range (three bytes per character in UTF-8 but only two in UTF-16), UTF-16 can be a substantial improvement in size. UCS-2 is a fixed-length variation of UTF-16, at the cost of prohibiting any code points above U+FFFF.
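
To see why indexing by code unit isn't the same as indexing by character, here's a small sketch that counts code points in a UTF-16 std::u16string by skipping the trail half of each surrogate pair:

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Count code points in a UTF-16 string: a lead surrogate (0xD800-0xDBFF)
    // and its trail surrogate together encode one code point above U+FFFF.
    std::size_t utf16_length(const std::u16string& s)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
        {
            ++count;
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)   // lead surrogate
                ++i;                                // skip the trail surrogate
        }
        return count;
    }

    int main()
    {
        std::u16string s = u"A\U0001F600";   // 'A' plus an emoji outside the BMP
        std::cout << s.size() << " code units, "
                  << utf16_length(s) << " code points\n";   // 3 vs. 2
    }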

Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.

Upvotes: 26

Vargen

Reputation: 726

I have never found any reason to use anything other than UTF-8, to be honest.

Upvotes: 6
