Reputation: 3243
We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string encoding is UTF-16 on Windows and UTF-8 on Linux.
We can't decide which is the better approach:
Standardise on one of the two in all our application logic (and persistent data), and make the other platform do the appropriate conversions
Use the natural format for each OS in the application logic (and thus for calls into the OS), and convert only at the point of IPC and persistence.
To me, both seem about as good as each other.
Upvotes: 10
Views: 2855
Reputation: 3243
This seems to be quite enlightening on the topic. http://www.utf8everywhere.org/
Upvotes: 0
Reputation: 7593
and UTF-8 on Linux.
That is mostly true for modern Linux, but the actual encoding depends on which API or library is used. Some are hardcoded to use UTF-8, while others read the LC_ALL, LC_CTYPE or LANG environment variables to detect the encoding to use (the Qt library, for example). So be careful.
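As a rough sketch of that precedence (my own illustration, not how Qt actually does it): LC_ALL wins over LC_CTYPE, which wins over LANG, and the encoding is the part after the dot in a value like "en_US.UTF-8".

#include <cstdlib>
#include <string>

// Sketch: walk LC_ALL, LC_CTYPE, LANG in order and pull the encoding
// suffix out of a value such as "en_US.UTF-8".
std::string guess_locale_encoding() {
    for (const char* var : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char* value = std::getenv(var);
        if (value && *value) {
            std::string s(value);
            std::size_t dot = s.find('.');
            return dot == std::string::npos ? std::string() : s.substr(dot + 1);
        }
    }
    return std::string(); // nothing set: the encoding is unspecified
}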
We can't decide which is the better approach
As usual it depends.
If 90% of the code deals with a platform-specific API in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.
If 90% of the code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.
In the second case you have a choice:
If working with strings is a significant part of your application, choosing a good strings library is the right move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.
When using a strings library you need to care about encoding only when working with external libraries, the platform API, or when sending strings over the network (or to disk). For example, many Cocoa, C# or Qt programmers (all of which have solid string support) know very little about encoding details, and that is a good thing, since they can focus on their main task.
My experience of working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (but one probably needs some experience and a Unicode background to appreciate that).
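To show what I mean by the bare-pointers style, here is a minimal sketch of my own (assuming the buffer holds valid, NUL-terminated UTF-8) that counts code points by walking the bytes directly:

#include <cstddef>

// Counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_length(const char* s) {
    std::size_t n = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    return n;
}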
I agree that the bare-pointers approach is not for everyone. It is good when:
From my somewhat specific experience, that is actually a very common case.
When working with bare pointers it is good to choose one encoding that will be used in the entire project (or in all projects).
From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a strings library or the platform API for strings; it will save you a lot of time.
Advantages of UTF-8:
(*) Until you need to compare them lexically, transform case (toUpper/toLower), change normalization form or something like that; if you do, use a strings library or the platform API.
The disadvantage is questionable:
So, I recommend using UTF-8 as the common encoding for projects that don't use any strings library.
But encoding is not the only question you need to answer.
There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as a single code point or as a combination of several code points. The common problem is that most string comparison functions treat these representations as different characters. If you are working on a cross-platform project, choosing one of the normalization forms as a standard is the right move; it will save you time.
For example, if a user password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered under Windows with such a password, it will be a problem for them to log in under a Mac.
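A tiny illustration of the byte-level difference (my own example, with the UTF-8 bytes written out explicitly): the visible letter "é" has a precomposed (NFC) and a decomposed (NFD) form, and a naive comparison sees two different strings.

#include <iostream>
#include <string>

int main() {
    std::string nfc = "\xC3\xA9";       // U+00E9, precomposed "é" (NFC): 2 bytes
    std::string nfd = "e\xCC\x81";      // U+0065 'e' + U+0301 combining acute (NFD): 3 bytes
    std::cout << (nfc == nfd) << "\n";  // prints 0: byte-wise they are not equal
}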
In addition, I would not recommend using wchar_t (or would use it only in Windows code as a UCS-2/UTF-16 character type). The problem with wchar_t is that there is no encoding associated with it. It's just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
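You can check the difference yourself with a one-liner; the result depends on platform and compiler:

#include <cstdio>

int main() {
    // Typically prints 2 on Windows (UTF-16 code units) and 4 on most *nix (UTF-32).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}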
Upvotes: 6
Reputation: 70411
C++11 provides the new string types u16string and u32string. Depending on the support your compiler versions deliver, and the expected life expectancy, it might be an idea to stay forward-compatible with those.
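For example, a minimal use of the C++11 types and their literal prefixes:

#include <string>

std::u16string s16 = u"Gr\u00FC\u00DFe";   // "Grüße" as UTF-16 code units (char16_t)
std::u32string s32 = U"Gr\u00FC\u00DFe";   // "Grüße" as UTF-32 code points (char32_t)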
Other than that, using the ICU library is probably your best shot at cross-platform compatibility.
Upvotes: 0
Reputation: 7949
Programming with UTF-8 is difficult because lengths and offsets are in bytes, not characters. e.g.
std::string s = Something();   // Something() stands in for code that produces UTF-8 text
std::cout << s.substr(0, 4);   // takes the first 4 bytes
does not necessarily give you the first 4 characters.
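If you do keep UTF-8 internally, a prefix has to be taken in code points rather than bytes. A minimal sketch of my own (assuming the string holds valid UTF-8):

#include <string>

// Returns the first n code points (not bytes) of a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
std::string utf8_prefix(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;                            // consume the lead byte
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;                        // consume this character's continuation bytes
        --n;
    }
    return s.substr(0, i);
}

With that, utf8_prefix(s, 4) does what the substr call above was meant to do.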
I would use whatever a wchar_t is. On Windows that will be UTF-16. On some *nix platforms it might be UTF-32.
When saving to a file, I would recommend converting to UTF-8. That often makes the file smaller, and removes any platform dependencies due to differences in sizeof(wchar_t) or to byte order.
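On Windows, where wchar_t holds UTF-16, that conversion at the file boundary might look roughly like this (a sketch using the C++11 <codecvt> facilities, which are deprecated since C++17 but still widely available; a strings library such as ICU is the longer-term option):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Sketch: convert UTF-16 wchar_t text to UTF-8 before writing, so the file
// has no dependency on sizeof(wchar_t) or byte order.
void save_utf8(const std::wstring& text, const char* path) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    std::ofstream(path, std::ios::binary) << conv.to_bytes(text);
}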
Upvotes: -1
Reputation: 49283
I'd use the same encoding internally, and normalize the data at the entry points. This will involve less code and fewer gotchas, and will allow you to use the same cross-platform library for string processing.
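As an illustration of normalizing at the entry point, here is a sketch assuming ICU's Normalizer2 API (other string libraries offer equivalent calls):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Sketch: convert incoming text to Normalization Form C once, at the boundary,
// so internal comparisons see a single canonical representation.
icu::UnicodeString toNFC(const icu::UnicodeString& input) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return input;   // fall back to the original on error
    return nfc->normalize(input, status);
}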
I'd use Unicode in UTF-16 because it's simpler to handle internally and should perform better because of the (mostly) fixed length per character. UTF-8 is ideal for output and storage because it's backward compatible with ASCII and only uses 8 bits per character for English text. But inside the program, 16-bit units are simpler to handle.
Upvotes: 0