Reputation: 444
Practical question: I'm working on a small application that runs on two separate hardware platforms.
The compilation method and its configuration are defined and controlled by me.
My app receives UTF-8/ISO-8859 text and should perform some basic manipulation on the strings (copying, searching, etc.).
The thing is, one compiler is GCC (sizeof(wchar_t) == 4) and the other is MinGW (sizeof(wchar_t) == 2).
In order to support all of UTF-8, I was thinking of typedef'ing wchar_t in my code to be uint32_t, which would force the MinGW build into line with the GCC one and cover every code point.
I'm then planning to use the wide-character functions provided by the standard library (mbstowcs, wcscmp, wcscpy, etc.).
The question is: could "forcing" the compiler to use a wider type have some bad impact (besides performance) on how the library functions behave? Will mbstowcs even work after the change?
I tried using ICU, but it is a very large library, which is a deal-breaker. I need something small and reliable.
Thanks
Upvotes: 0
Views: 3015
Reputation: 213228
Here are your options for string manipulation:
Use unsigned char (or char) and UTF-8. All the regular string manipulation functions work (like strlen(), strstr(), snprintf(), etc.).
Use wchar_t and use a different encoding on different platforms (Win32 uses UTF-16, OS X and Linux use UTF-32). This is a path of madness, since you have to support two different encodings in the same code base.
Use UTF-32 or UTF-16 and your own string manipulation functions. This is a lot of work, but it is portable.
Use ICU and UTF-16.
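To give an idea of what "your own string manipulation functions" involves for the UTF-32 option, here is a minimal sketch of a decoder that turns one UTF-8 sequence into a uint32_t code point. The name utf8_decode and its return convention are made up for illustration; it assumes the input is a byte buffer of known length and is not a complete library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s (at most len bytes)
 * into *out. Returns the number of bytes consumed, or 0 if the input is
 * truncated or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t cp;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                    /* ASCII, one byte */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: two bytes */
        need = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: three bytes */
        need = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: four bytes */
        need = 4; cp = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    if (len < need)
        return 0;                         /* truncated sequence */

    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_value[need])
        return 0;                         /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                         /* out of range or surrogate */

    *out = cp;
    return need;
}
```

Using uint32_t directly, rather than redefining wchar_t, keeps the storage the same width on both compilers. You would still need your own compare, copy and search routines on top of this, which is where the "lot of work" comes from.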
For most purposes, manipulating strings in UTF-8 works very well. It depends on what your program does. If you are doing things like parsing and templating, UTF-8 is easy to work with. If you need more sophisticated functionality, such as iterating over break points or finding grapheme cluster boundaries, then you will need a library like Glib (which uses UTF-8) or ICU (which uses UTF-16).
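As a small illustration of why plain UTF-8 is easy for searching: strstr() compares bytes, and a valid UTF-8 sequence cannot match starting in the middle of another character, so a byte-wise search finds whole characters. A minimal sketch, where utf8_find is a hypothetical wrapper name and both strings are assumed to be valid, NUL-terminated UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: byte offset of the first occurrence of needle
 * in haystack, or -1 if it does not occur. Both arguments are assumed
 * to be valid NUL-terminated UTF-8. */
static long utf8_find(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit != NULL ? (long)(hit - haystack) : -1;
}
```

For example, searching "naïve café" for "café" returns 7, because the "ï" before it occupies two bytes. The result is a byte offset, which is exactly what you need to pass back to functions like memcpy() or to pointer arithmetic.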
You may be used to indexing strings using character / code point indexes. Get used to indexing strings using code unit indexes instead: strlen() returns the number of bytes, not the number of characters. However, it is very rare to actually need to index a string by character position.
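If you do occasionally need a code point count, it can be derived from the bytes by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). A minimal sketch, assuming valid NUL-terminated UTF-8; utf8_count_code_points is a name invented for this example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a valid,
 * NUL-terminated UTF-8 string. Every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "café" encoded as UTF-8, strlen() gives 5 while this function gives 4. Note that this counts code points, not user-perceived characters; grapheme clusters still need a library as mentioned above.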
Upvotes: 6