Yarel

Reputation: 444

Forcing wchar_t to be 4 bytes

Practical question: I'm working on a small application that runs on two separate hardware platforms.

The compilation method and its configuration are defined and controlled by me.

My app receives UTF-8/ISO-8859 text and should perform some basic manipulation on the string (copying, searching, etc.).

The thing is, one compiler is GCC (sizeof(wchar_t) == 4) and the other is MinGW (sizeof(wchar_t) == 2).

In order to support all UTF-8 possibilities, I was thinking of typedef'ing wchar_t in my code to be uint32_t, which would force the MinGW build to match the GCC one and cover every UTF-8 code point.

I'm then planning to use the wide-char manipulation functions provided by the standard library (mbstowcs, wcscmp, wcscpy, etc.).

The question is: could "forcing" the compiler to use more room have some bad impact (besides performance) on how the library functions? Will mbstowcs even work after the change?

I tried using ICU, but it is a very large library, which is a deal-breaker. I need something small and reliable.

Thanks

Upvotes: 0

Views: 3015

Answers (1)

Dietrich Epp

Reputation: 213228

Here are your options for string manipulation:

  1. Use unsigned char (or char) and UTF-8. All the regular string manipulation functions work (like strlen(), strstr(), snprintf(), etc.).

  2. Use wchar_t and use a different encoding on different platforms (Win32 uses UTF-16, OS X and Linux use UTF-32). This is a path of madness, since you have to support two different encodings in the same code base.

  3. Use UTF-32 or UTF-16 and your own string manipulation functions. This is a lot of work, but it is portable.

  4. Use ICU and UTF-16.
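
As a rough illustration of option 1, here is a minimal sketch of searching UTF-8 text with the ordinary byte-oriented functions. The helper name utf8_contains is made up for this example, and it assumes both arguments are valid, NUL-terminated UTF-8. A byte-wise search is enough because a valid UTF-8 sequence never begins in the middle of another character's encoding. It does not handle Unicode normalization, so a precomposed "é" will not match "e" followed by a combining accent.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if the UTF-8 string `needle` occurs
 * in the UTF-8 string `haystack`.
 * Assumes both are valid, NUL-terminated UTF-8. No normalization is done. */
bool utf8_contains(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    /* strstr compares bytes; for valid UTF-8 a byte match is a character match. */
    return strstr(haystack, needle) != NULL;
}
```

Copying works the same way: strcpy() and memcpy() move the bytes unchanged, and the only care needed is not to cut a buffer in the middle of a multi-byte sequence.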

For most purposes, manipulating strings in UTF-8 works very well. It depends on what your program does. If you are doing things like parsing and templating, UTF-8 is easy to work with. If you need more sophisticated functionality, such as iterating over break points or finding grapheme cluster boundaries, then you will need a library like GLib (which uses UTF-8) or ICU (which uses UTF-16).
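
To make the parsing case a bit more concrete, here is a small sketch that finds the first field of a delimited UTF-8 string. The function name field_length is invented for this example, and it assumes the delimiter is a plain ASCII character. That assumption is what makes it safe: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so an ASCII delimiter can never be found inside a non-ASCII character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the field that precedes the first
 * `delim` in `s`, or the length of the whole string if `delim` is absent.
 * Assumes `s` is NUL-terminated UTF-8 and `delim` is a non-NUL ASCII char. */
size_t field_length(const char *s, char delim)
{
    assert(s != NULL);
    assert(delim != '\0' && (unsigned char)delim < 0x80);
    const char *end = strchr(s, delim);
    return end ? (size_t)(end - s) : strlen(s);
}
```

The result is a byte count, which is exactly what memcpy() or printf("%.*s", ...) needs to extract the field.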

A note about indexes

You may be used to indexing strings using character / code point indexes. Get used to indexing strings using code unit indexes instead: with UTF-8, strlen() returns the number of bytes, not the number of characters. However, it is very rare to actually need to index a string by character position.
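
For the rare case where a code point count is really needed, a minimal sketch could look like the following. The name utf8_code_points is hypothetical, and the function assumes valid, NUL-terminated UTF-8. It counts every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of code points in a UTF-8 string.
 * Assumes `s` is valid, NUL-terminated UTF-8; malformed input is not detected. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        /* Skip continuation bytes (10xxxxxx); count lead and ASCII bytes. */
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "café" encoded as UTF-8, strlen() reports 5 bytes while this function reports 4 code points. Note that code points are still not the same thing as user-perceived characters; grapheme clusters need a library such as the ones mentioned above.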

Upvotes: 6
