Reputation: 1152
Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.
Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?
Upvotes: 0
Views: 528
Reputation: 754550
The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in section §7.28 Unicode utilities <uchar.h>.
There are two pairs of functions defined:
c16rtomb() and mbrtoc16(), using type char16_t, aka uint_least16_t.
c32rtomb() and mbrtoc32(), using type char32_t, aka uint_least32_t.
The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair convert from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair convert from either char16_t or char32_t to a multibyte character sequence.
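As a rough illustration of the iterative calling pattern, here is a minimal sketch that walks a multibyte string with mbrtoc32() and collects the resulting char32_t values. The helper name host_to_codepoints() and its error convention are made up for this example, not part of any library. Note also that the stored values are only guaranteed to be Unicode code points (UTF-32) when the implementation defines the macro __STDC_UTF_32__, so treat that as an assumption to check on your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not a standard function): decode the multibyte
 * string s, interpreted in the current locale's encoding, into out[].
 * Returns the number of values written, or (size_t)-1 if the input is
 * invalid or truncated, or if out[] is too small.
 * Assumption: the values are Unicode code points only if the
 * implementation defines __STDC_UTF_32__. */
size_t host_to_codepoints(const char *s, char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);      /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        char32_t c;
        size_t n = mbrtoc32(&c, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete sequence */
        if (n == 0)
            break;                        /* decoded the null character */
        if (count == cap)
            return (size_t)-1;            /* output buffer too small */
        out[count++] = c;

        if (n == (size_t)-3)
            continue;                     /* value stored, no input consumed */
        s += n;                           /* advance past the bytes consumed */
        remaining -= n;
    }
    return count;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that the conversion uses the environment's encoding rather than the default "C" locale. Turning the resulting values into UTF-8 for the wire is then a separate, purely arithmetic step.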
I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.
I think it likely that ICU is a better choice. It is, however, a rather large library (but that is because it does a thorough job of supporting Unicode and different locales in general).
Upvotes: 2