Ricky Stewart

Reputation: 1152

Converting string in host character encoding to Unicode in C

Is there a way to portably (that is, conforming to the C standard) convert strings in the host character encoding to an array of Unicode code points? I'm working on some data serialization software, and I've got a problem because while I need to send UTF-8 over the wire, the C standard doesn't guarantee the ASCII encoding, so converting a string in the host character encoding can be a nontrivial task.

Is there a library that takes care of this kind of stuff for me? Is there a function hidden in the C standard library that can do something like this?

Upvotes: 0

Views: 528

Answers (1)

Jonathan Leffler

Reputation: 754550

The C11 standard, ISO/IEC 9899:2011, has a new header <uchar.h> with rudimentary facilities to help. It is described in §7.28, Unicode utilities <uchar.h>.

There are two pairs of functions defined:

  • c16rtomb() and mbrtoc16() — using type char16_t aka uint_least16_t.
  • c32rtomb() and mbrtoc32() — using type char32_t aka uint_least32_t.

The r in the name is for 'restartable'; the functions are intended to be called iteratively. The mbrtoc{16,32}() pair converts from a multibyte code set (hence the mb) to either char16_t or char32_t. The c{16,32}rtomb() pair converts from either char16_t or char32_t to a multibyte character sequence.
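As a rough illustration of the iterative calling pattern, here is a minimal sketch that decodes a string in the current locale's multibyte encoding into an array of char32_t values using mbrtoc32(). The helper name host_to_code_points and its error convention are made up for this example. Note also that the values are only guaranteed to be Unicode code points if the implementation defines __STDC_UTF_32__, so treat this as a sketch under that assumption rather than a fully portable solution.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not part of any library): decode the NUL-terminated
 * multibyte string s, interpreted in the current locale's encoding, into
 * out[0..cap-1]. Returns the number of char32_t values written, or
 * (size_t)-1 on a decoding error or if out is too small.
 * Assumes __STDC_UTF_32__ is defined, so the values are UTF-32. */
size_t host_to_code_points(const char *s, char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        char32_t c;
        size_t rc = mbrtoc32(&c, s, remaining, &state);

        if (rc == (size_t)-1 || rc == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (rc == 0)
            break;                      /* decoded the null character */
        if (count == cap)
            return (size_t)-1;          /* output buffer too small */

        out[count++] = c;
        if (rc != (size_t)-3) {         /* (size_t)-3: no input bytes consumed */
            s += rc;
            remaining -= rc;
        }
    }
    return count;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that the host's encoding is in effect; without that call the program runs in the "C" locale. The resulting code points still have to be encoded as UTF-8 separately before going on the wire, since c32rtomb() converts back to the locale's encoding, which need not be UTF-8.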

I'm not sure whether they'll do what you want. The <uchar.h> header and hence the functions are not available on Mac OS X 10.9.1 with either the Apple-provided clang or with the 'home-built' GCC 4.8.2, so I've not had a chance to investigate them. The header does appear to be available on Linux (Ubuntu 13.10) with GCC 4.8.1.

I think ICU is likely a better choice. It is a rather large library, but that is because it does a thorough job of supporting Unicode and different locales in general.

Upvotes: 2
