user81993

Reputation: 6609

Length of a UTF-16 string as a UTF-8 string

I have a UTF-16 wchar_t* that I need to convert and dump into a UTF-8 char*. I'm using std::wcstombs to do it, with the length of the wchar_t* string as the max length.

I'm a bit fuzzy on how UTF encoding works, though. IIRC, a single character can take up multiple bytes, in which case I could lose some characters by doing it like that.

Currently the characters that could come up are pretty limited and would probably fit even in the ASCII charset, but later on I'm planning to allow more, such as öäõü and the like. Am I going to run into a problem there? If so, how would I measure the length of the buffer I need to allocate?
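Roughly what I'm doing at the moment, for reference; this is just a sketch, the function name and error handling are placeholders, and it assumes the current locale actually converts to UTF-8:

    #include <cstdlib>  // std::wcstombs
    #include <cwchar>   // std::wcslen

    // The output buffer is sized from the number of wchar_t units, which is
    // what I'm worried may be too small once multi-byte UTF-8 characters
    // show up. std::wcstombs converts to the current locale's encoding, so
    // this only yields UTF-8 under a UTF-8 locale.
    char* dump_utf8(const wchar_t* src) {
        std::size_t len = std::wcslen(src);   // number of wchar_t units
        char* out = new char[len + 1];        // one byte per unit + '\0'
        std::size_t written = std::wcstombs(out, src, len);
        if (written != static_cast<std::size_t>(-1))
            out[written] = '\0';              // silently truncated if written == len
        return out;
    }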

Upvotes: 1

Views: 1644

Answers (1)

rici

Reputation: 241791

Codepoints in the BMP ("Basic Multilingual Plane", i.e. those whose values are not greater than 0xFFFF) require one UTF-16 codeunit or up to three UTF-8 codeunits. Outside of the BMP, a codepoint requires two UTF-16 codeunits (a surrogate pair) or four UTF-8 codeunits.

If your wchar_t is two bytes (UTF-16), then in the worst case the UTF-8 string could require three bytes for an individual wchar_t (that is, 50% more memory), and four bytes for a surrogate pair (that is, the same amount of memory).

If your wchar_t is four bytes (UTF-32), though, non-BMP characters will only require one wchar_t, so the worst case is four bytes for every wchar_t, which is the same amount of memory.
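A sketch of that arithmetic, assuming the input is well-formed UTF-16 (two-byte wchar_t) or UTF-32 (four-byte wchar_t); utf8_length is just an illustrative name:

    #include <cstddef>
    #include <cstdint>

    // Counts the exact number of UTF-8 bytes the wide string will need:
    // 1-3 bytes per BMP codepoint, 4 bytes per codepoint outside the BMP
    // (which appears as a surrogate pair when wchar_t is two bytes).
    std::size_t utf8_length(const wchar_t* s) {
        std::size_t bytes = 0;
        while (*s) {
            std::uint32_t c = static_cast<std::uint32_t>(*s++);
            if (sizeof(wchar_t) == 2 && c >= 0xD800 && c <= 0xDBFF && *s) {
                ++s;          // skip the low surrogate of the pair
                bytes += 4;   // non-BMP codepoint
            }
            else if (c < 0x80)    bytes += 1;   // ASCII
            else if (c < 0x800)   bytes += 2;   // e.g. öäõü
            else if (c < 0x10000) bytes += 3;   // rest of the BMP
            else                  bytes += 4;   // non-BMP, four-byte wchar_t
        }
        return bytes;
    }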

Only allowing one byte for each wchar_t will definitely get you into trouble. That will only work if you have no characters outside of the basic ASCII character set.
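In other words, allocating three UTF-8 bytes per wchar_t when wchar_t is two bytes, or four when it is four bytes, plus one for the terminator, can never come up short. As a sketch (the function name is arbitrary):

    #include <cstdlib>  // std::malloc
    #include <cwchar>   // std::wcslen

    // Worst-case sizing from the figures above: 3 bytes per two-byte
    // wchar_t, 4 bytes per four-byte wchar_t, plus one for the final '\0'.
    char* alloc_utf8_worst_case(const wchar_t* src) {
        std::size_t per_unit = (sizeof(wchar_t) == 2) ? 3 : 4;
        return static_cast<char*>(std::malloc(std::wcslen(src) * per_unit + 1));
    }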

Upvotes: 3
