Nick

Reputation: 10559

Can I store UTF-8 in a C-style char array?

Follow-up to:
Can UTF-8 contain zero byte?

Can I safely store a UTF-8 string in a zero-terminated char *?

I understand strlen() will return the number of bytes rather than the number of characters, but storing, printing and transferring the char array seems to be safe.

Upvotes: 1

Views: 1467

Answers (2)

Joop Eggen

Reputation: 109613

In C a 0 byte is the string terminator. UTF-8 only produces a 0 byte when encoding the code point U+0000, so as long as U+0000 is not in the Unicode string there is no problem.
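To illustrate, here is a minimal sketch of how the byte length and the code point count differ for such a string. The helper name `utf8_count_code_points` is made up for this example, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count Unicode code points in a NUL-terminated
   UTF-8 string. Assumes valid UTF-8 input. Every byte that is not a
   continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the string `"na" "\xC3\xAF" "ve"` ("naïve", with the ï encoded as two bytes), strlen() reports 6 bytes while the helper reports 5 code points. None of the bytes is 0, so the terminator is still unambiguous.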

To be able to store U+0000 in such a string, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte of which has its high bit set, i.e. is >= 128); U+0000 becomes the two bytes 0xC0 0x80. This is done in Java for some APIs, like DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.

Formally this is no longer UTF-8, as UTF-8 requires the shortest encoding. It also only works if, after unpacking to something other than modified UTF-8, the length is determined by some means other than strlen.
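A rough sketch of that idea in C, assuming the caller already knows the input length. The function name `utf8_escape_nul` and the buffer-size contract are invented for this example and are not taken from any library.

```c
#include <stddef.h>

/* Hypothetical helper: copy len bytes of UTF-8 from src to dst, writing
   each 0x00 byte as the overlong pair 0xC0 0x80 (as modified UTF-8 does
   for U+0000), then append a NUL terminator.
   Assumption: dst has room for at least 2 * len + 1 bytes.
   Returns the number of bytes written, not counting the terminator. */
size_t utf8_escape_nul(const char *src, size_t len, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] == '\0') {
            dst[out++] = (char)0xC0;
            dst[out++] = (char)0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return out;
}
```

Feeding it the three bytes 'a', 0, 'b' gives the four bytes 'a', 0xC0, 0x80, 'b' plus a terminator, so strlen() on the result works again. The decoder has to reverse the mapping, and a strict UTF-8 validator will reject the 0xC0 0x80 pair.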

So the most feasible solution is not to accept U+0000 in strings.

Upvotes: 1

unwind

Reputation: 400129

Yes.

Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII).

As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
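If the data arrives with an explicit length (from a file or the network, say), you can check that condition before treating it as a C string. A small sketch; the name `fits_in_c_string` is just for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if the len bytes at buf contain no
   zero byte, i.e. they can be stored as a NUL-terminated C string
   without being cut short at an embedded NUL. */
bool fits_in_c_string(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) == NULL;
}
```

For example, the 5 bytes of `"ab\0cd"` fail the check, and strlen() on them would report 2, silently dropping "cd". The 6 bytes of `"h\xC3\xA9llo"` pass.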

Upvotes: 3
