Reputation: 10559
Can UTF-8 contain a zero byte?
Can I safely store a UTF-8 string in a zero-terminated char *?
I understand that strlen() will return the number of bytes rather than the number of characters, but storing, printing and transferring the char array seems to be safe.
Upvotes: 1
Views: 1467
Reputation: 109613
In C a 0 byte is the string terminator. As long as the Unicode code point 0 (U+0000) is not in the Unicode string, there is no problem.
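A minimal sketch of that check, assuming the data arrives as a buffer with an explicit length. The function name fits_c_string is made up for this example. In valid UTF-8 a 0 byte can only come from U+0000, so scanning for one is enough:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if the first len bytes of buf contain
   no 0 byte, i.e. the data can be stored in a zero-terminated char *
   without being cut short at an embedded U+0000. */
bool fits_c_string(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, 0, len) == NULL;
}
```

For instance, the five bytes of "caf\xC3\xA9" pass this check, while the three bytes 'a', 0, 'b' do not.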
To be able to store U+0000 in such a string, one may use modified UTF-8, which converts not only code points >= 128 but also 0 to a multi-byte sequence (every byte of which has its high bit set, i.e. is >= 128); U+0000 becomes the two bytes 0xC0 0x80. This is done in Java for some APIs, such as DataOutputStream.writeUTF. It ensures you can transmit strings with an embedded 0.
Formally this is no longer UTF-8, because UTF-8 requires the shortest encoding. Also, once such data is unpacked to a non-UTF-8 representation the 0 reappears, so the length then has to be determined by some means other than strlen.
So the most feasible solution is not to accept U+0000 in strings.
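For illustration, here is a rough sketch of how modified UTF-8 maps a single UTF-16 code unit to bytes, following the scheme documented for DataOutputStream.writeUTF. The function name mutf8_encode_unit is invented for this example, and surrogate pairs are simply encoded unit by unit, as that format does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UTF-16 code unit as modified UTF-8.
   Writes 1 to 3 bytes into out (which must hold at least 3 bytes)
   and returns the number of bytes written. */
size_t mutf8_encode_unit(uint16_t u, unsigned char *out)
{
    assert(out != NULL);
    if (u >= 0x0001 && u <= 0x007F) {
        /* Plain ASCII except 0: one byte, unchanged. */
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u <= 0x07FF) {
        /* Two bytes. This branch also handles u == 0, giving C0 80. */
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    /* Everything else in the 16-bit range: three bytes. */
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}
```

Encoding 0 this way yields the bytes C0 80, so the result never contains a 0 byte, but a strict UTF-8 decoder is entitled to reject it as an overlong sequence.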
Upvotes: 1
Reputation: 400129
Yes.
Just like with ASCII and similar 8-bit encodings before Unicode, you can't store the NUL character in such a string (U+0000 is the Unicode code point NUL, very much like in ASCII).
As long as you know your strings don't need to contain that (and regular text doesn't), it's fine.
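As for strlen(): it still works on such a string, it just counts bytes. If you need the number of code points, a small sketch like the following would do, assuming the input is valid UTF-8 (the name utf8_codepoints is made up here):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a zero-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; it does no validation. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With "caf\xC3\xA9" (the word "café"), strlen() gives 5 while utf8_codepoints() gives 4. Storing, copying and printing the bytes is unaffected either way.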
Upvotes: 3