Reputation: 11562
From what I understand, it is very rare for UTF-8 strings to contain embedded NUL characters. However, a person can put one into a Unicode string explicitly with "X\0Y" or something like that, and apparently the Unicode standard supports an embedded NUL in this way. As far as I can see, there is no other use of NULs, or at least no common use of them, in UTF-8 encoded text.
So, the question is: if I allow users of my software to supply any UTF-8 string, am I taking a significant risk by processing those strings with functions that assume strings are NUL-terminated? What I really don't know is how often I might encounter an embedded NUL "in the wild".
Upvotes: 3
Views: 110
Reputation: 12708
The ASCII NUL character used to terminate a string encodes in UTF-8 as the single byte 0x00, so there is no problem as long as you treat your UTF-8 C strings as ending at the first U+0000 code point. Whether or not you decode that 0x00 byte into code point U+0000 as part of the string is up to you; the C string functions require the string to be null-terminated either way. But if you are going to allow code point U+0000 inside a Unicode string, you need some other way to distinguish the null byte (0x00) that terminates the string from the UTF-8 encoding of an embedded U+0000, which is also 0x00, so that the data passes through transparently. I've seen software that keeps the end of the string unambiguous by using an invalid UTF-8 sequence such as 0xc0 0x80, which would decode to code point U+0000 under the standard decoding algorithm but is not allowed as a UTF-8 encoding of it (it is an overlong form). In that case you can encode strings with embedded U+0000 code points. You can treat 0xc0 0x80 as an escape that stands for U+0000 inside the string, and then you can still do things like get the length (in bytes, not in code points) of the string with strlen(). Alternatively, you can use 0xc0 0x80 as the final string delimiter, in which case strlen() no longer gives you the length.
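As a rough sketch of the escape approach described above, the function below copies a length-delimited UTF-8 buffer and writes each embedded 0x00 byte as the pair 0xC0 0x80, so that the result can end with a single real terminator. The name escape_embedded_nul and the buffer-size contract are assumptions made for illustration, not part of any standard library. The output is deliberately not valid standard UTF-8, so a strict decoder will reject it unless it reverses the escape first; Java's "modified UTF-8" uses this same convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of UTF-8 from `src` into `dst`,
 * replacing every 0x00 byte with the overlong pair 0xC0 0x80, then append
 * one real 0x00 terminator. `dst` must hold at least 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t escape_embedded_nul(const unsigned char *src, size_t len,
                           unsigned char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;   /* escape for U+0000 */
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i]; /* every other byte passes through */
        }
    }
    dst[out] = 0x00;             /* the only zero byte in the output */
    return out;
}
```

After this transformation, strlen() on the output reports the full byte length, because the only zero byte left is the terminator.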
Upvotes: 2
Reputation: 120059
Definition (from any C standard):
A string is a contiguous sequence of characters terminated by and including the first null character
It follows that there are no embedded null characters in C strings.
Your users cannot put a null character in the middle of a string. It's physically and logically impossible. The string ends where the null character is, by definition.
A user can put a null character in the middle of a character array (which is not a string) or in the middle of a file (which is also not a string). How to deal with those is up to you. The encoding is irrelevant. UTF-8 does not pose any additional challenges in this regard compared to any other encoding.
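To make the array-versus-string distinction concrete, here is a minimal sketch that checks a length-delimited buffer for a zero byte before it is handed to string functions. The helper name has_embedded_nul is hypothetical; memchr is the standard library call doing the work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if any of the first `len` bytes of
 * `buf` is a zero byte. The buffer is treated as a plain character
 * array with an explicit length, not as a C string. */
bool has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

For a three-byte array holding 'X', '\0', 'Y' this returns true, while strlen() on the same bytes returns 1, since the string ends at the first null character. Whether to reject, truncate, or escape such input is a policy decision for the application.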
Some multibyte encodings allow zero bytes in characters that are not the null character. If you use such an encoding, you may need to be a little extra careful not to confuse a zero byte with the null character. UTF-8 is not one of those. A zero byte always represents the null character in UTF-8.
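A small illustration of the difference, using UTF-16LE as an assumed example of an encoding where zero bytes appear inside ordinary characters (the answer does not name a specific encoding). The helper count_zero_bytes and the two sample arrays are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the zero bytes in the first `len` bytes. */
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    size_t zeros = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            zeros++;
    }
    return zeros;
}

/* The text "A\u00E9" (A followed by e-acute) in UTF-16LE: each 16-bit
 * code unit has a zero high byte, although neither character is the
 * null character. */
static const unsigned char utf16le_text[] = { 0x41, 0x00, 0xE9, 0x00 };

/* The same text in UTF-8: no zero bytes at all. */
static const unsigned char utf8_text[] = { 0x41, 0xC3, 0xA9 };
```

Counting zero bytes gives 2 for the UTF-16LE array and 0 for the UTF-8 array, which is why byte-oriented functions such as strlen() are safe on UTF-8 text but not on UTF-16 data.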
Upvotes: 10