Tyler Durden

Reputation: 11562

If I am using UTF-8 strings is it risky to use standard string handling that assumes null termination?

From what I understand, it is very rare for UTF-8 strings to contain embedded NUL characters. However, a person can put a NUL into a Unicode string explicitly with "X\0Y" or something like that, and apparently the Unicode standard supports an embedded NUL in this way. As far as I can see, NUL is not used outside of this, or at least it has no common use in UTF-8 encoded text.

So, the question is: if I allow users of my software to use any UTF-8 string, am I taking a significant risk by processing those strings with functions that assume strings are NUL-terminated? I guess what I am really asking is how often I might encounter an embedded NUL "in the wild".

Upvotes: 3

Views: 110

Answers (2)

Luis Colorado

Reputation: 12708

The ASCII NUL character used to terminate a C string encodes in UTF-8 as the single byte 0x00, so there is no problem as long as you accept that your UTF-8 C strings end at the first U+0000 code point. Whether you decode that terminating 0x00 byte as code point U+0000 or treat it purely as the terminator is up to you. The C string functions require the string to be null terminated either way.

But if you are going to allow code point U+0000 as part of a Unicode string, then you need some way to distinguish a 0x00 byte that terminates the string from a U+0000 that is part of its content, so that the encoding stays transparent. I've seen software that handles this with an invalid UTF-8 sequence, 0xC0 0x80. Run through the standard decoding algorithm, those two bytes would yield code point U+0000, but they are not allowed as a UTF-8 encoding of it.

If you treat 0xC0 0x80 as an escape that stands for U+0000, you can encode strings with embedded U+0000 code points, keep 0x00 as the terminator, and still do things like get the length of the string (in bytes, not in code points) with strlen(). If you instead use 0xC0 0x80 as the final string delimiter, strlen() no longer gives you that length.
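The 0xC0 0x80 escape described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `escape_nul` is a hypothetical helper name, the input is assumed to be a length-delimited byte buffer of otherwise valid UTF-8, and the caller is assumed to provide the output buffer. Java's "modified UTF-8" uses the same two-byte form for U+0000.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from any library): copy `len` bytes from `src`
 * into `dst`, replacing every 0x00 byte with the overlong pair 0xC0 0x80,
 * then append a real 0x00 terminator.
 * Returns the number of bytes written, not counting the terminator,
 * or (size_t)-1 if `dst_cap` is too small. */
size_t escape_nul(const unsigned char *src, size_t len,
                  unsigned char *dst, size_t dst_cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < len; i++) {
        size_t need = (src[i] == 0x00) ? 2 : 1;

        /* The extra +1 keeps room for the terminator. */
        if (out + need + 1 > dst_cap)
            return (size_t)-1;

        if (src[i] == 0x00) {
            dst[out++] = 0xC0;   /* escape for an embedded U+0000 */
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }

    if (out + 1 > dst_cap)       /* only reachable when len == 0 */
        return (size_t)-1;

    dst[out] = 0x00;             /* the only zero byte in the result */
    return out;
}
```

After escaping, the result contains no zero byte before the terminator, so strlen() and the other str* functions see the whole string. The reverse step is unambiguous because the byte 0xC0 never occurs in valid UTF-8, but any strict UTF-8 validator will reject the escaped form, so it has to be undone before the data leaves your program.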

Upvotes: 2

n. m. could be an AI

Reputation: 120059

Definition (from any C standard):

A string is a contiguous sequence of characters terminated by and including the first null character

It follows that there are no embedded null characters in C strings.

Your users cannot put a null character in the middle of a string. It's physically and logically impossible. The string ends where the null character is, by definition.

A user can put a null character in the middle of a character array (which is not a string) or in the middle of a file (which is also not a string). How to deal with those is up to you. The encoding is irrelevant. UTF-8 does not pose any additional challenges in this regard compared to any other encoding.
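As a minimal sketch of one way to deal with such a character array, assuming the input arrives as a buffer with an explicit length (from a file read or a network packet, say): check for a zero byte before handing the data to anything that expects a string. `has_embedded_nul` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if any of the first `len` bytes of
 * `buf` is zero, meaning str* functions would stop before the end of
 * the data. */
bool has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, 0, len) != NULL;
}
```

For the four-byte array `"X\0Y"`, strlen() reports 1 while `has_embedded_nul(data, 3)` reports true. Whether to reject such input, truncate it, or carry the length alongside the bytes is an application decision.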

Some multibyte encodings allow zero bytes in characters that are not the null character. If you use such an encoding, you may need to be a little extra careful not to confuse a zero byte with the null character. UTF-8 is not one of those. A zero byte always represents the null character in UTF-8.
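To see why a zero byte can only come from the null character in UTF-8, here is a rough sketch of a code point encoder. `utf8_encode` is a hypothetical name, not a standard library function. Every lead byte of a multi-byte sequence has its top bits set (0xC0 or higher) and every continuation byte is in the range 0x80 to 0xBF, so only the one-byte encoding of U+0000 can produce 0x00.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point `cp` as UTF-8 into `out`.
 * Returns the number of bytes written (1 to 4), or 0 if `cp` is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0100 encodes as 0xC4 0x80: its low six bits are all zero, yet the continuation byte is 0x80 rather than 0x00.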

Upvotes: 10
