Reputation: 87
I'm unsure about the following cases in GCC. Can anyone help?
Can a valid UTF-8 character (other than code point 0) contain a zero byte? If so, I think functions such as strlen will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to '\n'? If so, I think functions such as gets will break that UTF-8 character.
Can a valid UTF-8 character contain a byte whose value is equal to ' ' or '\t'? If so, I think functions such as scanf("%s%s") will split that UTF-8 character and interpret it as two or more words.
Upvotes: 0
Views: 133
Reputation: 122433
The answer to all your questions is the same: no.
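This is one of the advantages of UTF-8: no ASCII byte ever occurs in the encoding of a non-ASCII code point. Every byte of a multi-byte sequence has its high bit set (value 0x80 or above), so it can never be mistaken for '\0', '\n', ' ', '\t', or any other ASCII character.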
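To see why, here is a minimal sketch of a UTF-8 encoder. The names utf8_encode and all_bytes_high are made up for this illustration, and the encoder assumes its input is a valid code point (at most 0x10FFFF), so it does no error checking:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Assumes cp is a valid code point (<= 0x10FFFF); no validation is done.
   Returns the number of bytes written to out (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                 /* ASCII: a single byte, high bit clear */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: returns 1 if every byte in the UTF-8 encoding
   of cp has its high bit set, 0 otherwise. */
int all_bytes_high(uint32_t cp) {
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] < 0x80) {
            return 0;
        }
    }
    return 1;
}
```

Every lead byte is ORed with 0xC0, 0xE0, or 0xF0 and every continuation byte with 0x80, so all_bytes_high returns 1 for any code point of 0x80 or above. For instance, U+010A has 0x0A ('\n') as its low byte, but its UTF-8 encoding is C4 8A.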
For example, you can safely use strlen on a UTF-8 string. Just note that its result is the number of bytes, not the number of code points.
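If you need the number of code points rather than bytes, one common approach is to skip continuation bytes (those of the form 10xxxxxx). The sketch below assumes the input is valid, NUL-terminated UTF-8; utf8_code_points is a name made up for this example, not a standard library function:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
   Assumes s is valid UTF-8; malformed input is not detected. */
size_t utf8_code_points(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" ("héllo" with é encoded as the two bytes C3 A9), strlen returns 6 while utf8_code_points returns 5.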
Upvotes: 5