Any caveats when searching for a UTF-8 code point in a string?

Question

If I have some string to be searched in UTF-8 format and another to search for, also in UTF-8 format, are there any caveats to doing a straight up comparison search for the codepoint to pinpoint a matching character?

With the way UTF-8 works, is it ever possible to get a false positive?

I've read a lot of documentation about how great UTF-8 is but I'm having trouble forming a proof to answer this question.

If I search forward then I could skip along the length of a codepoint; but it's walking the string in reverse which worries me.

Instead of walking backwards until I hit the start of a codepoint and then doing a memory comparison from that address, is it safe to simply walk backwards along each byte until I get a full match against the search string?

user149341 · Accepted Answer

Nope. There are no caveats here; this operation is perfectly safe in UTF-8.

Recall that UTF-8 represents characters using two general forms:

- ASCII characters (U+0000 through U+007F), which are all represented literally using a single byte in the range 0x00-0x7F.
- All other characters, which are represented by a sequence which includes:
  - A leading byte, in the range 0xC2-0xF4, which encodes part of the character data as well as the length of the sequence to follow.
  - One or more continuation bytes in the range 0x80-0xBF, which encode the remainder of the character.
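
To make the byte ranges above concrete, here is a minimal Python sketch that labels each byte of an encoded string by its role. The function name `classify_utf8_byte` and the role labels are made up for illustration; they are not part of any library.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0-255) by its role in well-formed UTF-8."""
    if b <= 0x7F:
        return "ascii"          # a complete single-byte character
    if 0x80 <= b <= 0xBF:
        return "continuation"   # never the first byte of a character
    if 0xC2 <= b <= 0xF4:
        return "lead"           # first byte of a multi-byte sequence
    return "invalid"            # 0xC0, 0xC1 and 0xF5-0xFF never appear in valid UTF-8


# 'a' is one byte, 'é' is two bytes, '€' is three bytes.
encoded = "aé€".encode("utf-8")
roles = [classify_utf8_byte(b) for b in encoded]
print(roles)
```

Every byte value falls into exactly one of these classes, which is the property the rest of the argument relies on.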

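Since there is no overlap between ASCII bytes, leading bytes, and continuation bytes, accidentally starting a comparison in the middle of a multi-byte character is fine, provided both strings are valid UTF-8. You won't find your match at that position, because the string you're searching for won't start with a continuation byte, but you won't find any false positives either. A match also can't end in the middle of a character: the leading byte fixes the length of its sequence, so matching all the bytes of the search string's last character means matching that whole character in the searched string.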
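
As a rough illustration of the reverse walk described in the question, the sketch below compares at every byte offset from the end, without first backing up to a character boundary. It assumes both inputs are valid UTF-8; `rfind_bytes` and `is_char_boundary` are hypothetical helper names, and the sample text is invented for the example.

```python
def rfind_bytes(haystack: bytes, needle: bytes) -> int:
    """Walk backwards one byte at a time; return the offset of the last match, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n, -1, -1):
        if haystack[start:start + n] == needle:
            return start
    return -1


def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset is the end of data or does not point at a continuation byte."""
    return offset == len(data) or not (0x80 <= data[offset] <= 0xBF)


text = "naïve café €5".encode("utf-8")
needle = "é".encode("utf-8")   # b'\xc3\xa9'; 'ï' is b'\xc3\xaf', so only the lead byte is shared

pos = rfind_bytes(text, needle)
print(pos)
print(is_char_boundary(text, pos), is_char_boundary(text, pos + len(needle)))
```

The match found this way starts and ends on character boundaries even though the loop never checks for them. Note that this only holds when the search string is itself complete, valid UTF-8; searching for a lone continuation byte such as `b"\xa9"` would of course match inside a character.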
