Reputation: 2614
If I have some string to be searched in UTF-8 format and another to search for, also in UTF-8 format, are there any caveats to doing a straight up comparison search for the codepoint to pinpoint a matching character?
With the way UTF-8 works, is it ever possible to get a false positive?
I've read a lot of documentation about how great UTF-8 is but I'm having trouble forming a proof to answer this question.
If I search forward then I could skip along the length of a codepoint; but it's walking the string in reverse which worries me.
Instead of walking backwards until I hit the start of a codepoint and then doing a memory comparison from that address, is it safe to simply walk backwards along each byte until I get a full match against the search string?
Upvotes: 3
Views: 415
Reputation: 48655
It is actually possible to deduce the byte size of a code point from its first byte, so you can skip along in the forward direction like that. However, your direct pattern-matching approach should also work fine, as continuation bytes are bitwise distinct from initial code-point bytes.
See here for the bit-patterns: https://en.wikipedia.org/wiki/UTF-8#Description
Also, because the continuation bytes are bitwise distinct from the initial byte of each code point, 'walking back' to find the initial code-point byte is easy. However, you should also have no problem with your proposed scheme of a reverse pattern match.
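To make the two points above concrete, here is a minimal Python sketch. The helper names (`sequence_length`, `is_continuation`, `code_point_start`) are hypothetical, chosen only for this illustration, and the code assumes the input is valid UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII, a single byte
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: one continuation byte follows
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: two continuation bytes follow
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: three continuation bytes follow
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF never appear.
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 leading byte")


def is_continuation(b: int) -> bool:
    """True if `b` has the bit pattern 10xxxxxx (range 0x80-0xBF)."""
    return (b & 0xC0) == 0x80


def code_point_start(data: bytes, i: int) -> int:
    """Walk backwards from index `i` to the first byte of its code point."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
print(sequence_length(data[1]))   # leading byte of U+00E9
print(code_point_start(data, 9))  # last byte belongs to the 4-byte sequence
```

Walking back never needs more than three steps on valid input, since a sequence has at most three continuation bytes.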
Upvotes: 0
Reputation:
Nope. There are no caveats here; this operation is perfectly safe in UTF-8.
Recall that UTF-8 represents characters using two general forms:
ASCII characters (U+0000 through U+007F), which are all represented literally using a single byte in the range 0x00-0x7F.
All other characters, which are represented by a sequence which includes:
    A leading byte in the range 0xC2-0xF4, which encodes part of the character data as well as the length of the sequence to follow.
    One or more continuation bytes in the range 0x80-0xBF, each of which encodes part of the remainder of the character.
Since there is no overlap between leading and continuation bytes, accidentally starting a search in the middle of a multi-byte character is fine. You won't find your match, because the string you're searching for won't start with a continuation byte, but you won't find any false positives either.
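A minimal sketch of the byte-by-byte reverse search described in the question, assuming both the haystack and the needle are complete, valid UTF-8 strings. The function name `rfind_bytes` is hypothetical; in practice Python's built-in `bytes.rfind` does the same job.

```python
def rfind_bytes(haystack: bytes, needle: bytes) -> int:
    """Return the last offset at which `needle` occurs, or -1 if absent.

    The scan moves backwards one byte at a time and makes no attempt
    to align itself to code-point boundaries first.
    """
    n = len(needle)
    for i in range(len(haystack) - n, -1, -1):
        if haystack[i:i + n] == needle:
            return i
    return -1


haystack = "\u20ac\u00e9\u20ac".encode("utf-8")  # E2 82 AC  C3 A9  E2 82 AC
needle = "\u20ac".encode("utf-8")                # E2 82 AC

pos = rfind_bytes(haystack, needle)
# A match always lands on a leading (or ASCII) byte, never on a continuation byte.
assert (haystack[pos] & 0xC0) != 0x80

# U+00AC encodes as C2 AC. Its final byte AC also ends the U+20AC sequence,
# but the leading byte C2 cannot match the continuation byte 82, so no hit.
assert rfind_bytes("\u20ac".encode("utf-8"), "\u00ac".encode("utf-8")) == -1
```

The second assertion is the interesting case: a partial overlap on trailing bytes can occur, but a full match cannot, because the needle's first byte is never a continuation byte.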
Upvotes: 7