Zhro

Reputation: 2614

Any caveats when searching for a UTF-8 code point in a string?

If I have some string to be searched in UTF-8 format and another to search for, also in UTF-8 format, are there any caveats to doing a straight up comparison search for the codepoint to pinpoint a matching character?

With the way UTF-8 works, is it ever possible to get a false positive?

I've read a lot of documentation about how great UTF-8 is but I'm having trouble forming a proof to answer this question.

If I search forward then I could skip along the length of a codepoint; but it's walking the string in reverse which worries me.

Instead of walking backwards until I hit the start of a codepoint and then doing a memory comparison from that address, is it safe to simply walk backwards along each byte until I get a full match against the search string?

Upvotes: 3

Views: 415

Answers (2)

Galik

Reputation: 48655

It is actually possible to deduce the byte size of a code point from its first byte, so you can skip along in the forward direction like that. However, your direct pattern-matching approach should also work fine, because continuation bytes are bitwise distinct from initial code-point bytes.

See here for the bit-patterns: https://en.wikipedia.org/wiki/UTF-8#Description
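As a rough illustration of deducing the length from the first byte, here is a minimal sketch in C. The function names `utf8_seq_len` and `utf8_count` are made up for this example, and the code assumes the input is already valid UTF-8; it does not validate the continuation bytes.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
 * `lead` cannot start a sequence (a continuation byte or an invalid byte).
 * Hypothetical helper, not part of any standard library. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                 return 1;  /* 0xxxxxxx: ASCII       */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx: 2-byte lead */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx: 3-byte lead */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx: 4-byte lead */
    return 0;
}

/* Count code points by skipping forward one whole sequence at a time.
 * Assumes s[0..n) is valid UTF-8; returns (size_t)-1 if a byte that
 * cannot start a sequence is met or a sequence runs past the end. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s[i]);
        if (len == 0 || len > n - i)
            return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}
```

For example, the six bytes `68 C3 A9 E2 82 AC` ("h", "é", "€") are walked in three steps of 1, 2 and 3 bytes.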

Also, because the continuation bytes are bitwise distinct from the initial byte of each code point, 'walking back' to find the initial code-point byte is easy. However, you should also have no problem with your proposed scheme of a reverse pattern match.
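The 'walking back' step could look something like the sketch below. The names `utf8_is_continuation` and `utf8_char_start` are hypothetical, and the code assumes valid UTF-8 input and an index that lies inside the buffer.

```c
#include <stddef.h>

/* Continuation bytes all have the bit pattern 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Given an index i into s, return the index of the first byte of the
 * code point that contains s[i]. Steps back over continuation bytes
 * only, so for valid UTF-8 it moves at most three bytes. */
size_t utf8_char_start(const unsigned char *s, size_t i)
{
    while (i > 0 && utf8_is_continuation(s[i]))
        i--;
    return i;
}
```

In the bytes `61 E2 82 AC 7A` ("a", "€", "z"), any index from 1 to 3 resolves to 1, the lead byte of "€".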

Upvotes: 0

user149341

Reputation:

Nope. There are no caveats here; this operation is perfectly safe in UTF-8.

Recall that UTF-8 represents characters using two general forms:

  • ASCII characters (U+0000 through U+007F), which are all represented literally using a single byte in the range 0x00-0x7F.

  • All other characters, which are represented by a sequence which includes:

    • A leading byte, in the range 0xC2-0xF4, which encodes part of the character data as well as the length of the sequence to follow.
    • One or more continuation bytes in the range 0x80-0xBF, which encode the remaining bits of the character.

Since the ranges for ASCII bytes, leading bytes, and continuation bytes do not overlap, accidentally starting a comparison in the middle of a multi-byte character is fine. You won't find a match at that position, because a valid UTF-8 search string never starts with a continuation byte, but you won't get any false positives either.
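To make that concrete, here is a minimal sketch of the reverse byte-by-byte search described in the question. The name `utf8_rfind` is hypothetical, and the sketch assumes both buffers hold valid UTF-8 and that the search string consists of whole characters. It tries every byte offset from the end with `memcmp` and never checks for character boundaries.

```c
#include <stddef.h>
#include <string.h>

/* Return the byte index of the last occurrence of needle in hay,
 * or (size_t)-1 if there is none. Every byte offset is tried, including
 * offsets that fall in the middle of a multi-byte character. */
size_t utf8_rfind(const unsigned char *hay, size_t hay_len,
                  const unsigned char *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return (size_t)-1;

    /* i runs from hay_len - needle_len down to 0 inclusive. */
    for (size_t i = hay_len - needle_len + 1; i-- > 0; ) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return i;
    }
    return (size_t)-1;
}
```

Take the haystack "€é€", which is the bytes `E2 82 AC C3 A9 E2 82 AC`. Searching for "é" (`C3 A9`) matches only at offset 3, and searching for U+0082 (`C2 82`) finds nothing even though the byte `82` occurs twice as a continuation byte of "€". The guarantee does not extend to a search string that is a fragment of a character or to input that is not valid UTF-8.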

Upvotes: 7
