jkj yuio

Reputation: 2613

How to make std::regex match UTF-8

I would like a pattern like ".c" to match any UTF-8 character followed by 'c', with "." matching the whole multi-byte character, using std::regex.

I've tried under Microsoft C++ and g++. I get the same result: each time, the "." only matches a single byte.

Here's my test case:

#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);

    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);

    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'

        string m = match[0];
        cout << "match length " << m.size() << endl;

        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (size_t i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";

    return 0;
}

I wanted the pattern ".c" to match 3 bytes of my tobesearched string: a 2-byte UTF-8 character followed by 'c'.

Upvotes: 1

Views: 2968

Answers (1)

Impact

Reputation: 151

Some regex flavours support \X, which matches a single Unicode grapheme (one user-perceived character) and may therefore span several bytes depending on the encoding. It is common practice for regex engines to receive the subject string in an encoding the engine is designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).

Another option is \uFFFF, where FFFF is the hexadecimal code point of the character in the Unicode character set. With that, you could create a ranged match inside a character class, i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. A variant of \u is \x{...}, which does the same thing, except that the code point must be supplied inside curly braces and need not be padded, e.g. \x{65}.

Edit: This website is amazing for learning more about regex across various flavours https://www.regular-expressions.info

Edit 2: To match any Unicode-exclusive character, i.e. excluding characters in the ASCII table / 1-byte characters, you can try "[\x{80}-\x{10FFFF}]", i.e. any character with a code point from 128 (the first one outside the ASCII range) to 1,114,111 (U+10FFFF, the last Unicode code point). UTF-8 currently encodes that range in at most 4 bytes (the original design allowed up to 6, and the limit may change in future).

A loop through the individual bytes would be more efficient, though:

  1. If the lead bit is 0, i.e. if its signed value is > -1, it is a 1 byte char representation. Skip to the next byte and start again.
  2. Else if the lead bits are 11110 i.e. if its signed value is > -17, n=4.
  3. Else if the lead bits are 1110 i.e. if its signed value is > -33, n=3.
  4. Else if the lead bits are 110 i.e. if its signed value is > -65, n=2.
  5. Optionally, check that the next n-1 bytes each start with 10, i.e. each has a signed value < -64. If any of them does not, it is invalid UTF-8 encoding.
  6. You now know that these n bytes constitute a Unicode-exclusive character. So, if the NEXT byte is 'c', i.e. == 99, you can say it matched - return true.

Upvotes: 2
