How to make std::regex match UTF-8

Question

Using std::regex, I would like the "." in a pattern such as ".c" to match any single UTF-8 character, so that the pattern matches one UTF-8 character followed by 'c'.

I've tried both Microsoft C++ and g++ and get the same result: the "." only ever matches a single byte.

Here's my test case:

#include <cstdio>
#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched(reinterpret_cast<const char*>(p));

    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);

    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'

        string m = match[0];
        cout << "match length " << m.size() << endl;

        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (size_t i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";

    return 0;
}

I wanted the pattern ".c" to match 3 bytes of my tobesearched string: a 2-byte UTF-8 character followed by 'c'.
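
One direction that might work, sketched here under assumptions rather than as a confirmed fix: decode the UTF-8 bytes into a std::wstring with one element per code point, then search with std::wregex so that "." consumes a whole character. The helpers utf8_to_wide, search_wide and utf8_byte_count are hypothetical names written for this sketch, not library functions. It assumes well-formed UTF-8 and a wchar_t wide enough for each code point (32-bit on typical Linux builds; with the 16-bit wchar_t on Windows only BMP characters would fit in one element).

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into one wchar_t per code point.
// Assumes well-formed input; performs no validation beyond a debug-build
// check for a sequence truncated at the end of the string.
std::wstring utf8_to_wide(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= in.size());
        // Keep only the payload bits of the lead byte.
        unsigned long cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical helper: search the decoded text with a wide pattern and
// return the matched text, or an empty string when nothing matches.
std::wstring search_wide(const std::string& utf8, const std::wstring& pattern)
{
    const std::wstring wide = utf8_to_wide(utf8);
    const std::wregex re(pattern);
    std::wsmatch match;
    if (!std::regex_search(wide, match, re))
        return std::wstring();
    return match.str(0);
}

// Hypothetical helper: number of bytes the wide string would occupy when
// re-encoded as UTF-8 (assumes one code point per element, no surrogates).
std::size_t utf8_byte_count(const std::wstring& s)
{
    std::size_t n = 0;
    for (wchar_t ch : s) {
        unsigned long cp = static_cast<unsigned long>(ch);
        n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
    return n;
}
```

With the bytes from my test case, search_wide("a\xC2\x80" "c", L".c") should give a two-element match (U+0080 followed by 'c'), and utf8_byte_count on that match reports the 3 bytes I was expecting. The string literal is split in two so that the 'c' is not parsed as part of the \x80 escape.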
