dqmis

Reputation: 489

Filtering a UTF-8 string using regex

I am trying to filter strings by stripping special characters and converting the result to lowercase. For example, "Good morning!" becomes good morning.
I pass one string at a time to my function.
This works for English strings, but it fails for strings in my native language.
What regex should I use so that all UTF-8 characters are included?

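#include <string>
#include <iostream>
#include <regex>
#include <algorithm>
#include <cctype>

std::string process(std::string s) {
    std::string st;
    std::regex r(R"([^\W_]+(?:['_-][^\W_]+)*)");
    std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
    if (i == std::sregex_iterator()) {
        return st; // no match: avoid dereferencing the end iterator
    }
    std::smatch m = *i;
    st = m.str();
    // go through unsigned char: calling tolower on a negative char is undefined
    std::transform(st.begin(), st.end(), st.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });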
    return st;
}

int main() {
    std::string st = "ąžuolas!";
    std::cout << process(st) << std::endl; // <- gives: uolas
    return 0;
}

Upvotes: 9

Views: 577

Answers (1)

Anmol Singh Jaggi

Reputation: 8576

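You can match any Unicode letter, together with any combining marks that follow it, using the regex \p{L}\p{M}*. Note that this needs a regex engine with Unicode property support (PCRE or ICU, for example); std::regex does not provide \p{...} classes.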

Therefore, the complete regex will be:

((?:\p{L}\p{M}*)+(?:['_-](?:\p{L}\p{M}*)+)*)
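If a Unicode-aware regex engine is not an option, the same token shape can be approximated with a plain byte scan. The following is only a minimal sketch under a loud assumption: every byte >= 0x80 is treated as part of a UTF-8 letter, so non-ASCII punctuation would also be accepted, and only ASCII letters are lowercased. The names process_utf8, is_word_byte and is_joiner are hypothetical and not part of the question's code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: any byte >= 0x80 belongs to a multi-byte UTF-8 letter.
static bool is_word_byte(unsigned char c) {
    return c >= 0x80 || std::isalnum(c) != 0;
}

// The connectors allowed inside a token, as in the regex: ' _ -
static bool is_joiner(unsigned char c) {
    return c == '\'' || c == '_' || c == '-';
}

// Hypothetical stand-in for process(): returns the first word-like token,
// lowercasing ASCII letters only and copying non-ASCII bytes unchanged.
std::string process_utf8(const std::string& s) {
    std::size_t i = 0;
    // Skip everything before the first word byte.
    while (i < s.size() && !is_word_byte(static_cast<unsigned char>(s[i]))) {
        ++i;
    }
    std::string out;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (is_word_byte(c)) {
            out += (c < 0x80) ? static_cast<char>(std::tolower(c))
                              : static_cast<char>(c);
        } else if (is_joiner(c) && i + 1 < s.size() &&
                   is_word_byte(static_cast<unsigned char>(s[i + 1]))) {
            // A connector is kept only when another word byte follows it.
            out += static_cast<char>(c);
        } else {
            break; // end of the first token
        }
        ++i;
    }
    return out;
}
```

With this sketch, process_utf8("ąžuolas!") keeps the leading ą and ž instead of dropping them. Lowercasing non-ASCII letters such as Ą would still need a Unicode-aware library.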

Upvotes: 6
