AlexM

Reputation: 325

German Umlaute and Regular Expressions

I've encountered this strange phenomenon several times now. If I use an ifstream to feed a program with the content of a file and apply a regular expression to the incoming words, the German letters ä, ö and ü cause me some difficulties. If one of them appears at the beginning of a word, the regular expression fails to recognize the word, but not if the letter appears within the word. So these lines

string word = "über";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war genau über ihm.";

won't work because the regex fails to find über in the string search. However,

string word = "für";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war für ihn.";

will work because the ü appears within the word. Why is that, and how can I fix it? I've thought about replacing every ü with ue, every ä with ae and every ö with oe, and undoing the replacement later, but is there another possibility? I'm working with Visual Studio 2015.

Upvotes: 2

Views: 1565

Answers (1)

cshu

Reputation: 5944

Use

regex check {"(^|[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e])über($|[\\x60\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x7b-\\x7e])", regex_constants::icase};

instead.
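The pattern above hard-codes über. A minimal sketch of a helper that builds the same kind of pattern for any word might look like the following. The names make_word_regex and contains_word are made up for illustration, and the sketch assumes that the word contains no regex metacharacters and that the word and the text use the same byte encoding.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original answer): surrounds `word`
// with explicit delimiter groups instead of relying on \b.
// Assumption: `word` contains no regex metacharacters.
std::regex make_word_regex(const std::string& word) {
    // ASCII characters that are not A-Za-z0-9_ (same ranges as in the answer).
    const std::string delim =
        "[\\x00-\\x2f\\x3a-\\x40\\x5b-\\x5e\\x60\\x7b-\\x7e]";
    return std::regex("(^|" + delim + ")" + word + "($|" + delim + ")",
                      std::regex_constants::icase);
}

// Hypothetical convenience wrapper: true if `word` occurs in `text`
// delimited by ASCII non-word characters or by the ends of the text.
bool contains_word(const std::string& text, const std::string& word) {
    return std::regex_search(text, make_word_regex(word));
}
```

Note that a match found this way includes the delimiter characters on either side of the word. Whether icase also folds Ü to ü depends on the encoding and the locale, so it is best not to rely on it.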

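The default grammar of C++ std::regex is ECMAScript, the same flavor JavaScript uses, and its \b is not Unicode-aware: according to the Microsoft documentation quoted below, only A-Za-z0-9_ count as word characters. In "über" the ü is therefore a non-word character that follows a space, which is also a non-word character, so there is no boundary in front of it and \büber\b cannot match. "für" begins with f and ends with r, both word characters, so both boundaries exist and the ü inside the word does no harm.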

And from microsoft.com:

Word Boundary

A word boundary occurs in the following situations:

  • The current character is at the beginning of the target sequence and is one of the word characters A-Za-z0-9_.

  • The current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters.

  • The current character is one of the word characters and the preceding character is not.

  • The current character is not one of the word characters and the preceding character is.
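The four rules above boil down to a single test: a boundary exists where exactly one of the two neighbouring characters is a word character. The following is a minimal sketch of that rule, not the library's actual implementation; the function names is_word_char and is_word_boundary are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// True for the word characters named in the quoted documentation: A-Za-z0-9_.
bool is_word_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// True if a word boundary lies immediately before position `pos`
// (0 <= pos <= s.size()). Positions outside the string count as non-word.
bool is_word_boundary(const std::string& s, std::size_t pos) {
    const bool prev = pos > 0 && is_word_char(s[pos - 1]);
    const bool curr = pos < s.size() && is_word_char(s[pos]);
    return prev != curr;  // boundary only where word and non-word meet
}
```

Applied to "Es war genau über ihm.", the position where über starts has a space before it and a non-ASCII byte after it, so neither side is a word character and no boundary is reported. In "Es war für ihn." the position where für starts sits between a space and f, so a boundary is reported.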

Upvotes: 1
