German Umlaute and Regular Expressions

Question

I've encountered this strange phenomenon several times now. When I use an ifstream to feed a program with the content of a file and apply a regular expression to the incoming words, the German letters ä, ö and ü cause trouble. If one of them appears at the beginning of a word, the regular expression fails to match that word, but it matches fine if the letter appears inside the word. So these lines

string word = "über";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war genau über ihm.";

won't work because the regex fails to find über in the string search. However,

string word = "für";
regex check {R"(\b)" + word + R"(\b)", regex_constants::icase};
string search = "Es war für ihn.";

will work because the ü appears within the word. Why is that, and how can I fix it? I've thought about replacing every ü with ue, every ä with ae and every ö with oe, and undoing the replacement later, but is there another possibility? I'm working with Visual Studio 2015.

cshu · Accepted Answer

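Use regex check {R"((^|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])über($|[\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e]))", regex_constants::icase}; instead. Note the raw string literal: in an ordinary literal the compiler itself would expand \x00 into an embedded null character, which cuts the pattern off before the regex engine ever sees the rest.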

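The default grammar of C++ regex is ECMAScript, i.e. similar to JavaScript, and its \b doesn't support Unicode: only A-Za-z0-9_ count as word characters. The ü is therefore treated like punctuation, so in "genau über" there is no boundary between the space and the ü, and the leading \b can't match. In "für" the boundaries sit before the f and after the r, which are both word characters, so the ü inside the word doesn't matter.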
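To avoid typing the long character class twice for every word, the pattern could be built by a small helper. This is only a sketch: whole_word_regex is a hypothetical name that is not part of the answer, and it assumes the word contains no regex metacharacters and that the pattern and the searched text use the same byte encoding.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): wraps `word` in the
// explicit delimiter class from the answer instead of relying on \b.
// Assumption: `word` contains no regex metacharacters such as '.' or '('.
std::regex whole_word_regex(const std::string& word) {
    // Raw string, so the regex engine (not the compiler) interprets \xHH.
    // The class covers ASCII control characters, space and punctuation,
    // but leaves out '_' (0x5f), digits and letters.
    static const std::string delim =
        R"([\x60\x00-\x2f\x3a-\x40\x5b-\x5e\x7b-\x7e])";
    return std::regex("(^|" + delim + ")" + word + "($|" + delim + ")",
                      std::regex_constants::icase);
}
```

With this, regex_search(search, whole_word_regex(word)) should find über in the first example while still rejecting it inside darüber. Keep in mind that the match then includes the delimiter on either side, and that icase will probably fold only ASCII letters unless a suitable locale is imbued, so Über is not guaranteed to match über.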

And from microsoft.com:

Word Boundary

A word boundary occurs in the following situations:

The current character is at the beginning of the target sequence and is one of the word characters A-Za-z0-9_.

The current character position is past the end of the target sequence and the last character in the target sequence is one of the word characters.

The current character is one of the word characters and the preceding character is not.

The current character is not one of the word characters and the preceding character is.
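The four rules boil down to "exactly one of the two neighbouring characters is a word character". A minimal sketch of that rule, useful for checking by hand why the leading \b fails before ü; is_word_char and is_word_boundary are illustrative names, not the library's implementation, and the text is assumed to be UTF-8 or a single-byte code page where ü is encoded with bytes outside A-Za-z0-9_:

```cpp
#include <cstddef>
#include <string>

// Word characters exactly as listed in the quoted documentation: A-Za-z0-9_.
bool is_word_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// True if the quoted rules put a word boundary at position `pos`,
// where 0 <= pos <= text.size(). Positions outside the text count as
// "not a word character", which covers the first two rules.
bool is_word_boundary(const std::string& text, std::size_t pos) {
    const bool prev = pos > 0 && is_word_char(text[pos - 1]);
    const bool curr = pos < text.size() && is_word_char(text[pos]);
    return prev != curr;  // boundary only where word and non-word meet
}
```

For "Es war genau über ihm." the position where über starts has a space on the left and the first byte of ü on the right, neither of which is a word character, so is_word_boundary returns false there. For "Es war für ihn." the position where für starts has a space on the left and f on the right, so it returns true.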
