Reputation: 10939
How can I ensure that \w in a Regexp treats national letters the same as normal letters?
'ein grüner Hund'.scan(/\S+/u)
["ein", "grüner", "Hund"]
It correctly scans the "ü" as a non-whitespace character.
'ein grüner Hund'.scan(/\w+/u)
["ein", "gr", "ner", "Hund"]
How can I get the "ü" too?
I need a solution that is not limited to German; French and Polish characters should work too.
Upvotes: 1
Views: 2325
Reputation: 5236
\w matches only a letter, digit, or underscore. Your regex engine might be treating each Unicode code point as a single character. In that case, ü will not be matched as a single character if it is stored in decomposed form, that is, as two code points (the base letter u followed by a combining diaeresis). To match such multi-code-point characters as well, use \X, which matches a single Unicode grapheme whether it consists of one code point or several.
Check this for more information.
I'm not sure whether Ruby supports \X. Otherwise, \p{L}\p{M}* can be used, which matches a letter along with any accents attached to it.
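As a rough illustration of the \p{L}\p{M}* suggestion, here is a minimal sketch that assumes Ruby 2.0 or later with UTF-8 source and strings. The method name scan_words is made up for this example, and wrapping the pattern in a repeated non-capturing group is just one way to collect whole words rather than single letters.

```ruby
# Hypothetical helper: returns runs of letters, where each letter may be
# followed by any number of combining marks (accents).
def scan_words(text)
  # \p{L} = any Unicode letter, \p{M} = any combining mark.
  # The non-capturing group keeps scan returning whole matches.
  text.scan(/(?:\p{L}\p{M}*)+/)
end

# Precomposed "ü" (a single code point).
p scan_words('ein grüner Hund')

# Decomposed "ü": the letter "u" followed by U+0308 COMBINING DIAERESIS.
p scan_words("ein gru\u0308ner Hund")

# Polish letters are covered by \p{L} as well.
p scan_words('zażółć gęślą jaźń')
```

Unlike \w, this pattern does not match digits or the underscore, so those would have to be added to the pattern explicitly if they should count as word characters.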
Upvotes: 2