astropanic

Reputation: 10939

Ruby 1.9.3 Regex utf8 \w accented characters

How can I ensure that \w in a Regexp treats national (accented) letters the same as ASCII letters?

'ein grüner Hund'.scan(/\S+/u)

["ein", "grüner", "Hund"]

It correctly scans the "ü" as a non-whitespace character.

'ein grüner Hund'.scan(/\w+/u)

["ein", "gr", "ner", "Hund"]

How do I get the "ü" too?

I need a solution that is not limited to German; French and Polish characters should work too.

Upvotes: 1

Views: 2325

Answers (2)

Naveed S

Reputation: 5236

\w matches a letter, digit, or underscore, but your regex engine might restrict it to ASCII, or it might treat a single Unicode code point as a single character. In the latter case, ü will not be matched as a single character when it is written in decomposed form, i.e. as two code points (u followed by the combining diaeresis U+0308). The precomposed ü (U+00FC) is a single code point, although it takes two bytes in UTF-8. To match multi-code-point characters as well, use \X, which matches a single Unicode grapheme whether it consists of one code point or several.

Check this for more information.

I'm not sure whether Ruby supports \X. Otherwise, \p{L}\p{M}* can be used, which matches a letter together with any combining accent marks that follow it.
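A minimal sketch of the \p{L}\p{M}* idea, assuming UTF-8 strings and a Ruby whose regex engine accepts the \p{L} and \p{M} properties. The constant name LETTER_RUN is made up for illustration, and the magic encoding comment should only be needed on Ruby 1.9. Note that this pattern matches letters only, so unlike \w it skips digits and underscores.

```ruby
# encoding: utf-8

# One or more "letter plus any trailing combining marks" units.
# LETTER_RUN is a hypothetical name, not something from the question.
LETTER_RUN = /(?:\p{L}\p{M}*)+/

composed   = "ein gr\u00FCner Hund"   # ü as one code point (U+00FC)
decomposed = "ein gru\u0308ner Hund"  # u followed by combining diaeresis

words            = composed.scan(LETTER_RUN)
decomposed_words = decomposed.scan(LETTER_RUN)

# Without \p{M}*, the combining mark splits the decomposed word.
letters_only = decomposed.scan(/\p{L}+/)

p words             # ["ein", "grüner", "Hund"]
p decomposed_words  # same three words, "grüner" kept in one piece
p letters_only      # ["ein", "gru", "ner", "Hund"]
```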

Upvotes: 2

Yevgeniy Anfilofyev

Reputation: 4847

Try

'ein grüner Hund'.scan(/[[:word:]]+/u)

Documentation
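For comparison, here is a small sketch of how \w, [[:word:]] and the equivalent \p{Word} property behave on UTF-8 strings, as far as I can tell from the Ruby Regexp documentation. The variable names and the Polish and snake_case sample strings are mine, added only for illustration.

```ruby
# encoding: utf-8

german = "ein gr\u00FCner Hund"
polish = 'zażółć gęślą jaźń'

# \w only covers [a-zA-Z0-9_] here, so the word breaks at ü.
ascii_words = german.scan(/\w+/)

# [[:word:]] and \p{Word} cover Unicode letters, marks, numbers
# and connector punctuation such as the underscore.
posix_words    = german.scan(/[[:word:]]+/)
property_words = german.scan(/\p{Word}+/)
polish_words   = polish.scan(/[[:word:]]+/)
mixed_words    = 'foo_bar 42'.scan(/[[:word:]]+/)

p ascii_words     # ["ein", "gr", "ner", "Hund"]
p posix_words     # ["ein", "grüner", "Hund"]
p property_words  # ["ein", "grüner", "Hund"]
p polish_words    # ["zażółć", "gęślą", "jaźń"]
p mixed_words     # ["foo_bar", "42"]
```

Unlike \p{L}\p{M}*, this keeps digits and underscores, so it is the closer drop-in replacement for \w.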

Upvotes: 2
