\w in Ruby Regular Expression matches Chinese characters

Question

I use the code below:

puts "matched" if "中国" =~ /\w+/

It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z or _. So why does it output "matched"?

Could somebody give me some clues?

Michael Low · Accepted Answer

I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:

\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

So \w does not necessarily mean just [a-zA-Z_0-9]: it (like some other character classes) behaves differently on Unicode strings than on ASCII ones.
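To see the ASCII/Unicode split from Ruby itself, here is a minimal sketch, assuming Ruby 1.9 or later. It sidesteps \w and spells out both meanings: an explicit ASCII class and the Unicode-aware \p{Word} property. The helper name word_kind is made up for illustration. As far as I know, \w is documented as ASCII-only from Ruby 1.9 onward, so the match in the question suggests an older Ruby or a particular encoding setting; treat that as an assumption about your setup.

```ruby
# A sketch assuming Ruby 1.9 or later; word_kind is a hypothetical helper.
def word_kind(str)
  case str
  when /\A[a-zA-Z0-9_]+\z/ then :ascii_word    # explicit ASCII-only word class
  when /\A\p{Word}+\z/     then :unicode_word  # Unicode letters, marks, numbers, connector punctuation
  else :not_a_word
  end
end

chinese = "\u4E2D\u56FD" # the string from the question, "中国", written with escapes

puts word_kind("hello_42")     # ASCII letters, digits and underscore only
puts word_kind(chinese)        # word characters, but not ASCII ones
puts word_kind("hello, world") # the comma and the space are not word characters
```

If you want the narrow meaning regardless of version, writing [a-zA-Z0-9_] explicitly is the safest choice. If you want the broad meaning, \p{Word} says so directly.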

This still makes it different from . though, as \w wouldn't match punctuation characters (mostly; see the \p{Lo} list below), spaces, newlines and various other non-word symbols.

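As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} matches: these are Unicode general categories, namely lowercase, uppercase, titlecase and other letters, decimal digits, and connector punctuation. Chinese characters such as 中 and 国 fall under \p{Lo}. You can see the members of each category on a Unicode reference list.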
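As a rough way to check single characters without a reference list, the sketch below tests a character against each of the six categories from the MSDN definition and reports the first that matches. It assumes Ruby 1.9 or later with UTF-8 strings, and word_category is a hypothetical helper, not part of any library.

```ruby
# Returns the first of the six "word" categories that the character
# belongs to, or nil if it belongs to none of them.
def word_category(char)
  %w[Ll Lu Lt Lo Nd Pc].find do |cat|
    Regexp.new("\\A\\p{#{cat}}\\z").match(char)
  end
end

# "\u01C5" is a titlecase letter, "\u4E2D" is the first character of "中国".
samples = ["a", "A", "\u01C5", "\u4E2D", "5", "_", ",", " "]
samples.each do |ch|
  puts "#{ch.inspect}: #{word_category(ch) || 'none of the six'}"
end
```

The comma and the space belong to none of the six categories, which is why a Unicode-aware \w still differs from the . metacharacter.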
