ywenbo
ywenbo

Reputation: 3091

\w in Ruby Regular Expression matches Chinese characters

I use the code below:

puts "matched"  if "中国" =~ /\w+/

it puts "matched" and surprised me, since "中国" is two Chinese characters, it doesn't any of 0-9, a-z, A-Z and _, but why it outputs "matched".

Could somebody give me some clues?

Upvotes: 10

Views: 3814

Answers (2)

Michael Low
Michael Low

Reputation: 24506

I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration as .net works this way as well. MSDN says this about it:

\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

So it's not the case that \w necessarily just means [a-zA-Z_0-9] - it (and other operators) operate differently on Unicode strings compared to how they do for Ascii ones.

This still makes it different from . though, as \w wouldn't match punctuation characters (sort of - see the \p{Lo} list below though) , spaces, new lines and various other non-word symbols.

As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:

Upvotes: 12

the Tin Man
the Tin Man

Reputation: 160571

Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:

[\w]       word character

           Not Unicode:
           * alphanumeric, "_" and multibyte char. 
           Unicode:
           * General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In 1.9+, Ruby knows if the string has Unicode characters, and automatically switches to use Unicode mode for pattern matching.

Upvotes: 3

Related Questions