Reputation: 3091
I use the code below:
puts "matched" if "中国" =~ /\w+/
It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z, or _. So why does it output "matched"?
Could somebody give me some clues?
Upvotes: 10
Views: 3814
Reputation: 24506
I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration, as .NET works this way as well. MSDN says this about it:
\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So \w does not necessarily mean just [a-zA-Z_0-9]: it (and other character classes) behaves differently on Unicode strings than on ASCII ones.
This still makes it different from . though, as \w wouldn't match punctuation characters (sort of; see the \p{Lo} list below), spaces, newlines, and various other non-word symbols.
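To make the category list from the MSDN quote concrete, here is a minimal Ruby sketch that reports which of those six Unicode general categories a character falls into. The `word_category` helper and the `CATEGORIES` table are hypothetical names made up for this illustration, and the script assumes a UTF-8 source file and Ruby 2.4 or later (for `String#match?`).

```ruby
# encoding: utf-8

# The six general categories listed in the MSDN description of \w.
# CATEGORIES and word_category are illustrative names, not a library API.
CATEGORIES = {
  "Ll" => /\p{Ll}/, # lowercase letter
  "Lu" => /\p{Lu}/, # uppercase letter
  "Lt" => /\p{Lt}/, # titlecase letter
  "Lo" => /\p{Lo}/, # other letter (includes CJK ideographs)
  "Nd" => /\p{Nd}/, # decimal digit
  "Pc" => /\p{Pc}/  # connector punctuation (includes "_")
}.freeze

# Return the name of the first listed category the character belongs to,
# or nil if it is in none of them.
def word_category(char)
  found = CATEGORIES.find { |_name, pattern| char.match?(pattern) }
  found && found.first
end

["中", "a", "A", "7", "_", "!", " "].each do |ch|
  puts "#{ch.inspect}: #{word_category(ch).inspect}"
end
```

Each of the two characters in "中国" lands in Lo, which is why a Unicode-aware \w accepts it, while "!" and the space land in none of the six categories.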
As for what exactly [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}] does match, you can see on a Unicode reference list:
Upvotes: 12
Reputation: 160571
Oniguruma, which is the regex engine in Ruby 1.9+, defines \w as:
[\w] word character
Not Unicode:
* alphanumeric, "_" and multibyte char.
Unicode:
* General_Category -- (Letter|Mark|Number|Connector_Punctuation)
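In 1.9+, every Ruby string carries its encoding, so the engine knows when it is matching Unicode text. Whether \w itself uses the Unicode definition depends on the Ruby version: current releases document \w as [a-zA-Z0-9_] only, with \p{Word} and [[:word:]] as the Unicode-aware word classes.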
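Because the treatment of \w has differed between Ruby versions, it is worth checking on the interpreter you actually run. The following is a minimal sketch, assuming a UTF-8 source file and Ruby 2.4 or later; `word_class_report` is a hypothetical helper written for this answer, not part of Ruby.

```ruby
# encoding: utf-8

# Report, for each character class, whether it matches the whole string.
# word_class_report is an illustrative helper, not a Ruby built-in.
def word_class_report(text)
  {
    '\w'         => text.match?(/\A\w+\z/),         # plain \w shorthand
    '\p{Word}'   => text.match?(/\A\p{Word}+\z/),   # Unicode word property
    '[[:word:]]' => text.match?(/\A[[:word:]]+\z/), # POSIX bracket form
    '\p{Han}'    => text.match?(/\A\p{Han}+\z/)     # Han script only
  }
end

word_class_report("中国").each do |char_class, matched|
  puts "#{char_class}: #{matched}"
end
```

On a current Ruby, where the Regexp documentation lists \w as [a-zA-Z0-9_], I would expect the first line to report false and the other three true. If you need "word character in any script", \p{Word} or [[:word:]] states that intent explicitly instead of relying on how \w happens to be defined.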
Upvotes: 3