Reputation: 114228
Ruby's character classes for punctuation characters, i.e. [:punct:], \p{Punct}, or \p{P}, seem to match different characters depending on the Ruby version I'm using.
Here's a little example: (sorry for messing with SO's syntax highlighter)
# punct.rb
chars = <<-EOD.split
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~
EOD
matched, not_matched = chars.partition { |c| c =~ /[[:punct:]]/ }
puts " matched: #{matched.join}"
puts "not matched: #{not_matched.join}"
With Ruby 1.9.3 and again with Ruby 2.4.0 I get:
matched: !"#$%&'()*+,-./:;<=>?@[]^_`{|}~
not matched:
But various Ruby versions in-between (2.0.x, 2.1.x, 2.2.x, 2.3.x) give me:
matched: !"#%&'()*,-./:;?@[]_{}
not matched: $+<=>^`|~
Why is this happening, and what is the correct behavior? More importantly, how can I achieve a consistent result across Ruby versions?
Changing my locale, as suggested in "Why does Ruby /[[:punct:]]/ miss some punctuation characters?", made no difference.
Upvotes: 4
Views: 348
Reputation: 5345
Ruby 1.9.3 used US-ASCII as its default source encoding, and with that encoding [[:punct:]] properly matched all ASCII punctuation. Ruby 2.0 switched the default source encoding to UTF-8, which exposed the bug you discovered: with UTF-8 strings, some punctuation characters are not matched. Ruby 2.4 fixed this bug.
The correct behavior is to match all punctuation, as Ruby 1.9.3 and 2.4 do. This is consistent with the POSIX definition of the [:punct:] class.
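If you want to see how a given interpreter behaves, here is a small sketch that tags the same character with each encoding and reports whether [[:punct:]] matches. The helper name punct_match_by_encoding is made up for this example, and the result depends on the Ruby version you run it with.

```ruby
# punct_match_by_encoding is a hypothetical helper, not part of Ruby.
# It returns a hash of encoding name => whether [[:punct:]] matched.
def punct_match_by_encoding(char)
  pairs = [Encoding::US_ASCII, Encoding::UTF_8].map do |enc|
    # dup first so a frozen literal can be re-tagged safely
    tagged = char.dup.force_encoding(enc)
    [enc.name, !(tagged =~ /[[:punct:]]/).nil?]
  end
  Hash[pairs]
end

# "$" is one of the characters reported as not matched on 2.0.x to 2.3.x.
p punct_match_by_encoding("$")
```

Running this under each Ruby version you care about shows whether the UTF-8 result differs from the US-ASCII one there.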
One way to make your code consistent is to encode all strings as US-ASCII, or as another encoding that isn't affected by the UTF-8 bug:
matched, unmatched = chars.partition { |c| c.encode(Encoding::US_ASCII) =~ /[[:punct:]]/ }
But that's probably not desirable because it forces you to use a restrictive encoding for your strings.
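To make the restriction concrete: transcoding to US-ASCII raises as soon as a string contains a character with no ASCII representation, so you'd have to handle that somewhere. A minimal sketch, assuming a hypothetical helper called ascii_punct? that simply treats such characters as non-punctuation:

```ruby
# ascii_punct? is a hypothetical helper name, not part of Ruby.
def ascii_punct?(char)
  !(char.encode(Encoding::US_ASCII) =~ /[[:punct:]]/).nil?
rescue Encoding::UndefinedConversionError
  # Raised when the character has no US-ASCII representation.
  false
end

puts ascii_punct?("$")
puts ascii_punct?("a")
puts ascii_punct?("\u00e9") # e with acute accent, not representable in US-ASCII
```

Whether returning false is the right fallback depends on your data; the point is only that non-ASCII input needs special handling with this approach.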
The other option is to manually define the punctuation:
/[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
It's somewhat inelegant, but you can throw it into a variable and add it to regexes that way:
punctuation = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
my_regex = /#{punctuation}/
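As a quick sanity check of the explicit class, here is a sketch that runs it against every printable ASCII character that isn't a letter or digit. The variable names are just for illustration.

```ruby
# Explicit list of the 32 ASCII punctuation characters, so nothing
# depends on how [[:punct:]] behaves in a given Ruby version.
ascii_punct = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/

# Every printable ASCII character that is not a letter or digit.
candidates = (33..126).map(&:chr).reject { |c| c =~ /[A-Za-z0-9]/ }

matched, not_matched = candidates.partition { |c| c =~ ascii_punct }
puts "    matched: #{matched.join}"
puts "not matched: #{not_matched.join}"

# A Regexp can also be interpolated into a larger pattern.
word_then_punct = /\w#{ascii_punct}/
puts "a! matches" if "a!" =~ word_then_punct
```

Since the class is spelled out by hand, the "not matched" line should come out empty regardless of the Ruby version.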
Upvotes: 4