Reputation: 3484
I have a ruby program that I'm trying to upgrade form ruby 1.8 to ruby 2.0.0-p247.
This works just fine in 1.8.7:
begin
ARGF.each do |line|
# a collection of pecluliarlities, appended as they appear in data
line.gsub!("\x92", "'")
line.gsub!("\x96", "-")
puts line
end
rescue => e
$stderr << "exception on line #{$.}:\n"
$stderr << "#{e.message}:\n"
$stderr << @line
end
But under ruby 2.0, this results in this an exxeption when encountering the 96 or 92 encoded into a data file that otherwise contains what appears to be ASCII:
invalid byte sequence in UTF-8
I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc. and am stumped.
Can anybody fill in the missing puzzle piece for me?
Thanks.
=============== additions: 2013-09-25 ============
Changing \x92 to \u2019 did not fix the problem.
The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.
Upvotes: 0
Views: 532
Reputation: 75232
It's not the regex that's throwing the exception, it's the Ruby compiler. \x92
and \x96
are how you would represent ’
and –
in the windows-1252 encoding, but Ruby expects the string to be UTF-8 encoded. You need to get out of the habit of putting raw byte values like \x92
in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019
and \u2013
).
It's a Unicode world now, stop thinking of text in terms of bytes and think in terms of characters instead.
Upvotes: 3