Reputation: 3338
irb(main):010:0> str = "sar\xE0".force_encoding "ASCII-8BIT"
irb(main):011:0> str.encode 'ISO-8859-1', "ASCII-8BIT"
Encoding::UndefinedConversionError: "\xE0" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
from (irb):11:in `encode'
from (irb):11
from /Users/ben/.rbenv/versions/2.4.1/bin/irb:11:in `<main>'
I have a string in ASCII-8BIT (binary) encoding, and I want to convert it to another encoding. It seems that every conversion first goes through UTF-8, so it fails (basically, it forces me to substitute undefined characters).
Why is this happening? How can I avoid it?
Upvotes: 2
Views: 4257
Reputation: 114148
Given a string in binary (ASCII-8BIT) encoding:
str = "sar\xE0".b #=> "sar\xE0"
str.encoding #=> #<Encoding:ASCII-8BIT>
You can tell Ruby that this string is actually in ISO-8859-1 via force_encoding:
str.force_encoding('ISO-8859-1') #=> "sar\xE0"
str.encoding #=> #<Encoding:ISO-8859-1>
Note that you still see \xE0 because Ruby does not attempt to convert the character; force_encoding only relabels the existing bytes.
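To make the "relabel, don't convert" point concrete, here is a minimal sketch that compares the bytes before and after force_encoding. The variable names are only illustrative, and the literal stands in for binary data that would normally come from a file or socket.

```ruby
# Assumption: this literal stands in for binary data read from elsewhere.
str = "sar\xE0".b
bytes_before = str.bytes

# force_encoding changes the encoding tag in place; no transcoding happens.
str.force_encoding('ISO-8859-1')
bytes_after = str.bytes

same_bytes = (bytes_before == bytes_after)

p str.encoding   # the new label
p bytes_after    # the untouched bytes
p same_bytes
```

The byte array is identical in both cases; only the encoding Ruby associates with the string has changed.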
Printing the string on a UTF-8 terminal gives:
puts str
sar�
The replacement character � is shown because a lone 0xE0 byte is not a valid UTF-8 sequence.
Printing the same string on an ISO-8859-1 terminal, however, gives:
puts str
sarà
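The difference between the two terminals comes down to whether the bytes form a valid sequence in the encoding being assumed. As a rough sketch, you can ask Ruby the same question with valid_encoding? (the variable names here are just for illustration):

```ruby
raw = "sar\xE0".b

# Label copies of the same bytes with two different encodings.
as_utf8   = raw.dup.force_encoding('UTF-8')
as_latin1 = raw.dup.force_encoding('ISO-8859-1')

# A trailing 0xE0 with no continuation bytes is malformed UTF-8,
# while every single byte is a valid ISO-8859-1 character.
utf8_ok   = as_utf8.valid_encoding?
latin1_ok = as_latin1.valid_encoding?

p utf8_ok
p latin1_ok
```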
To work with the string in Ruby, you usually want to convert it to UTF-8 via encode!:
str.encode!('UTF-8') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>
Or do it in a single step by passing both the destination encoding and the source encoding to encode!:
str = "sar\xE0".b #=> "sar\xE0"
str.encode!('UTF-8', 'ISO-8859-1') #=> "sarà"
str.encoding #=> #<Encoding:UTF-8>
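If you would rather not mutate the original string, the non-bang encode takes the same arguments and returns a new string. The sketch below also shows the failure from the question for comparison: converting straight from ASCII-8BIT raises, presumably because Ruby has no character mapping for binary bytes above 0x7F. Variable names are illustrative.

```ruby
raw = "sar\xE0".b

# Converting directly from binary fails: 0xE0 has no defined character.
direct_error = begin
  raw.encode('UTF-8')
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Naming the real source encoding lets the conversion succeed.
# encode (without !) returns a new string and leaves raw untouched.
utf8 = raw.encode('UTF-8', 'ISO-8859-1')

p direct_error
p utf8.bytes     # 0xE0 becomes the two-byte UTF-8 sequence 0xC3 0xA0
p raw.encoding
```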
Upvotes: 3