ciaoben

Reputation: 3338

How to convert encoding from ASCII-8BIT to another, without passing through UTF-8 in Ruby?

irb(main):010:0> str = "sar\xE0".force_encoding "ASCII-8BIT"
irb(main):011:0> str.encode 'ISO-8859-1', "ASCII-8BIT"
Encoding::UndefinedConversionError: "\xE0" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
    from (irb):11:in `encode'
    from (irb):11
    from /Users/ben/.rbenv/versions/2.4.1/bin/irb:11:in `<main>'

I have a string in ASCII-8BIT (binary) encoding and want to convert it to another encoding, but every conversion seems to go through UTF-8 first and therefore fails (in effect, it forces me to substitute the undefined characters).

Why is this happening? How can I avoid it?

Upvotes: 2

Views: 4257

Answers (1)

Stefan

Reputation: 114148

Given a string in binary (ASCII-8BIT) encoding:

str = "sar\xE0".b #=> "sar\xE0"
str.encoding      #=> #<Encoding:ASCII-8BIT>

You can tell Ruby that this string is actually in ISO-8859-1 via force_encoding:

str.force_encoding('ISO-8859-1') #=> "sar\xE0"
str.encoding                     #=> #<Encoding:ISO-8859-1>

Note that you still see \xE0 because Ruby does not attempt to convert the character.
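As a minimal sketch of what "does not attempt to convert" means here: force_encoding only changes the encoding label, so the bytes before and after should be identical. The variable names below are illustrative and not part of the original answer.

```ruby
# force_encoding relabels the string; it does not transcode any bytes.
binary = "sar\xE0".b                                      # ASCII-8BIT copy
latin1 = binary.dup.force_encoding(Encoding::ISO_8859_1)  # same bytes, new label

bytes_before = binary.bytes            # [115, 97, 114, 224]
bytes_after  = latin1.bytes            # still [115, 97, 114, 224]

# Every single byte is a valid ISO-8859-1 character, so the relabeled
# string is considered well formed.
latin1_valid = latin1.valid_encoding?  # true
```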

Printing the string on a UTF-8 terminal gives:

puts str
sar�

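The replacement character � is shown because the terminal interprets the raw bytes as UTF-8, and a lone 0xE0 byte is not a valid UTF-8 sequence (0xE0 starts a three-byte sequence and must be followed by two continuation bytes).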

Printing the same string on an ISO-8859-1 terminal, however, gives:

puts str
sarà
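The difference between the two terminals can be approximated inside Ruby. This is only a rough analogue, since the terminal performs its own substitution: the sketch relabels the same bytes as UTF-8, checks their validity, and uses String#scrub to replace the invalid byte with U+FFFD.

```ruby
# Interpret the same four bytes as UTF-8, the way a UTF-8 terminal would.
as_utf8 = "sar\xE0".b.force_encoding(Encoding::UTF_8)

utf8_valid = as_utf8.valid_encoding?   # false: 0xE0 has no continuation bytes

# scrub replaces invalid byte sequences; for UTF-8 strings the default
# replacement is U+FFFD, the replacement character.
scrubbed = as_utf8.scrub               # "sar\uFFFD"
```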

To work with the string in Ruby, you usually want to convert it to UTF-8 via encode!:

str.encode!('UTF-8') #=> "sarà"
str.encoding         #=> #<Encoding:UTF-8>

Or, in a single step, by passing both the destination encoding and the source encoding to encode!:

str = "sar\xE0".b                  #=> "sar\xE0"
str.encode!('UTF-8', 'ISO-8859-1') #=> "sarà"
str.encoding                       #=> #<Encoding:UTF-8>
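To tie this back to the question: the error message there suggests that Ruby routes the conversion through UTF-8, and bytes at or above 0x80 have no character mapping in ASCII-8BIT, so that step fails. The sketch below assumes the data really is ISO-8859-1 (as in the answer) and shows that naming the real source encoding works for targets other than UTF-8 too. UTF-16LE is picked purely for illustration.

```ruby
binary = "sar\xE0".b

# Transcoding *from* ASCII-8BIT fails: 0xE0 has no defined character there.
error_class = begin
  binary.encode('ISO-8859-1', 'ASCII-8BIT')
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Naming the actual source encoding makes the conversion well defined,
# regardless of the string's current ASCII-8BIT label.
utf8  = binary.encode('UTF-8', 'ISO-8859-1')     # à becomes bytes C3 A0
utf16 = binary.encode('UTF-16LE', 'ISO-8859-1')  # à becomes bytes E0 00
```

If the target is ISO-8859-1 itself and the bytes are already in that encoding, no transcoding is needed at all; force_encoding('ISO-8859-1') is enough.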

Upvotes: 3
