Draco

Reputation: 337

UTF-8 conversion works with Iconv but not with String#encode

I had this with Iconv:

git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log

Now I want to switch to String#encode because of the deprecation warnings, but this doesn't work:

git_log = git_log.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')

I currently use Iconv here, and it still works:

https://github.com/gamersmafia/gamersmafia/blob/master/lib/formatting.rb#L244

But when I replace these lines with the String#encode call, the first gsub raises an "invalid byte sequence in UTF-8" error.

Do you know why?

Upvotes: 3

Views: 1500

Answers (2)

s2t2

Reputation: 2696

Try the following approach, which removes a character from a string if the character is mal-encoded:

# Keep only the characters that are valid in the string's current encoding.
# each_char yields every invalid byte as its own one-byte character, so
# dropping the characters that fail valid_encoding? removes exactly those bytes.
mystring = mystring.each_char.select(&:valid_encoding?).join

Upvotes: 0

matt

Reputation: 79743

In your call to String#encode you don't specify a source encoding. Ruby uses the string's current encoding as the source, which appears to be UTF-8, and according to the docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
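
A small reproduction of the failure, as a sketch only: the sample string is made up for illustration, and force_encoding is used here just to simulate ISO-8859-1 bytes that Ruby believes are UTF-8.

```ruby
# Hypothetical sample: "café" as ISO-8859-1 bytes (é is the single byte 0xE9),
# tagged as UTF-8 to simulate the situation described above.
git_log = "caf\xE9".b.force_encoding(Encoding::UTF_8)

git_log.valid_encoding?   # false: a lone 0xE9 byte is not valid UTF-8

# Regexp-based methods such as gsub have to interpret the bytes, so they raise.
error_message = nil
begin
  git_log.gsub(/a/, 'b')
rescue ArgumentError => e
  error_message = e.message   # mentions "invalid byte sequence in UTF-8"
end
```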

String#encode has a form that accepts the source encoding as the second parameter, so you can specify it explicitly, much as you do with Iconv. Try this:

git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')

You could also use the ! form in this case, which has the same effect:

git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
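
To check the result, something like the following sketch can be used. The sample string is hypothetical and stands in for the real git output; it assumes the input really is ISO-8859-1.

```ruby
# Hypothetical sample: "café" as ISO-8859-1 bytes, mislabelled as UTF-8.
git_log = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# Naming the source encoding makes Ruby transcode the bytes from ISO-8859-1
# instead of trusting the string's current (wrong) UTF-8 label.
converted = git_log.encode(Encoding::UTF_8,
                           Encoding::ISO_8859_1,
                           :invalid => :replace,
                           :undef => :replace,
                           :replace => '')

converted.valid_encoding?        # true
converted == "caf\u00E9"         # true: é is now the two bytes 0xC3 0xA9
replaced = converted.gsub(/a/, 'b')   # no longer raises
```

Every byte is a defined character in ISO-8859-1, so with that source encoding the :invalid and :undef options should rarely come into play; they matter more when the source encoding can contain invalid sequences.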

Upvotes: 6
