Reputation: 337
I had this with Iconv:
git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log
Now I want to switch to String#encode because of the deprecation warnings, but this doesn't work:
git_log = git_log.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
I used to use Iconv here, and it's still working:
https://github.com/gamersmafia/gamersmafia/blob/master/lib/formatting.rb#L244
But when I replace this line with String#encode, the first gsub raises an "invalid byte sequence in UTF-8" error.
Do you know why?
Upvotes: 3
Views: 1500
Reputation: 2696
Try the following approach, which removes a character from a string if the character is mal-encoded:
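# Collect the index of every character that does not survive the conversion unchanged.
invalid_character_indices = []
mystring.each_char.with_index do |char, i|
  invalid_character_indices << i unless char == char.encode(Encoding::UTF_8, Encoding::ISO_8859_1, :invalid => :replace, :undef => :replace, :replace => "")
end
# Remove from the end backwards so the earlier indices stay valid, and remove
# by position so other occurrences of the same character are left alone.
invalid_character_indices.reverse_each do |i|
  mystring.slice!(i)
end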
Upvotes: 0
Reputation: 79743
In your call to String#encode you don't specify a source encoding. Ruby is using the string's current encoding as the source, which appears to be UTF-8, and according to the docs:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
In other words, the call has no effect and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
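The explanation above can be reproduced in isolation. This is only a minimal sketch: the sample string "caf\xE9" is a made-up stand-in for the git log output, assumed to hold ISO-8859-1 bytes while being tagged as UTF-8.

```ruby
# Hypothetical sample: "café" as ISO-8859-1 bytes, wrongly tagged as UTF-8.
raw = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# The lone 0xE9 byte is not a valid UTF-8 sequence.
tagged_valid = raw.valid_encoding?

# A regexp operation such as gsub refuses to work on the broken string.
gsub_error = begin
  raw.gsub(/f/, 'F')
  nil
rescue ArgumentError => e
  e.message
end

# Naming the real source encoding reinterprets the bytes and converts them.
converted = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

puts tagged_valid
puts gsub_error
puts converted
```

Checking valid_encoding? before calling gsub is a quick way to confirm whether a string's encoding tag matches its bytes.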
String#encode has a form that accepts the source encoding as the second parameter, so you can specify it explicitly, similar to what you are doing with Iconv. Try this:
git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')
You could also use the ! form, encode!, in this case. It has the same effect but modifies the string in place:
git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
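To compare the two forms side by side, here is a small sketch. It again assumes a made-up sample string holding ISO-8859-1 bytes tagged as UTF-8; the variable names are only for illustration.

```ruby
# Same options as in the calls above, kept in a local variable for reuse.
options = { :invalid => :replace, :undef => :replace, :replace => '' }

# Hypothetical sample: ISO-8859-1 bytes tagged as UTF-8.
sample   = "caf\xE9".b.force_encoding(Encoding::UTF_8)
in_place = sample.dup

# encode returns a new string and leaves the receiver untouched.
copy = sample.encode(Encoding::UTF_8, Encoding::ISO_8859_1, **options)

# encode! converts the receiver itself.
in_place.encode!(Encoding::UTF_8, Encoding::ISO_8859_1, **options)

puts copy == in_place
puts sample.valid_encoding?
puts in_place.valid_encoding?
```

The non-bang form is the safer choice when other code still holds a reference to the original string.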
Upvotes: 6