msdundar

Reputation: 383

Ruby 2.1.5 - ArgumentError: invalid byte sequence in UTF-8

I'm having trouble with UTF-8 characters in Ruby 2.1.5 and Rails 4.

The data that comes from an external service looks like this:

"first_name"=>"ezgi \xE7enberci"
"last_name" => "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"

These are mostly Turkish alphabet characters such as "üğşiçö". When the application tries to save this data, the following errors occur:

ArgumentError: invalid byte sequence in UTF-8
Mysql2::Error: Incorrect string value

How can I fix this?

Upvotes: 0

Views: 1198

Answers (2)

Todd A. Jacobs

Reputation: 84343

What's Wrong

Ruby thinks you have invalid byte sequences because your strings aren't UTF-8. For example, using the rchardet gem:

require 'rchardet'
["ezgi \xE7enberci", "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"].map do |str|
  CharDet.detect(str) # guessed encoding plus a confidence score
end

#=> [{"encoding"=>"ISO-8859-2", "confidence"=>0.8600826867857209}, {"encoding"=>"windows-1255", "confidence"=>0.5807177322740268}]

How to Fix It

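You need to clean up your strings first, either with String#scrub (which replaces the invalid bytes, so the original characters are lost) or with one of the encoding methods like String#encode! (which converts them). For example: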

hash = {"first_name"=>"ezgi \xE7enberci",
        "last_name"=>"\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"}
# Re-encode each value in place, treating the source bytes as ISO-8859-2.
hash.each_value { |v| v.encode!("UTF-8", "ISO-8859-2") }
#=> {"first_name"=>"ezgi çenberci", "last_name"=>"üţçđiţţöç"}

You may need to experiment a bit to figure out what the proper source encoding is (e.g. ISO-8859-2, windows-1255, or something else entirely). Whichever it turns out to be, keeping your data set in one consistent encoding is critical.

Character encoding detection is imperfect. Your best bet will be to try to find out what encoding your external data source is using, and use that in your string encoding rather than trying to detect it automatically. Otherwise, your mileage may vary.
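If you can't find out the encoding from the data source, one way to experiment is to decode the same bytes under a few candidate encodings and see which result reads correctly. This is only a rough sketch: decode_candidates and the CANDIDATES list are hypothetical names chosen for illustration, and the list itself is an assumption about what the service might be sending.

```ruby
# Hypothetical helper: decode the same raw bytes under several candidate
# encodings so a human can pick the one that reads correctly.
# The candidate list is an assumption, not something the data tells us.
CANDIDATES = %w[ISO-8859-2 ISO-8859-9 Windows-1254 Windows-1255].freeze

def decode_candidates(raw, candidates = CANDIDATES)
  candidates.each_with_object({}) do |name, results|
    begin
      # Relabel a copy of the bytes as `name`, then transcode to UTF-8.
      results[name] = raw.dup.force_encoding(name).encode("UTF-8")
    rescue EncodingError => e
      # Some bytes have no meaning in this candidate encoding.
      results[name] = "(failed: #{e.class})"
    end
  end
end

# Sample bytes from the question, taken as a binary copy of the literal.
raw = "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7".b
decode_candidates(raw).each { |name, text| puts "#{name}: #{text}" }
```

For the sample above, the Windows-1254 and ISO-8859-9 candidates decode to "üşçğişşöç", which matches the Turkish letters mentioned in the question, while ISO-8859-2 gives "üţçđiţţöç".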

Upvotes: 2

Frederick Cheung

Reputation: 84114

That doesn't look like UTF-8 data, so this exception is expected. It sounds like you need to tell Ruby what encoding the string is actually in:

some_string.force_encoding("windows-1254")

You can then convert to UTF-8 with the encode method. If you're getting a mix of encodings, there are gems (e.g. charlock_holmes) that use heuristics to detect the encoding automatically.
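Put together, the two steps might look something like the sketch below. It assumes the service really does send Windows-1254; to_utf8 and SOURCE_ENCODING are hypothetical names used only for illustration, and the hash stands in for whatever params you actually receive.

```ruby
# Assumption: the external service sends Windows-1254 (Turkish) bytes.
SOURCE_ENCODING = "Windows-1254"

# Hypothetical helper: relabel the bytes with their real encoding, then
# transcode to UTF-8. Bytes with no mapping are replaced rather than raising.
def to_utf8(raw, source = SOURCE_ENCODING)
  raw.dup
     .force_encoding(source)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

# Sample data from the question, as binary copies of the literals.
params = {
  "first_name" => "ezgi \xE7enberci".b,
  "last_name"  => "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7".b
}

# Build a new hash with converted values (works on Ruby 2.1, which has
# no Hash#transform_values).
clean = params.each_with_object({}) { |(key, value), out| out[key] = to_utf8(value) }

clean.each { |key, value| puts "#{key}: #{value} (valid: #{value.valid_encoding?})" }
```

Converting once, right where the data enters the application, means everything downstream (ActiveRecord, MySQL) only ever sees valid UTF-8.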

Upvotes: 1
