Reputation: 383
I'm having trouble with UTF-8 characters in Ruby 2.1.5 and Rails 4.
The data that comes from an external service looks like this:
"first_name"=>"ezgi \xE7enberci"
"last_name" => "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"
The strings mostly contain Turkish characters such as "üğşiçö". When the application tries to save this data, the following errors occur:
ArgumentError: invalid byte sequence in UTF-8
Mysql2::Error: Incorrect string value
How can I fix this?
Upvotes: 0
Views: 1198
Reputation: 84343
Ruby thinks you have invalid byte sequences because your strings aren't UTF-8. For example, using the rchardet gem:
require 'chardet'
["ezgi \xE7enberci", "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"].map do |str|
  # Return the detection result so map collects it (puts would return nil)
  CharDet.detect(str)
end
#=> [{"encoding"=>"ISO-8859-2", "confidence"=>0.8600826867857209}, {"encoding"=>"windows-1255", "confidence"=>0.5807177322740268}]
You need to use String#scrub or one of the encoding methods like String#encode! to clean up your strings first. For example:
hash = {"first_name"=>"ezgi \xE7enberci",
"last_name"=>"\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7"}
# Reinterpret each value's bytes as ISO-8859-2 and transcode it to UTF-8 in place
hash.each_value { |v| v.encode!("UTF-8", "ISO-8859-2") }
#=> {"first_name"=>"ezgi çenberci", "last_name"=>"üţçđiţţöç"}
Obviously, you may need to experiment a bit to figure out what the proper encoding is (e.g. ISO-8859-2, windows-1255, or something else entirely) but ensuring that you have a consistent encoding of your data set is going to be critical for you.
Character encoding detection is imperfect. Your best bet will be to try to find out what encoding your external data source is using, and use that in your string encoding rather than trying to detect it automatically. Otherwise, your mileage may vary.
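One way to do the experimenting described above is to decode the same bytes under a few candidate encodings and compare the results by eye. This is only a minimal sketch: `decode_candidates` is a hypothetical helper, and the candidate list is an assumption based on the data being Turkish, not something the external service has confirmed.

```ruby
# Hypothetical helper: decode the same raw bytes under several candidate
# encodings so a human can judge which result reads correctly.
def decode_candidates(bytes, encodings)
  encodings.each_with_object({}) do |name, results|
    begin
      # encode(dst, src) interprets the bytes as `src` and transcodes to `dst`.
      results[name] = bytes.encode("UTF-8", name)
    rescue EncodingError => e
      # Unknown encoding names or unmappable bytes end up here.
      results[name] = "(failed: #{e.class})"
    end
  end
end

# Sample value from the question, kept as raw bytes.
sample = "\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7".b

# Assumed candidates: the detector's guess plus two common Turkish encodings.
CANDIDATES = %w[ISO-8859-2 ISO-8859-9 Windows-1254]

decode_candidates(sample, CANDIDATES).each do |name, text|
  puts "#{name}: #{text}"
end
```

For this sample, the two Turkish encodings produce "üşçğişşöç", while ISO-8859-2 produces "üţçđiţţöç", which suggests the detector's first guess was close but not right.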
Upvotes: 2
Reputation: 84114
That doesn't look like UTF-8 data, so this exception is expected. It sounds like you need to tell Ruby what encoding the string is actually in:
some_string.force_encoding("windows-1254")
You can then convert to UTF-8 with the encode method. There are gems (e.g. charlock_holmes) that use heuristics to auto-detect encodings, which helps if you're getting a mix of encodings.
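Put together, the two steps might look like the sketch below. `to_utf8` is a hypothetical helper name, and it assumes the service really does send Windows-1254; swap in the actual encoding once you know it.

```ruby
# Hypothetical helper: relabel raw bytes with their real encoding, then
# transcode to UTF-8. Windows-1254 is an assumed default, not a confirmed one.
def to_utf8(raw, source_encoding = "Windows-1254")
  # dup leaves the caller's string untouched; force_encoding only changes the
  # encoding label, and encode does the actual byte conversion.
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

first_name = to_utf8("ezgi \xE7enberci")
last_name  = to_utf8("\xFC\xFE\xE7\xF0i\xFE\xFE\xF6\xE7")

puts first_name  # ezgi çenberci
puts last_name   # üşçğişşöç
```

Doing the conversion once, at the point where the data enters the application, keeps everything downstream (ActiveRecord, MySQL) working with valid UTF-8 only.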
Upvotes: 1