plotti

Reputation: 137

UTF-8 ruby encoding

I've got this string: WinterIDäSchwiiz, which comes from an API, and I want to search for it in the database. It turns out that this string is encoded differently from the version saved in my database, yet Ruby says the encoding of both is UTF-8. What is going on?

I've found a terrible way to fix the problem: go down to the byte sequence, replace the bytes representing the "ä" with a different byte sequence, and then force the encoding to UTF-8. It works but hurts my eyes. Does anyone have a better solution than:

 "WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i}.pack('C*').force_encoding('utf-8')

Upvotes: 1

Views: 502

Answers (1)

Jordan Running

Reputation: 106027

Your string is UTF-8.

I can tell because your fix is to replace the bytes (97, 204, 136) with the bytes (195, 164).

The first byte you're replacing, 97 (0x61), is the UTF-8 character a. The second two bytes, 204 and 136 (0xCC 0x88), are the bytes for the UTF-8 character U+0308, the combining diaeresis: ̈. The two characters combine to form ä (a followed by U+0308).

The bytes you're expecting are 195 and 164 (0xC3 0xA4) which, together, are U+00E4, or Latin small letter "a" with diaeresis.

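Both are UTF-8. One prints ä (two code points, a followed by U+0308) and the other prints ä (the single code point U+00E4), and the two look identical on screen. This is an example of Unicode equivalence.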

In other words:

str1 = "a\xCC\x88"
puts str1 # => ä
p str1.bytes # => [97, 204, 136]
p str1.encoding # => #<Encoding:UTF-8>

str2 = "\xC3\xA4"
puts str2 # => ä
p str2.bytes # => [195, 164]
p str2.encoding # => #<Encoding:UTF-8>

Fortunately, we have Unicode normalization to help deal with this. This is a big topic, but the very, very insufficient TL;DR is that the Unicode consortium has prescribed standard ways to normalize strings like the above, i.e. how to turn str1 into str2.

Unfortunately, it's impossible to say what the best solution for you is, since you didn't provide any details. Your database might have built-in normalization functionality, but I don't know what database you're using so I can't say. Since you did mention Ruby I can point you to the String#unicode_normalize method, which was introduced in Ruby's standard library in Ruby 2.2:

str1 = "a\xCC\x88"
str2 = "\xC3\xA4"
p str1 == str2 # => false

str1_normalized = str1.unicode_normalize

p str1_normalized == str2
# => true
p str1_normalized.bytes == str2.bytes
# => true
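
Applied to the original problem, one option is to normalize the API string before using it in a lookup. This is only a minimal sketch: it assumes the database stores the precomposed (NFC) form, and normalize_for_lookup is a hypothetical helper name, not part of Ruby or any library.

```ruby
# Hypothetical helper: bring an incoming string into NFC before it is
# compared with, or searched for in, stored data (assumed here to be NFC).
def normalize_for_lookup(str)
  str.unicode_normalized?(:nfc) ? str : str.unicode_normalize(:nfc)
end

from_api = "WinterIDa\u0308Schwiiz"  # "a" followed by U+0308 (decomposed)
stored   = "WinterID\u00E4Schwiiz"   # U+00E4 (precomposed)

p from_api == stored                        # => false
p normalize_for_lookup(from_api) == stored  # => true
```

If the stored data turns out to be in decomposed form instead, passing :nfd to unicode_normalize goes in the other direction.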

If you don't have Ruby 2.2+, well... upgrade. But if you can't upgrade for some reason, you can use ActiveSupport::Multibyte::Unicode.normalize (available in older Rails versions; it was later deprecated in favor of String#unicode_normalize), which is especially convenient if you're using Rails, or the Unicode gem.

One more thing

You don't need this, since unicode_normalize is the correct way to do Unicode normalization in Ruby, but a much simpler way to write this:

"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i }.pack('C*').force_encoding('utf-8')

...would have been this:

"WinterIDäSchwiiz".gsub("a\xCC\x88", "\xC3\xA4")

Any time you see something like join(",")...split(",") in Ruby it's almost certainly the wrong solution.
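
To illustrate why normalization is still preferable to the gsub one-liner, here is a small sketch with made-up input containing three decomposed umlauts. The gsub substitution only handles the one sequence it was told about, while unicode_normalize composes every sequence that has a precomposed form.

```ruby
# Illustrative input: each base letter is followed by U+0308 (combining diaeresis).
decomposed = "a\u0308 o\u0308 u\u0308"

# The substitution only knows about "a" + U+0308.
patched = decomposed.gsub("a\u0308", "\u00E4")

# Normalization to NFC composes all three sequences.
normalized = decomposed.unicode_normalize(:nfc)

p patched.codepoints.length             # => 7 (o and u are still decomposed)
p normalized.codepoints.length          # => 5
p normalized == "\u00E4 \u00F6 \u00FC"  # => true
```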

Upvotes: 2
