Nick Ginanto

Reputation: 32130

Handling encoding in Ruby

I have a good string and a bad string

to handle a bad string I do

bad.encode("iso-8859-1").force_encoding("utf-8")

which makes it readable

if I do

good.encode("iso-8859-1").force_encoding("utf-8")

I get Encoding::UndefinedConversionError: U+05E2 from UTF-8 to ISO-8859-1

Both the good and the bad strings start out as UTF-8, but the good strings are readable and the bad ones are, well, bad.

I don't know how to detect whether a string is good or not, and I am trying to find a way to process a string so that it becomes readable in the correct encoding,

something like this:

if needs_fixin?(str)
  str.encode("iso-8859-1").force_encoding("utf-8")
else
  str
end

The only thing I can think of is to catch the exception and skip the encoding-fixing part, but I don't want the code to rely on exceptions intentionally.

something like str.try(:encode, "iso-8859-1").force_encoding("utf-8") rescue str
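Spelled out as a method, that fallback would be something like this (fix_encoding is just a placeholder name; it still leans on the exception, which is what I'd like to avoid):

def fix_encoding(str)
  # a double-encoded string survives the round trip; a good one raises
  str.encode("iso-8859-1").force_encoding("utf-8")
rescue Encoding::UndefinedConversionError
  str
end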

A bad string looks something like this:

×¢×××× ×¢×¥ ×'××¤×¡× ×פת×ר ×× ××רק××

Upvotes: 1

Views: 732

Answers (1)

Chris Heald

Reputation: 62648

I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is that it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.

This produces a double-encoded string with UTF-8 characters:

> str = "汉语 / 漢語"
 => "汉语 / 漢語"
> str.force_encoding("iso-8859-1")
 => "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.force_encoding("iso-8859-1").encode("utf-8")
 => "æ±\u0089语 / æ¼¢èª\u009E"

You can then fix it by reinterpreting the double-encoded UTF-8 as ISO-8859-1 and then declaring the encoding to actually be UTF-8:

> bad.encode("iso-8859-1").force_encoding("utf-8")
 => "汉语 / 漢語"

But you can't convert an actual UTF-8 string into ISO-8859-1, since there are codepoints in UTF-8 which ISO-8859-1 has no unambiguous means of encoding:

> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1

Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."
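For instance, assuming a UTF-8 source, the string "Ã©" might be a perfectly correct two-character string, or it might be a double-encoded "é"; the resulting bytes are identical, so no check can tell them apart:

> legit = "Ã©"
 => "Ã©"
> double = "é".force_encoding("iso-8859-1").encode("utf-8")
 => "Ã©"
> legit == double
 => true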

So the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertible bytes:

> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
 => " / "

The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.

> str_bom = "\xEF\xBB\xBF" + str
 => "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
 => true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
 => false

If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking whether the BOM is present. If it's not (i.e., the string has been re-encoded), then you can perform your decoding routine:

> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
 => true
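Wrapped up as a helper, that BOM check might look something like this (fix_if_double_encoded is just an illustrative name, and it assumes every proper string was prefixed with the BOM up front):

BOM = "\xEF\xBB\xBF"

def fix_if_double_encoded(str)
  return str if str.start_with?(BOM)            # BOM intact: the string was never re-encoded
  repaired = str.encode("iso-8859-1").force_encoding("utf-8")
  repaired.start_with?(BOM) ? repaired : str    # accept the repair only if the BOM reappears
rescue Encoding::UndefinedConversionError
  str                                           # can't round-trip at all; leave it alone
end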

If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, by counting unprintable characters or characters which fall outside of your normal expected result set. Your string looks like it's dealing with Hebrew, so you could say, for example, that any string consisting of more than 50% non-Hebrew letters is double-encoded, and then attempt to decode it.
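A rough sketch of that heuristic (needs_fixin? echoes the name from the question, and the 50% threshold is an arbitrary pick), which slots straight into the pseudocode from the question:

def needs_fixin?(str)
  letters = str.scan(/\p{L}/)                   # every alphabetic character
  return false if letters.empty?
  hebrew = letters.count { |c| c =~ /\p{Hebrew}/ }
  hebrew.fdiv(letters.size) < 0.5               # mostly non-Hebrew letters: probably mangled
end

needs_fixin?(str) ? str.encode("iso-8859-1").force_encoding("utf-8") : str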

Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:

str = "汉语 / 漢語"
begin
  str.encode("iso-8859-1").encode("utf-8")
rescue Encoding::UndefinedConversionError
  str
end

However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:

> bad_str = str.force_encoding("windows-1252").encode("utf-8")
 => "汉语 / 漢語"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1

Since the string itself doesn't carry any information about the encoding it was incorrectly converted from, you don't have enough information to solve this reliably; you're left with iterating through a list of the most likely encodings and checking the result of each successful re-encode against your Hebrew heuristic.
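A last-ditch search along those lines could be sketched like this (the candidate list is only a guess, with iso-8859-8 thrown in because of the Hebrew, and needs_fixin? is the heuristic sketched above):

CANDIDATES = ["iso-8859-1", "windows-1252", "iso-8859-8"]

def best_guess_repair(str)
  CANDIDATES.each do |enc|
    begin
      repaired = str.encode(enc).force_encoding("utf-8")
    rescue EncodingError
      next                                      # this encoding can't represent the string; try the next
    end
    return repaired if repaired.valid_encoding? && !needs_fixin?(repaired)
  end
  str                                           # nothing looked better; hand back the original
end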

To echo the post I linked: character encodings are hard.

Upvotes: 5
