Reputation: 3866

Convert non-ASCII chars from ASCII-8BIT to UTF-8

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.\n

With the copyright symbol written out as escaped bytes, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n

Ruby tells me that string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8
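A minimal sketch that reproduces this error outside Rails. The `label` string is a hypothetical stand-in for any UTF-8 text the app combines with the fetched data. The error only appears when both strings contain non-ASCII bytes; if either side is pure ASCII, Ruby joins them without complaint.

```ruby
# Bytes as they arrive from the remote site: tagged binary (ASCII-8BIT).
fetched = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n".dup.force_encoding('ASCII-8BIT')

# Hypothetical UTF-8 text from the app side; it contains a non-ASCII character too.
label = "R\u00E9f\u00E9rence: "

error = nil
begin
  label + fetched
rescue Encoding::CompatibilityError => e
  error = e
end

puts error.class
```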

I can strip the copyright code out using this regex

str.gsub(/[^\x00-\x7F]/n,'?')

to produce this

Cancer Res; 71(3); 1-11. ??2011 AACR.\n

But how can I get a copyright symbol (and various other symbols such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding but this does not work:

str.force_encoding('utf-8').encode

I realize there are many other people with similar issues but I've yet to see a solution that works.

Upvotes: 53

Answers (4)

Jared Menard

Reputation: 2756

I've been having issues with character encoding, and the other answers have been helpful but didn't work for every case. Here's the solution I came up with: it forces the encoding when possible and transcodes, replacing bad bytes with '?', when not.

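  # Relabel the bytes as UTF-8; fall back to replacing bad bytes with '?'.
  # Note: force_encoding modifies the string that was passed in.
  def encode str
    encoded = str.force_encoding('UTF-8')
    unless encoded.valid_encoding?
      encoded = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
    end
    encoded
  end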

force_encoding works most of the time, but I've encountered some strings where that fails. Strings like this will have invalid characters replaced:

 str = "don't panic: \xD3"
 str.valid_encoding?
 #=> false
 str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
 #=> "don't panic: ?"
 str.valid_encoding?
 #=> true

Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code works for you like you need it to. Once I come up with version 2 I'll update this answer.
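One possible variant, sketched here as an assumption rather than the promised version 2: it works on a copy so the caller's string is not modified, and it uses `String#scrub`, which requires Ruby 2.1 or later. The helper name `to_utf8_lossy` is hypothetical. The checks at the end are the kind of known-problem-text tests recommended above.

```ruby
# Hypothetical helper: returns a new UTF-8 string and leaves the argument untouched.
def to_utf8_lossy(str)
  copy = str.dup.force_encoding('UTF-8')
  # scrub replaces each invalid byte sequence with the given replacement string.
  copy.valid_encoding? ? copy : copy.scrub('?')
end

bad  = "don't panic: \xD3".dup.force_encoding('ASCII-8BIT')
good = "\xC2\xA92011 AACR".dup.force_encoding('ASCII-8BIT')

puts to_utf8_lossy(bad)   # the stray \xD3 byte becomes '?'
puts to_utf8_lossy(good)  # valid UTF-8 bytes pass through unchanged
puts bad.encoding         # the original is still tagged as binary
```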

Upvotes: 2

Jason Heiss

Reputation: 671

There are two possibilities:

1. The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In that case you just need to tell Ruby that the data is already UTF-8 using force_encoding.

   For example, "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data, and "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.

2. The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary; it isn't a real encoding), then tell Ruby to transcode it.

   For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1'). And this would demonstrate that you can get Ruby to transcode that to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8').
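The two cases can be put side by side in a short sketch. The variable names are illustrative only. The point is that force_encoding changes only the label on the bytes, while encode rewrites the bytes themselves.

```ruby
# Case 1: the bytes are already UTF-8; only the label is wrong.
raw        = "\xC2\xA9".dup.force_encoding('ASCII-8BIT')
relabelled = raw.dup.force_encoding('UTF-8')

# Case 2: the bytes are ISO-8859-1 and have to be transcoded.
latin1     = "\xA9".dup.force_encoding('ISO-8859-1')
transcoded = latin1.encode('UTF-8')

p raw.bytes == relabelled.bytes     # relabelling leaves the bytes alone
p latin1.bytes == transcoded.bytes  # transcoding changes them
p relabelled == transcoded          # both end up as the same UTF-8 string
```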

Upvotes: 34

Achilles

Reputation: 766

I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:

doc = open(DATA_URL)
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))

I believe that was Ruby 1.8.7; I'm not sure how things are with Ruby 1.9.
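On Ruby 1.9 and later, the built-in String#encode can do the same conversion without Iconv. A minimal sketch, assuming the page really is Windows-1253; the `page_bytes` literal is a made-up stand-in for the downloaded body.

```ruby
# Stand-in for the downloaded page body: three Greek letters as
# Windows-1253 bytes, tagged binary the way raw network data would be.
page_bytes = "\xE1\xE2\xE3".dup.force_encoding('ASCII-8BIT')

# Two-argument form: encode(destination, source) reads the bytes as Windows-1253.
utf8 = page_bytes.encode('UTF-8', 'Windows-1253')

puts utf8
puts utf8.encoding
```

The result can then be handed to the HTML parser in place of the Iconv.conv output.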

Upvotes: 6

Phrogz

Reputation: 303208

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>

Upvotes: 79
