joost

Reputation: 6659

Encoding::UndefinedConversionError when using open-uri

When I do this:

require 'open-uri'
response = open('some-html-page-url-here')
response.read

For one particular URL I get the following error (due to a wrong encoding in the returned page?):

Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII

Any way around this to still get the html content?

Upvotes: 1

Views: 2228

Answers (3)

7stud

Reputation: 48599

In the introduction to the open-uri module, the docs say this:

It is possible to open an http, https or ftp URL as though it were a file

If you know anything about reading files, you know that you need the encoding of the file you are trying to read. The encoding tells Ruby how to read the file, i.e. how many bytes each character occupies.
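To make that concrete, here is a minimal sketch (not taken from the docs; the variable names are made up for illustration) that reads the same two bytes under two different encodings:

```ruby
# Two raw bytes with no meaningful encoding attached yet.
raw = "\xC3\xA9".b

# Read as UTF-8, the two bytes form a single character (U+00E9).
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# Read as ISO-8859-1, each byte is a character of its own (U+00C3, U+00A9).
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

puts as_utf8.length
puts as_latin1.length
```

The bytes never change; only the label Ruby uses to group them into characters does.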

In the first code example in the docs, there is this:

  open("http://www.ruby-lang.org/en") {|f|
    f.each_line {|line| p line}
    p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
    p f.content_type     # "text/html"
    p f.charset          # "iso-8859-1"
    p f.content_encoding # []
    p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
  }

So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding is different from your default external encoding, you will most likely get an error. The default external encoding is the one Ruby uses when reading from external sources. You can check what it is set to like this:

The default external Encoding is pulled from your environment...Have a look:

$ echo $LC_CTYPE
en_US.UTF-8

or

$ ruby -e 'puts Encoding.default_external.name'
UTF-8

http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings

On Mac OS X, I actually have to do the following to see the default external encoding:

$ echo $LANG
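
The same information is available from inside a script. The following sketch uses only core Ruby; it prints both defaults and then reproduces the error from the question by transcoding a no-break space to US-ASCII (the variable name error_message is just for illustration):

```ruby
# Encoding Ruby assumes for data read from outside (files, sockets, ...).
puts Encoding.default_external.name

# Encoding Ruby transcodes to on read, or nil when no transcoding is requested.
p Encoding.default_internal

# U+00A0 has no US-ASCII equivalent, so transcoding it raises.
error_message =
  begin
    "\u00A0".encode(Encoding::US_ASCII)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.message
  end

puts error_message
```

If either default is US-ASCII on your machine, that is a likely source of the conversion in the question.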

You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:

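  html = open('some_url_here') do |f|
    # Process-wide setting: it also affects IO objects opened later.
    Encoding.default_external = f.charset
    f.read   # the block's value is assigned to html
  end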
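
Because Encoding.default_external= changes a process-wide setting, an alternative is to convert only the string you just read. The sketch below is an illustration under assumptions, not part of open-uri: to_utf8 is a hypothetical helper, the byte string stands in for what f.read returns, and the charset name stands in for what f.charset reports.

```ruby
# Hypothetical helper: label raw bytes with the charset the server reported,
# then transcode them to UTF-8.
def to_utf8(raw, charset_name)
  source = Encoding.find(charset_name) # raises ArgumentError for unknown names
  raw.dup.force_encoding(source).encode(Encoding::UTF_8)
end

# 0xA0 is the no-break space in ISO-8859-1.
body = "price:\xA0100".b
html = to_utf8(body, "iso-8859-1")

puts html.encoding
puts html.valid_encoding?
```

This only works if the reported charset is actually correct; a page that declares one charset and sends another will still produce wrong characters.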

Setting an IO object to binmode, as you have done, tells Ruby that the encoding of the file is BINARY (or Ruby's confusing synonym, ASCII-8BIT), which means every byte is treated as a separate character. In your case, the character U+00A0, whose UTF-8 representation is the two bytes 0xC2 0xA0, is read as two characters instead of one. You have eliminated the error, but you have produced two junk characters in place of the original character.
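
A short sketch of that effect, using a plain string rather than a real HTTP response (String#b here merely plays the role of a binmode read):

```ruby
nbsp = "\u00A0" # no-break space, UTF-8 encoded

p nbsp.bytes    # the two bytes 0xC2 and 0xA0
p nbsp.length   # one character

binary = nbsp.b # same bytes, labelled ASCII-8BIT
p binary.length # two one-byte "characters"

# The bytes themselves are intact, so relabelling them restores the text.
restored = binary.dup.force_encoding(Encoding::UTF_8)
p restored == nbsp
```

So the content is not lost, only mislabelled: if you know the real charset, force_encoding on the binary string gets the original character back.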

Upvotes: 3

Delong Gao

Reputation: 89

Had the same issue, will add my solution here:

After reading the open-uri documentation further, it turns out you can set the encoding of the IO object before reading, using the set_encoding method, like this:

result = open('some-page-uri') do |io|
  io.set_encoding(Encoding.default_external)
  io.read
end
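
The effect of set_encoding can be tried without a network request. In this sketch a StringIO stands in for the object that open hands to the block, and UTF-8 is assumed to be the page's real charset; with a live response, io.charset would be the value to check.

```ruby
require 'stringio'

# The UTF-8 bytes of "caf\u00E9", deliberately labelled as binary.
io = StringIO.new("caf\xC3\xA9".b)

# Tell the IO object how its bytes should be interpreted before reading.
io.set_encoding(Encoding::UTF_8)
text = io.read

puts text.encoding
puts text.length
```

Note that set_encoding only relabels the bytes; passing an encoding that does not match the page still yields wrong characters.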

Hope it helps!

Upvotes: 2

joost

Reputation: 6659

Doing a response.binmode before the response.read stops the error from happening.

Upvotes: 2
