castiel

Reputation: 2783

Hpricot: invalid byte sequence in UTF-8

I have already searched, but nothing I found solves this peculiar, unexpected problem. Take a look at the code below:

require 'open-uri'
require 'hpricot'
doc = Hpricot(open("http://www.baidu.com/")) #this web page's encoding is GB2312

I don't know what is going on here. You can run this in irb to see whether you get the same problem.

It just raises "ArgumentError: invalid byte sequence in UTF-8".

I have tried converting the original HTML to UTF-8 with Iconv, but it still doesn't work.

I really don't know what to do now. Please help.

Upvotes: 1

Views: 1092

Answers (2)

Dipak Panchal

Reputation: 6036

Hpricot UTF-8 issue: invalid byte sequence in UTF-8 (ArgumentError)

require 'hpricot'
require 'open-uri'

doc = open('http://www.amazon.co.jp/') {|f| Hpricot(f.read) }
puts doc.to_html

# Encode the page content as UTF-8 before handing it to Hpricot
open('http://www.amazon.co.jp/') {|f| Hpricot(f.read.encode("UTF-8")) }
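
One caveat: if the string is already labeled UTF-8, calling encode("UTF-8") with no source encoding may not change anything. A minimal sketch of a more explicit variant follows, assuming you already know the page's real encoding (GB2312 for the page in the question). The helper name to_utf8 is hypothetical and not part of Hpricot or open-uri, and the sample bytes stand in for a downloaded page.

```ruby
# Hypothetical helper (not part of Hpricot or open-uri): convert raw bytes
# from a known source encoding into UTF-8.
def to_utf8(bytes, source_encoding)
  # Naming the source encoding explicitly avoids relying on whatever label
  # the string happens to carry. Bytes that are invalid, or that have no
  # UTF-8 equivalent, are replaced instead of raising an error.
  bytes.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace)
end

# Sample input: two GB2312 characters followed by one stray byte (0xFF)
# that is not valid GB2312.
raw = [0xB0, 0xD9, 0xB6, 0xC8, 0xFF].pack('C*')

html = to_utf8(raw, 'GB2312')
puts html.encoding.name
puts html.valid_encoding?
```

The resulting string could then be passed to Hpricot in place of the raw f.read output. Replacing bad bytes silently is a design choice: it keeps parsing going at the cost of losing the offending characters.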

Upvotes: 3

emboss

Reputation: 39660

I know how it could work with Net::HTTP (Ruby 1.9.2):

require 'net/http'
require 'uri'

url = URI.parse('http://www.baidu.com')
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/')
}
str = res.body.force_encoding('GB2312')
puts str
puts str.encoding.name # => GB2312

Does that help?
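
Note that force_encoding only relabels the bytes; it does not convert them. If the parser needs UTF-8 input, a further encode step is probably required. Here is a rough sketch of the difference, using four hard-coded GB2312 bytes in place of res.body so that it runs without a network connection (the sample bytes are my own illustration, not taken from the page above).

```ruby
# Four raw bytes that spell two Chinese characters in GB2312.
# pack('C*') yields a binary (ASCII-8BIT) string, much like an
# unlabeled HTTP response body.
raw = [0xB0, 0xD9, 0xB6, 0xC8].pack('C*')

# force_encoding changes the label only; the bytes stay the same.
tagged = raw.dup.force_encoding('GB2312')
puts tagged.encoding.name   # label is now GB2312
puts tagged.bytesize        # still 4 bytes

# encode performs the actual conversion to UTF-8.
utf8 = tagged.encode('UTF-8')
puts utf8.encoding.name
puts utf8.bytesize          # each character takes 3 bytes in UTF-8
puts utf8.valid_encoding?
```

With the body converted this way, the ArgumentError about invalid UTF-8 bytes should no longer apply, since the string really is UTF-8 at that point.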

Upvotes: 0
