gsub :: ArgumentError (invalid byte sequence in UTF-8)

Question

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.

# This is a test测试
div[0].to_html.gsub(/test/, "")

Running it raises this error, with the backtrace pointing at gsub:

ArgumentError (invalid byte sequence in UTF-8)

How can we fix this issue?

Artem Kalinchuk · Accepted Answer

Figured out the issue. Hpricot's to_html calls methods that trigger the error, so cleaning up that one string is not enough: the whole Hpricot document has to be valid UTF-8. Running the raw page through Iconv with //IGNORE before parsing drops any bytes that are not valid UTF-8:

require "iconv"     # provides Iconv
require "open-uri"  # lets open() fetch a URL
ic = Iconv.new("UTF-8//IGNORE", "UTF-8")
doc = open("http://example.com") { |f| Hpricot(ic.iconv(f.read)) }

After that, the whole document is valid UTF-8, so to_html and any later string operations such as gsub run without raising the error.
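
For anyone on a newer Ruby: Iconv is no longer part of the standard library from Ruby 2.0 on, and String#scrub (Ruby 2.1+) can do the same kind of cleanup. The sketch below is not from the original answer. The sample string raw and the helper clean_utf8 are made-up names for illustration, and the input is assumed to be meant as UTF-8 with a few stray bytes in it.

```ruby
# Hypothetical sample: valid text followed by a truncated multibyte
# character, held as raw bytes the way data read from a socket would be.
raw = "This is a test\xE6\xB5".b.force_encoding(Encoding::UTF_8)

# Hypothetical helper: treat the bytes as UTF-8 and drop any byte
# sequence that cannot be decoded.
def clean_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# The uncleaned string reproduces the error from the question.
begin
  raw.gsub(/test/, "")
rescue ArgumentError => e
  warn "before cleaning: #{e.message}"
end

cleaned = clean_utf8(raw)
puts cleaned.valid_encoding?            # true
puts cleaned.gsub(/test/, "").inspect   # "This is a "
```

With Hpricot, the same helper would go where ic.iconv is used above, that is Hpricot(clean_utf8(f.read)). Like //IGNORE, this silently throws away the bad bytes, so it only makes sense when the page really is UTF-8 apart from a few broken sequences.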
