Dan Watson

Reputation: 90

Nokogiri - Encoding Issue - Invalid UTF8 characters

Can someone take a look at this? I think there are invalid UTF-8 characters in the page returned by this call.

Nokogiri::HTML(open("http://www.next.co.uk/x502062s2"))

Is there a way around this? And is this actually the issue? Before anyone says I am doing something a little shifty: I am writing a new open source screen scraper designed to capture product information when a site does not supply a feed :-)

Upvotes: 0

Views: 1197

Answers (1)

sparrovv

Reputation: 7774

Before passing anything to Nokogiri, you can re-encode the content of the page with Iconv and discard all invalid UTF-8 byte sequences.

I was using it like this:

require 'iconv'
require 'open-uri'
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(open('http://example.com').read)

You can also check "Fixing invalid UTF-8 in Ruby, revisited."
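Iconv was removed from the Ruby standard library in 2.0, so on a newer Ruby the same idea can be sketched with String#scrub (available since Ruby 2.1). This is only a minimal sketch, not the method from the answer above: drop_invalid_utf8 is a hypothetical helper name, and the sample bytes are made up to stand in for a downloaded page.

```ruby
# Hypothetical helper (not part of Nokogiri or Iconv): treats the bytes as
# UTF-8 and drops any invalid sequences, much like Iconv's //IGNORE.
def drop_invalid_utf8(bytes)
  # dup so the caller's string is not modified in place
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Made-up sample data: "\xC2\xA3" is a valid UTF-8 pound sign, while the
# trailing "\xE9" is a stray Latin-1 byte that is invalid on its own.
raw = "Price: \xC2\xA315 caf\xE9".b
clean = drop_invalid_utf8(raw)

puts clean.valid_encoding?
puts clean
```

The cleaned string could then be passed on as Nokogiri::HTML(clean, nil, 'UTF-8'). Note that this throws the bad bytes away; if the page is really in another encoding such as ISO-8859-1, transcoding from that encoding would likely preserve more of the text.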

Upvotes: 2
