user1785060

Reputation: 3

Parsing HTML with a weird encoding with Nokogiri

I can't query this document with XPath because the encoding comes out wrong. I hope you can help me with this.

require "nokogiri"
require "open-uri"
link = "http://www.arla.dk/Services/SearchService.asmx/RecipeResult?q=allRecipe&paging=6&include=&exclude=&area=recipeSearch&languageBranch=da"
doc = Nokogiri::HTML(URI.open(link))
doc.xpath("//h2")

The xpath method returns an empty array. It looks like the document has not been parsed correctly. I think this is because the file being parsed contains encoded characters:

&lt;strong&gt;Frokost til 8&lt;/strong&gt;
&lt;ul&gt;&lt;li class='ingHeading'&gt;&lt;strong&gt;&lt;b&gt;Flade

Upvotes: 0

Views: 841

Answers (2)

AJcodez

Reputation: 34156

As stated above, the issue is that the HTML is encoded, which is why you are seeing escape sequences, for example &lt; instead of <. To get around it, unescape the HTML.
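
A minimal sketch of that unescaping step, assuming the standard library's CGI.unescapeHTML is an acceptable decoder here. The sample string is made up for illustration and only imitates the escaped markup from the question:

```ruby
require "cgi"

# Hypothetical sample of the escaped markup carried inside the response.
escaped = "&lt;h2&gt;Frokost til 8&lt;/h2&gt;" \
          "&lt;ul&gt;&lt;li class='ingHeading'&gt;Flade&lt;/li&gt;&lt;/ul&gt;"

# CGI.unescapeHTML converts entities such as &lt; and &gt; back into
# the literal characters < and >.
html = CGI.unescapeHTML(escaped)

puts html
```

The resulting string is ordinary HTML and can be handed to Nokogiri::HTML for XPath queries.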

"How do I encode/decode HTML entities in Ruby? basically suggests using htmlentities.

Upvotes: 0

pguardiario

Reputation: 54984

The response is XML, so first parse it with Nokogiri::XML:

xml = Nokogiri::XML(URI.open(link))

The first string element contains HTML as escaped text, so parse that with Nokogiri::HTML:

doc = Nokogiri::HTML xml.at('string').text

Now you can do your search:

doc.xpath '//h2'

Upvotes: 1
