Issue parsing HTML using Nokogiri

Question

I have some HTML and want to get the content of the body element. However, whatever I try, once the HTML is parsed with Nokogiri, the content of the document head also becomes part of the body element, so when I retrieve the body I see the head content (such as the title) along with the body text.

Hello World

The solution I am using is:

parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html

What I am getting:

about:legacy-compat">

Some title

Hello World

What I am expecting:

Hello World

Any idea what's happening in here?

DiegoSalazar · Accepted Answer

I got your example to work by first cleaning up the original HTML. I removed "about:legacy-compat" from the doctype, which seemed to be confusing Nokogiri:

# clean up the junk in the doctype (the quotes are part of the text being removed)
my_html.sub!('"about:legacy-compat"', '')

# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')

puts body_tag_content.inner_html
# => "
#       Hello World
#       
#    "

In general, when you're parsing potentially dirty third-party data such as HTML, you should clean it up first so the parser doesn't choke and do unexpected things. You could run the HTML through a linter or "tidy" tool to try to clean it up automatically. When all else fails, you'll have to clean it by hand, as above.
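
As a rough illustration of the "clean it by hand" step, here is a minimal sketch that replaces whatever doctype the document starts with by the plain HTML5 one, rather than removing a single substring. The helper name normalize_doctype, the DOCTYPE_PATTERN constant, and the sample markup are all made up for this example and are not part of Nokogiri. It also assumes the doctype is the first thing in the document and contains no ">" before its closing bracket, which may not hold for the original page.

```ruby
# Hypothetical helper (not a Nokogiri API): swap the leading doctype,
# whatever it contains, for the plain HTML5 doctype.
DOCTYPE_PATTERN = /\A\s*<!DOCTYPE[^>]*>/i

def normalize_doctype(html)
  # sub (without the bang) returns a new string and leaves the input untouched
  html.sub(DOCTYPE_PATTERN, "<!DOCTYPE html>")
end

# Assumed sample input; the markup in the question may differ.
dirty = %(<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html><body>Hello World</body></html>)
clean = normalize_doctype(dirty)
puts clean
```

The cleaned string could then be handed to Nokogiri::HTML as in the snippet above. Unlike sub!, this version does not modify my_html in place, so it also works on frozen strings and never returns nil when nothing matches.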

HTML tidy/cleaning in Ruby 1.9
