tech_human

Reputation: 7154

Issue parsing HTML using Nokogiri

I have some HTML and want to get the content of the <body> element. However, whatever I try, once Nokogiri has parsed the HTML, everything from the doctype and the <head> element also ends up inside <body>. When I retrieve the <body> element, I see a remnant of the doctype plus the <meta>, <title>, <link> and <script> tags too.

My original HTML is:

 <!DOCTYPE html \"about:legacy-compat\">
<html>
   <head>
      <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
      <title>Some Title</title>
      <meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
      <link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
      <script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
   </head>
   <body marginwidth=\"6\" marginheight=\"6\" leftmargin=\"6\" topmargin=\"6\">
      <div class=\"hello-status\">Hello World</div>
      <div valign=\"top\"></div>
   </body>
</html>

The code I am using is:

parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html

What I am getting:

<p>about:legacy-compat\"&gt;</p>
\n
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
\n
<title>Some Title</title>
\n
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
\n
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>

What I am expecting:

<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>

Any idea what's happening here?

Upvotes: 2

Views: 89

Answers (1)

DiegoSalazar

Reputation: 13531

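I got your example to work by first cleaning up the original HTML. I removed the "about:legacy-compat" string from the doctype, which seemed to be confusing Nokogiri. The doctype as written is malformed: the legacy-compat form defined by the HTML standard is <!DOCTYPE html SYSTEM "about:legacy-compat">, with the SYSTEM keyword, which is probably why the parser stumbles on it.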

require 'nokogiri'

# remove the stray quoted string from the doctype (sub! modifies my_html in place)
my_html.sub!("\"about:legacy-compat\"", "")

# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')

puts body_tag_content.inner_html
# => "\n      <div class=\"hello-status\">Hello World</div>\n      <div valign=\"top\"></div>\n   "

In general, when you're parsing potentially dirty third-party data such as HTML, clean it up first so the parser doesn't choke and do unexpected things. You could run the HTML through a linter or "tidy" tool to try to clean it up automatically. When all else fails, you'll have to clean it by hand, as above.
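
If the junk in the doctype varies between documents, a slightly more general version of the hand cleanup might look like the sketch below. It assumes you are happy to replace whatever doctype is present with the plain HTML5 one. normalize_doctype and DOCTYPE_PATTERN are hypothetical names made up for this example, not part of Nokogiri, and the HTML is a shortened version of the one in the question.

```ruby
# Hypothetical helper (not part of Nokogiri): replaces the first doctype
# declaration, whatever it contains, with the plain HTML5 doctype.
DOCTYPE_PATTERN = /<!DOCTYPE[^>]*>/i

def normalize_doctype(html)
  # String#sub returns a new string, so the caller's original is left untouched.
  html.sub(DOCTYPE_PATTERN, '<!DOCTYPE html>')
end

# Shortened version of the HTML from the question.
my_html = <<~HTML
  <!DOCTYPE html "about:legacy-compat">
  <html>
    <head><title>Some Title</title></head>
    <body><div class="hello-status">Hello World</div></body>
  </html>
HTML

cleaned = normalize_doctype(my_html)
puts cleaned.lines.first
# prints: <!DOCTYPE html>
```

The cleaned string can then be passed to Nokogiri::HTML exactly as in the snippet above. Unlike sub!, this version does not modify my_html, which matters if the string is frozen or reused elsewhere.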

See also: HTML tidy/cleaning in Ruby 1.9

Upvotes: 1
