jayp

Reputation: 362

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.

I tried this:

require 'nokogiri'
x = "<html>  <body>  <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s

This outputs:

<div class="example"></div>

I've also tried running it without the children.remove part:

y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s

But then I get:

<div class="example"><span>Hello</span></div>

But what I actually want is:

<html><body><div class='example'><span></span></div></body></html>

Upvotes: 0

Views: 512

Answers (1)

ezkl

Reputation: 3851

NOTE: This is a very aggressive approach. Elements such as <script>, <style>, and <noscript> also have child text() nodes (containing JS, CSS, and HTML respectively), which you may not want to remove, depending on your use case.

Your code prints the return value of the iterator, which is the set of matched nodes rather than the document. If you operate on the parsed document instead, you can remove the text nodes and then print the document itself:

require 'nokogiri'
html = "<html>  <body>  <div class='example'><span>Hello</span></div></body></html>"

# Parse HTML
doc = Nokogiri::HTML.parse(html)

puts doc.inner_html
# => "<html>  <body>  <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"

# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }

puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

Upvotes: 2
