Reputation: 154
I have an HTML structure like this:
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
I know how to get a Nokogiri::XML::NodeSet from this:
dom.xpath("//div")
I now want to filter out any script tag:
dom.xpath("//script")
So I can get something like:
<div>
This is
<p> very</p>
important.
</div>
So that I can call div.text to get:
"This is very important."
I tried recursively/iteratively going over all child nodes and filtering out every node I don't want, but I ran into problems like too much or too little whitespace. I'm quite sure there's a nice, Ruby-esque way.
What would be a good way to do this?
Upvotes: 2
Views: 2986
Reputation: 54313
To remove all the script nodes:
require 'nokogiri'
html = "<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>"
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
p doc.text
#=> "\n This is\n very\n \n \n important.\n"
Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of on each Node).
To remove the unneeded whitespace, you can use:
strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
gsub to replace multiple whitespace characters with a single space
p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
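One caveat, as far as I can tell: String#strip only trims ASCII whitespace (and null), while [[:space:]] also matches Unicode whitespace such as the non-breaking space that &nbsp; turns into. Running gsub first and strip second therefore looks slightly more robust. A minimal sketch of that ordering, where normalize_whitespace is a helper name made up for this example and not part of Nokogiri:

```ruby
# Hypothetical helper: collapse every run of whitespace into one space,
# then trim the ends. gsub runs first so that Unicode whitespace at the
# edges has already become a plain space by the time strip sees it.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

p normalize_whitespace("\n This is\n very\n \n \n important.\n")
#=> "This is very important."

# A leading non-breaking space (U+00A0) survives strip, so the
# strip-then-gsub order leaves a stray leading space.
p "\u00A0very important".strip.gsub(/[[:space:]]+/, ' ')
#=> " very important"
p normalize_whitespace("\u00A0very important")
#=> "very important"
```

For the sample input in the question both orders give the same result; the difference only shows up when the text starts or ends with non-ASCII whitespace.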
Upvotes: 1
Reputation: 160631
NodeSet has a remove method, which makes it easy to remove whatever matched your selector:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><p>foo</p><p>bar</p></div>
</body>
</html>
EOT
doc.search('p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div></div>
# >> </body>
# >> </html>
Applied to your sample input:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
EOT
doc.search('script').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <div>
# >> This is
# >> <p> very
# >>
# >> </p>
# >> important.
# >> </div>
# >> </body></html>
At that point the text in the <div> is:
doc.at('div').text # => "\n This is\n very\n \n \n important.\n"
Normalizing that is easy:
doc.at('div').text.gsub(/[\n ]+/,' ').strip # => "This is very important."
Upvotes: 2