How to remove a node using Nokogiri

Question

I have an HTML structure like this:


  This is
   very
    
  
   important.

I know how to get a Nokogiri::XML::NodeSet from this:

dom.xpath("//div")

I now want to filter out any script tag:

dom.xpath("//script")

So I can get something like:


  This is
   very
   important.

So that I can call div.text to get:

"This is very important."

I tried walking over all the child nodes, both recursively and iteratively, and filtering out every node I don't want, but I ran into problems such as too much or too little whitespace. I'm quite sure there's a nicer, more Ruby-like way.

What would be a good way to do this?

Eric Duminil · Accepted Answer

1st problem

To remove all the script nodes:

require 'nokogiri'

html = "
  This is
   very
    
  
   important.
"

doc = Nokogiri::HTML(html)

doc.xpath("//script").remove

p doc.text
#=> "
  This is
   very
    
  
   important.
"

Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of each Node).

2nd problem

To remove the unneeded whitespace, you can use:

strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
gsub to replace multiple whitespace characters with a single space

p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
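
The two calls can also be tried on a plain string, without Nokogiri, to see what each step does. This is only a sketch: normalize_whitespace is a hypothetical helper name, and the sample string just imitates the kind of text left behind once the script node is gone.

```ruby
# Hypothetical helper; not part of Nokogiri or the Ruby standard library.
def normalize_whitespace(text)
  # strip trims leading and trailing whitespace,
  # gsub collapses every remaining run of whitespace into one space.
  text.strip.gsub(/[[:space:]]+/, ' ')
end

raw = "\n  This is\n   very\n    \n  \n   important.\n"
result = normalize_whitespace(raw)

p result
#=> "This is very important."
```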
