Reputation: 4268
I need to trim the empty elements before the first and after the last tag that has text content. I want to control the content displayed to the client without "breaking" the layout.
<p> <br> </p> ~> remove
<p> <br> </p> ~> remove
<p> Text </p>
<p> <br> </p> ~> keep (the only empty tag that should be preserved)
<p> Text </p>
<p> Text </p>
<p> <br> </p> ~> remove
<p> <br> </p> ~> remove
<p> <br> </p> ~> remove
I'm using Sanitize, which accepts custom transformers. The documentation shows an example snippet that removes all empty elements.
To remove only the empty elements that come before the first regular element, I thought I could use a flag variable to control when it stops removing empty tags:
should_remove_empty = true
lambda {|env|
  node = env[:node]
  return unless node.elem?
  unless node.children.any?{|c| c.text? && c.content.strip.length > 0 || !c.text? }
    node.unlink if should_remove_empty
  else
    should_remove_empty = false
  end
}
But to remove the trailing empty elements, I would have to iterate over the nodes in reverse order, and Sanitize doesn't let me do that.
Does anyone know how to do this, or has anyone already implemented it?
Upvotes: 1
Views: 1153
Reputation: 48599
I'm using https://github.com/rgrove/sanitize
From the README:
Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
That won't work for you because sometimes you want to keep the elements that are unacceptable.
require 'nokogiri'
doc = Nokogiri::HTML(<<END_OF_HTML)
<body>
<p> <br> </p>
<p> <br> </p>
<p> Text </p>
<p> <br> </p>
<p> Text </p>
<p> Text </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
</body>
END_OF_HTML
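ps = doc.xpath('/html/body/p')
first_text = nil
last_text = nil
ps.each_with_index do |para, i|
  next if para.text.strip.empty?  # skip paragraphs with no visible text
  first_text ||= i                # index of the first paragraph with text
  last_text = i                   # index of the last paragraph with text so far
end
# Print only the range from the first to the last paragraph with text.
puts ps.slice(first_text..last_text) if first_text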
--output:--
<p> Text </p>
<p> <br></p>
<p> Text </p>
<p> Text </p>
Upvotes: 1