Reputation: 710
I have seen whitelist-based sanitizers for HTML in Ruby, but I need the opposite: I need ONLY links removed from a page that is being readied for PDF conversion. I tried Sanitize, but it does not fit my needs because it is too difficult to guess which HTML elements will appear on the fetched page in order to add them all to the whitelist.
If my input was
<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>
I would want
Link!
<b>Bold Text</b>
<div>A div!</div>
to be the output.
Is there any 'blacklist-based sanitizer' for Ruby?
Upvotes: 1
Views: 3000
Reputation: 2768
html_without_links = remove_tags('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>', 'a')
Call it as above with the methods below, and you should get what you want.
require 'nokogiri'

def is_html?(text)
  # If stripping all markup changes the string, the input contained HTML
  stripped_text = Nokogiri::HTML(text).text.strip
  !stripped_text.eql?(text)
end

def remove_tags(message_string, tag = nil)
  # blank? is ActiveSupport; plain-Ruby nil/empty checks work everywhere
  return message_string if message_string.to_s.empty? || tag.to_s.empty? || !is_html?(message_string)

  html_doc = Nokogiri::HTML.fragment(message_string)
  html_doc.search(tag).each do |node|
    node.replace(node.content)
  end
  # Return the remaining markup; html_doc.text here would strip every tag
  html_doc.to_html
end
Upvotes: 0
Reputation: 3893
Rails 4.2 can do this out of the box; for older versions, the gem 'rails-html-sanitizer' is required:
white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
white_list_sanitizer.sanitize(@article.body, tags: %w(table tr td), attributes: %w(id class style))
or use Loofah's TargetScrubber
Rails::Html::TargetScrubber
Where PermitScrubber picks out tags and attributes to permit in sanitization, Rails::Html::TargetScrubber targets them for removal.
scrubber = Rails::Html::TargetScrubber.new
scrubber.tags = ['img']
html_fragment = Loofah.fragment('<a><img/></a>')
html_fragment.scrub!(scrubber)
html_fragment.to_s # => "<a></a>"
Upvotes: 1
Reputation: 303205
Minor variation on the Tin Man's answer, still using Nokogiri:
require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html )
doc.css('a,blink,marquee').each do |el|
el.replace( el.inner_html )
end
cleaned = doc.to_html
The two differences here are:
1. Using css over search to be slightly more specific about the selectors being used (though it offers no functional difference here), but more importantly,
2. By replacing with inner_html we preserve possible markup inside the link. For example, given the markup:
<p><a href="foo">Hi <b>Mom</b></a>!</p>
then replacing with .content would produce:
<p>Hi Mom!</p>
whereas replacing with .inner_html produces:
<p>Hi <b>Mom</b>!</p>
Upvotes: 3
Reputation: 160551
You want an HTML parser, such as Nokogiri. It lets you walk through the document, searching for specific nodes ("tags"), and do things with them:
require 'nokogiri'
html = '<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>
'
doc = Nokogiri.HTML(html)
doc.search('a').each do |a|
a.replace(a.content)
end
puts doc.to_html
Which results in:
<html><body>Link!
<b>Bold Text</b>
<div>A div!</div>
</body></html>
Notice that Nokogiri did some fixups to the code, supplying the appropriate <html> and <body> tags. It doesn't have to; I could have told it to parse and return a document fragment, but usually we let it do its thing.
Upvotes: 2