Rebecca

Reputation: 710

Remove specific HTML elements in Ruby

I have seen whitelist-based sanitizers for HTML in Ruby, but I need the opposite: I need ONLY links removed from a page being readied for PDF conversion. I tried Sanitize, but it does not fit what I need, because it is too difficult to guess which HTML elements will be used on the fetched page, so I cannot add them all to the whitelist.

If my input was

<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>

I would want

Link!
<b>Bold Text</b>
<div>A div!</div>

to be the output.

Is there any 'blacklist-based sanitizer' for Ruby?

Upvotes: 1

Views: 3000

Answers (4)

Lucas Chwe

Reputation: 2768

html_without_links = remove_tags('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>', 'a')

Call it as shown above, using the code below, and you should get what you want.

require 'nokogiri'
require 'active_support/core_ext/object/blank' # provides blank? outside of Rails

# Returns true if parsing the text as HTML and extracting its text changes it,
# i.e. the string contains markup.
def is_html?(text)
  stripped_text = Nokogiri::HTML(text).text.strip
  !stripped_text.eql?(text)
end

# Removes the given tag (keeping its text) and returns the document's plain text.
def remove_tags(message_string, tag = nil)
  return message_string if message_string.blank? || tag.blank? || !is_html?(message_string)

  html_doc = Nokogiri::HTML(message_string)
  html_doc.search(tag).each do |node|
    node.replace(node.content) # keep the text, drop the tag
  end

  html_doc.text
end
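For reference, the call at the top of this answer should return something like the following (the exact value is an assumption; note that html_doc.text also flattens the remaining <b> and <div> markup to plain text):

remove_tags('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>', 'a')
# => "Link!Bold TextA div!"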

Upvotes: 0

Nino van Hooff

Reputation: 3893

Rails 4.2 can do this out of the box. For older versions, the gem 'rails-html-sanitizer' is required.

Whitelist only the supplied tags and attributes:

white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
white_list_sanitizer.sanitize(@article.body, tags: %w(table tr td), attributes: %w(id class style))

Or use the Loofah-based TargetScrubber:

Rails::Html::TargetScrubber

Where PermitScrubber picks out tags and attributes to permit in sanitization, Rails::Html::TargetScrubber targets them for removal.

scrubber = Rails::Html::TargetScrubber.new
scrubber.tags = ['img']

html_fragment = Loofah.fragment('<a><img/ ></a>')
html_fragment.scrub!(scrubber)
html_fragment.to_s # => "<a></a>"
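Applied to the question's input, a sketch along these lines should strip the link tags while keeping their text (the require lines and the expected output are assumptions, based on TargetScrubber removing the targeted tag but keeping its children):

require 'loofah'
require 'rails-html-sanitizer'

scrubber = Rails::Html::TargetScrubber.new
scrubber.tags = ['a'] # blacklist: only <a> tags are targeted for removal

html_fragment = Loofah.fragment('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>')
html_fragment.scrub!(scrubber)
html_fragment.to_s # expected: "Link!<b>Bold Text</b><div>A div!</div>"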

Rails HTML sanitizer

Upvotes: 1

Phrogz

Reputation: 303205

Minor variation on the Tin Man's answer, still using Nokogiri:

require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html )
doc.css('a,blink,marquee').each do |el|
  el.replace( el.inner_html )
end
cleaned = doc.to_html

The two differences here are:

  1. Using css over search to be slightly more specific about the selectors being used (though it offers no functional difference), but more importantly

  2. By replacing with inner_html we preserve possible markup inside the link (a runnable sketch follows this list). For example, given the markup:

    <p><a href="foo">Hi <b>Mom</b></a>!</p>
    

    then replacing with .content would produce:

    <p>Hi Mom!</p>
    

    whereas replacing with .inner_html produces:

    <p>Hi <b>Mom</b>!</p>
    
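A self-contained version of the comparison, using the <p> example above (the inline results are the values I'd expect, not verified output):

require 'nokogiri'

my_html = '<p><a href="foo">Hi <b>Mom</b></a>!</p>'

# Replacing with inner_html keeps the nested <b> markup.
doc = Nokogiri.HTML(my_html)
doc.css('a').each { |el| el.replace(el.inner_html) }
doc.at('p').to_html # => "<p>Hi <b>Mom</b>!</p>"

# Replacing with content keeps only the plain text.
doc = Nokogiri.HTML(my_html)
doc.css('a').each { |el| el.replace(el.content) }
doc.at('p').to_html # => "<p>Hi Mom!</p>"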

Upvotes: 3

the Tin Man

Reputation: 160551

You want an HTML parser, such as Nokogiri. It lets you walk through the document, searching for specific nodes ("tags"), and do things to them:

require 'nokogiri'

html = '<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>
'

doc = Nokogiri.HTML(html)

doc.search('a').each do |a|
  a.replace(a.content)
end

puts doc.to_html

Which results in:

<html><body>Link!
<b>Bold Text</b>
<div>A div!</div>
</body></html>

Notice that Nokogiri did some fixups to the code, supplying the appropriate <html> and <body> tags. It doesn't have to; I could have told it to parse and return a document fragment, but usually we let it do its thing.
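If you do want to skip the wrapper tags, a fragment-based sketch along these lines should work (the printed output is what I'd expect):

frag = Nokogiri::HTML.fragment(html)

frag.search('a').each do |a|
  a.replace(a.content)
end

puts frag.to_html

which should print the link's text followed by the untouched <b> and <div> markup, with no <html> or <body> wrapper.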

Upvotes: 2
