Light Yagmi

Reputation: 5235

Remove JS inside <script> tags when sanitizing an HTML file in Ruby

I am trying to extract the text content from an HTML file using Ruby (not RoR).

I was doing this:

require 'sanitize'
require 'nokogiri'

doc = Nokogiri::HTML(html_document)
a = Sanitize.fragment(doc.css('body'))

This extracts the contents of the <body> tag and removes all HTML tags. Unfortunately, the JS code that was inside the <script> tags still remains.

How do I remove the JS code in addition to the HTML tags?

Upvotes: 0

Views: 284

Answers (1)

guitarman

Reputation: 3320

I assume you are using the newest version of Sanitize.

html = "<html><head><title></title><style>.red{color:red;}</style></head><body><div>... <b>some content</b> ...</div><script>... a script ...</script></body></html>"

Sanitize.fragment(html, :remove_contents => ['script'])
# => ".red{color:red;} ... some content ... "

Sanitize.fragment(html, :remove_contents => ['script', 'style'])
# => " ... some content ... "

Please see the :remove_contents option in the Sanitize documentation.

Upvotes: 1
