Reputation: 5235
I am trying to extract the contents of an HTML file using Ruby (not RoR).
This is what I was doing:
require 'sanitize'
require 'nokogiri'
doc = Nokogiri::HTML(html_document)
a = Sanitize.fragment(doc.css('body'))
This extracts the contents of the <body> tag and removes all HTML tags. Unfortunately, the JavaScript that was inside <script> tags still remains in the output.
How do I remove the JS scripts in addition to the HTML tags?
Upvotes: 0
Views: 284
Reputation: 3320
I assume you are using the newest version of Sanitize.
html = "<html><head><title></title><style>.red{color:red;}</style></head><body><div>... <b>some content</b> ...</div><script>... a script ...</script></body></html>"
Sanitize.fragment(html, :remove_contents => ['script'])
# => ".red{color:red;} ... some content ... "
Sanitize.fragment(html, :remove_contents => ['script', 'style'])
# => " ... some content ... "
Please see the :remove_contents option in the Sanitize documentation.
Upvotes: 1