Reputation: 3784
I've searched everywhere, but all I found was how to do CSS selection with Nokogiri. What I am after is simply to get rid of all HTML tags.
For example, this:
<html>
<head><title>My webpage</title></head>
<body>
<h1>Hello Webpage!</h1>
<div id="references">
<p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
<p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
<p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
</div>
<div id="funstuff">
<p>Here are some entertaining links:</p>
<ul>
<li><a href="http://youtube.com">YouTube</a></li>
<li><a data-category="news" href="http://reddit.com">Reddit</a></li>
<li><a href="http://kathack.com/">Kathack</a></li>
<li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
</ul>
</div>
<p>Thank you for reading my webpage!</p>
</body>
<p>Addition</p>
</html>
Extra content
Should output as:
Hello Webpage!
Click here to go to the search engine Google
Or you can click here to go to Microsoft Bing.
Don't want to learn Ruby? Then give Zed Shaw's Learn Python the Hard Way a try
Here are some entertaining links:
YouTube
Reddit
Kathack
New York Times
Thank you for reading my webpage!
Addition
Extra content
How can I do that using Nokogiri? Also, what else can I do to scrape other code, such as JavaScript?
Upvotes: 2
Views: 786
Reputation: 1254
There are many ways to do what you want. I would look into Loofah, which wraps Nokogiri under the hood.
In Loofah you would do something like:
require 'loofah'
# html is the markup string you want to clean
document = Loofah.fragment(html)
puts document.scrub!(:prune).text
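The :prune scrubber removes unsafe tags together with their subtrees, and text returns the text content of what is left. As far as I know, text does not add line breaks itself; Loofah's to_text is the variant that puts newlines around block-level elements.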
Upvotes: 0
Reputation: 8121
require 'nokogiri'
html = %q{
<html>
<head><title>My webpage</title></head>
<body>
<h1>Hello Webpage!</h1>
<div id="references">
<p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
<p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
<p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
</div>
<div id="funstuff">
<p>Here are some entertaining links:</p>
<ul>
<li><a href="http://youtube.com">YouTube</a></li>
<li><a data-category="news" href="http://reddit.com">Reddit</a></li>
<li><a href="http://kathack.com/">Kathack</a></li>
<li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
</ul>
</div>
<p>Thank you for reading my webpage!</p>
</body>
</html>
}
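# Parse as HTML rather than XML so the lenient HTML parser is used.
doc = Nokogiri::HTML(html)
body = doc.at('body')
# Node#text already returns the content without tags, so no gsub is needed.
puts body.text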
Upvotes: 1