Sarp Kaya

Reputation: 3784

Scraping entire HTML tags with Nokogiri

I've searched everywhere, and all I found was how to do CSS selection with Nokogiri. What I am after is simply to get rid of all the HTML tags.

For example this:

<html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
      <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
      <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
      <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
      <p>Here are some entertaining links:</p>
      <ul>
         <li><a href="http://youtube.com">YouTube</a></li>
         <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
         <li><a href="http://kathack.com/">Kathack</a></li>
         <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
      </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
<p>Addition</p>
</html> 
Extra content

Should output as:

Hello Webpage!

Click here to go to the search engine Google

Or you can click here to go to Microsoft Bing.

Don't want to learn Ruby? Then give Zed Shaw's Learn Python the Hard Way a try

Here are some entertaining links:

YouTube
Reddit
Kathack
New York Times
Thank you for reading my webpage!
Addition
Extra content

How can I do that with Nokogiri? Also, what can I do to scrape other code, such as JavaScript?

Upvotes: 2

Views: 786

Answers (2)

Agustin

Reputation: 1254

There are many ways to do what you want. I would look into Loofah, which wraps Nokogiri under the hood.

In Loofah you would do something like:

require 'loofah'
# Parse the markup, prune unsafe elements, then extract the text.
document = Loofah.fragment(html)
document.scrub!(:prune).text

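The :prune scrubber removes unsafe tags together with their subtrees, and text then returns the remaining text content. If you want newlines around block-level elements, Loofah also provides to_text.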

Upvotes: 0

Kalman

Reputation: 8121

require 'nokogiri'

html = %q{ 
  <html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
     <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
     <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
     <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's    Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
    <p>Here are some entertaining links:</p>
    <ul>
     <li><a href="http://youtube.com">YouTube</a></li>
     <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
     <li><a href="http://kathack.com/">Kathack</a></li>
     <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
     </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
</html>
}

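# Nokogiri::HTML tolerates real-world markup; Nokogiri::XML expects well-formed XML.
doc = Nokogiri::HTML(html)
# text already returns the content without tags, so no regex is needed.
puts doc.at('body').text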
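
The extracted text keeps the indentation and blank lines of the source markup, so it will not look like the one-item-per-line output in the question. A minimal sketch for tidying it, using only core Ruby: tidy_text is a hypothetical helper name, and the raw string merely stands in for what the text call might return.

```ruby
# Hypothetical helper: strip surrounding whitespace from each line
# and drop the lines that end up empty.
def tidy_text(raw)
  raw.lines.map(&:strip).reject(&:empty?).join("\n")
end

# Sample string standing in for text extracted from the page.
raw = "\n   Hello Webpage!\n   \n      Click here to go to the search engine Google\n\n   Thank you for reading my webpage!\n\n"

puts tidy_text(raw)
```

This collapses every blank line, so the result is close to, but not identical to, the layout asked for in the question, which keeps blank lines between some paragraphs.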

Upvotes: 1
