Reputation: 2293
I'm trying to parse URLs and return text in the original format.
I'm using this gem: https://github.com/cantino/ruby-readability
Here is what I have:
require 'rubygems'
require 'readability'
require 'open-uri'
source = open(@string).read
@text = Readability::Document.new(source).content
This just gives me the text with the html tags for the formatting.
I've tried:
@text = Readability::Document.new(source, tags: []).content
This just strips the HTML tags, and the text is all crammed together. I'm trying to grab the text and keep the line breaks and spaces, without any HTML tags. I then want to run the text through some additional algorithms. I'm not displaying the text in any views; if I were, I would just call the simple_format helper.
For example for this URL: http://blogs.discovermagazine.com/d-brief/2014/08/01/physicist-invents-color-changing-ice-cream/
I would like to save the text in its unaltered format:
@text = %Q[Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.
The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.
What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending.
“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.
Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.]
Upvotes: 1
Views: 420
Reputation: 160631
At its most basic, you can try:
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0 and later, open-uri no longer patches Kernel#open, so use URI.open.
@text = Readability::Document.new(
  Nokogiri::HTML(URI.open('url_to_content')).text
).content
Nokogiri is the de facto standard for parsing XML and HTML in Ruby.
Nokogiri::HTML(URI.open('url_to_content')) is the basis for 99.99% of how we'd parse a web page. Calling text on the result returns the concatenated text of every node in the document.
That said, you're going to have to dive into the page and extract only the section(s) containing the text you really want, because the page also contains links to other pages, advertising and whatnot, all of which is returned when text is called on the root node.
A CSS selector like 'div.entry p' looks like it would get you close:
doc = Nokogiri::HTML(URI.open('url_to_content'))
doc.search('div.entry p').text
Returns:
"The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. \n“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.Thanks for your expert opinion!I definitely see your point. 
How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not. But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.“Love elixir” had better not be what it sounds like…mmmm love elixer,.."
Printing the output looks a little better, and shows there are some line-ends embedded in the text:
puts doc.search('div.entry p').text
The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. 
“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.Thanks for your expert opinion!I definitely see your point. 
How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not. But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.“Love elixir” had better not be what it sounds like…mmmm love elixer,..
If you want a better idea of the text as displayed, you also have to account for the trailing blank line a browser shows after each <p> tag, which is easy after a small tweak to the code:
puts doc.search('div.entry p').map(&:text)
The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)
Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.
The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.
What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending.
“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.
Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.
Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.
Thanks for your expert opinion!
I definitely see your point.
How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.
Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.
There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.
I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not. But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….
Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.
“Love elixir” had better not be what it sounds like…
mmmm love elixer,..
What's happening is:
doc.search('div.entry p') returns a NodeSet, which is like an Array, containing the <p> nodes. search is one of several similar methods Nokogiri provides to find all matching nodes in a document. map(&:text) walks that NodeSet and returns the text of each element, effectively returning each paragraph.
Upvotes: 1
Reputation: 1259
Here's a long and hacky way to take what ruby-readability gives you and get it closer to what you want. You'll need to test whether it works with the other articles you're looking to scrape.
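# Assumption: the fourth gsub originally removed non-breaking spaces; a plain " " would delete every space, which the output below does not show.
html = Readability::Document.new(source, :blacklist => ".wp-caption-text", :tags => ["div", "p"]).content
html.gsub(/[\n\r\t]/, "").gsub("\u00A0", "").gsub(%r{</?div>}, "").strip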
Output:
=> "<p>Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.</p><p>The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.</p><p>What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. </p><p>“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.</p><p>Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.</p>
I left in the <p> tags so you can add line breaks or do whatever you want with them.
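One way to turn those <p> tags into line breaks is sketched below. The sample string is hypothetical, shaped like the output above, and a regular expression is only reasonable here because the markup has already been reduced to flat <p> pairs; for arbitrary HTML a parser is the safer tool.

```ruby
# Hypothetical input in the same shape as the output above:
# paragraphs wrapped in <p> tags and no other markup.
html = "<p>First paragraph.</p><p>Second paragraph. </p><p>Third paragraph.</p>"

# Capture the contents of each <p>...</p> pair (non-greedy, and allowing
# line ends inside a paragraph), then trim stray leading/trailing spaces.
paragraphs = html.scan(%r{<p>(.*?)</p>}m).flatten.map(&:strip)

# Separate paragraphs with a blank line.
text = paragraphs.join("\n\n")

puts text
```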
Upvotes: 1