Reputation: 3132
I am trying to scrape an XML website and get its contents.
class PageScraper
  def get_page_details
    if xml_data
      # get the info from the XML website
    else
      # get it from the HTML website
    end
  end

  def get_xml_details
    if xml_data
      # get it from the XML website
    end
  end

  def xml_data
    xml_url = 'www.abcd.xml'
    # Download and parse the XML data from the abcd.xml site using the nokogiri gem
  end
end
Here, other methods need to call the xml_data method, and every call fetches and downloads the data from the XML website again. Is there any way to store the XML data in a variable (like @data = xml_data()) the first time it is called and return the downloaded data? On subsequent calls, xml_data should return @data, which is cached.
Upvotes: 0
Views: 39
Reputation: 160611
Why aren't you using OpenURI and Nokogiri? The normal process of retrieving and parsing the XML will do what you're wanting to do. The Nokogiri site is full of examples.
As far as your class goes, you probably need a method to retrieve the page, which will also store it in an instance or class variable, which is your choice depending on whether the class is responsible for multiple pages or only one.
As an example, here's some code for parsing HTML, which is almost identical to what would be done for parsing XML. The only real difference would be using Nokogiri::XML instead of Nokogiri::HTML:
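require 'open-uri'
require 'nokogiri'

class PageScraper
  def initialize(url)
    # Download once and keep both the raw source and the parsed DOM.
    # URI.open is used because Kernel#open no longer handles URLs in Ruby 3+.
    @source = URI.open(url).read
    @dom = Nokogiri::HTML(@source)
  end

  # True if Nokogiri reported any parse errors.
  def errors?
    !@dom.errors.empty?
  end

  # Returns the title text as a String (nil if there is no <title>).
  def title
    @dom.title
  end

  def head
    @dom.at('head')
  end

  def body
    @dom.at('body')
  end
end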
Of course you'd change the accessors for various elements like head and body to match your particular use-case.
After running that, both the HTML (or XML) source and the parsed HTML/XML DOM would be available as instance variables, allowing you to easily refer to either. It's really not necessary to keep @source, since it can be recovered using @dom.to_xml or @dom.to_html, unless there are errors in the source, in which case Nokogiri will try to fix up the document, possibly causing it to differ from the original.
The class would be used something like this:
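page_scraper = PageScraper.new('http://www.example.com')
abort "HTML errors found" if page_scraper.errors?
# title already returns the text, because Nokogiri's Document#title is a String.
page_title_text = page_scraper.title
# To change the title, set the content of the <title> node itself.
page_scraper.head.at('title').content = 'Foo bar'
page_css = page_scraper.head.at('style').text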
Upvotes: 1