Reputation: 35400
I need to parse thousands of feeds and performance is an essential requirement. Do you have any suggestions?
Thanks in advance!
Upvotes: 7
Views: 3855
Reputation: 3802
Initially I used Nokogiri to do some basic XML parsing, but it was slow and (at times) erratic. I switched to Feedzirra, and not only was there a great performance boost, there were no errors and it's as easy as pie. An example is shown below:
# fetching a single feed
feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")
# feed and entries accessors
feed.title # => "Paul Dix Explains Nothing"
feed.url # => "http://www.pauldix.net"
feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
entry = feed.entries.first
entry.title # => "Ruby Http Client Library Performance"
entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
entry.author # => "Paul Dix"
entry.summary # => "..."
entry.content # => "..."
entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
entry.categories # => ["...", "..."]
If you want to do more with the feeds, for example iterating over the entries and processing each one, the following will suffice:
source = Feedzirra::Feed.fetch_and_parse("http://www.feed-url-you-want-to-play-with.com")
puts "Parsing Downloaded XML....\n\n\n"
source.entries.each do |entry|
  begin
    puts "#{entry.summary} \n\n"
    cleanURL = entry.url.gsub("+", "%2B") # my own sanitization process, ignore
    scrapArticleWithURL(cleanURL)         # my own scraping method
  rescue
    puts "(****)there has been an error fetching (#{entry.title}) \n\n"
  end
end
Upvotes: 0
Reputation: 19528
When all you have is a hammer, everything looks like a nail. Consider a solution other than Ruby for this. Though I love Ruby and Rails and would not part with them for web development or perhaps for a domain specific language, I prefer heavy data lifting of the type you describe be performed in Java, or perhaps Python or even C++.
Given that the destination of this parsed data is likely a database, it can act as the common point between the Rails portion of your solution and the other-language portion. Then you're using the best tool to solve each of your problems, and the result is likely easier to work on and truly meets your requirements.
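For illustration, a rough Ruby sketch of the Rails side of that arrangement, assuming the external parser (Java, Python, C++, whatever) writes rows into a shared feed_entries table; the table and column names here are made up:

# Hypothetical ActiveRecord model over the table the external parser fills.
class FeedEntry < ActiveRecord::Base
  # assumed columns: feed_url, title, url, summary, published_at
end

# Rails only reads what the external parser has already written:
recent = FeedEntry.find(:all, :conditions => ["published_at > ?", 1.day.ago])
recent.each { |entry| puts "#{entry.title} - #{entry.url}" }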
If speed is truly of the essence, why add an additional constraint and say, "Oh, it's only of the essence as long as I get to use Ruby"?
Upvotes: 1
Reputation: 173
You can use RFeedParser, a Ruby port of the (famous) Python Universal Feed Parser. It's based on Hpricot, and it's really fast and easy to use.
http://rfeedparser.rubyforge.org/
An example:
require 'rubygems'
require 'rfeedparser'
require 'open-uri'
feed = FeedParser::parse(open('http://feeds.feedburner.com/engadget'))
feed.entries.each do |entry|
puts entry.title
end
Upvotes: 3
Reputation: 3502
I haven't tried it, but I read about Feedzirra recently (it claims to be built for performance):
Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb gem for faster http gets, and libxml through nokogiri and sax-machine for faster parsing.
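Since you have thousands of feeds, the batch API is the relevant part. A minimal sketch based on the Feedzirra README (the feed URLs are just placeholders):

require 'rubygems'
require 'feedzirra'

# fetch many feeds at once; libcurl-multi performs the HTTP gets in parallel
feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing",
             "http://feeds.feedburner.com/trottercashion"]
feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)
# feeds is a hash keyed by feed URL, with the parsed feed objects as values

# later, update in place; etag and last-modified keep unchanged feeds cheap
updated_feeds = Feedzirra::Feed.update(feeds.values)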
Upvotes: 10
Reputation: 978
Not sure about the performance, but a similar question was answered at Parsing Atom & RSS in Ruby/Rails?
You might also look into Hpricot, which parses XML but assumes that it's well-formed and doesn't do any validation.
http://wiki.github.com/why/hpricot http://wiki.github.com/why/hpricot/hpricot-xml
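If you go that route, here is a minimal sketch of pulling titles out of an RSS feed with Hpricot's XML mode (the feed URL and the item/title selectors are just an example; Atom feeds use different element names):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# XML mode, so Hpricot doesn't try to "fix up" the markup as HTML
doc = Hpricot.XML(open('http://feeds.feedburner.com/engadget'))

# grab each <item>'s <title> text
(doc/:item).each do |item|
  puts item.at(:title).inner_text
end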
Upvotes: 0