collimarco

Reputation: 35400

High-performance RSS/Atom parsing with Ruby on Rails

I need to parse thousands of feeds, and performance is an essential requirement. Do you have any suggestions?

Thanks in advance!

Upvotes: 7

Views: 3855

Answers (5)

Edwin

Reputation: 3802

Initially I used Nokogiri to do some basic XML parsing, but it was slow and erratic at times. I switched to Feedzirra, and not only was there a great performance boost, there were also no errors, and it's as easy as pie. An example is shown below:

# fetching a single feed
feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")

# feed and entries accessors
feed.title          # => "Paul Dix Explains Nothing"
feed.url            # => "http://www.pauldix.net"
feed.feed_url       # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
feed.etag           # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
feed.last_modified  # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object

entry = feed.entries.first
entry.title      # => "Ruby Http Client Library Performance"
entry.url        # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
entry.author     # => "Paul Dix"
entry.summary    # => "..."
entry.content    # => "..."
entry.published  # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
entry.categories # => ["...", "..."]

If you want to do more with the feeds, for example processing each entry, the following will suffice:

source = Feedzirra::Feed.fetch_and_parse("http://www.feed-url-you-want-to-play-with.com")
puts "Parsing Downloaded XML....\n\n\n"

source.entries.each do |entry|
  begin
    puts "#{entry.summary} \n\n"
    clean_url = entry.url.gsub("+", "%2B") # my own sanitization process, ignore
    scrapArticleWithURL(clean_url)
  rescue
    puts "(****) there has been an error fetching (#{entry.title}) \n\n"
  end
end

Upvotes: 0

John Munsch

Reputation: 19528

When all you have is a hammer, everything looks like a nail. Consider a solution other than Ruby for this. Though I love Ruby and Rails and would not part with them for web development, or perhaps for a domain-specific language, I'd prefer that heavy data lifting of the type you describe be performed in Java, or perhaps Python or even C++.

Given that the destination of this parsed data is likely a database, it can act as the common point between the Rails portion of your solution and the portion in the other language. Then you're using the best tool to solve each of your problems, and the result is likely easier to work on and truly meets your requirements.

If speed is truly of the essence, why add the additional constraint of saying, "Oh, it's only of the essence as long as I get to use Ruby"?

Upvotes: 1

Héctor Vergara

Reputation: 173

You can use RFeedParser, a Ruby port of the (famous) Python Universal Feed Parser. It's based on Hpricot, and it's really fast and easy to use.

http://rfeedparser.rubyforge.org/

An example:

require 'rubygems'
require 'rfeedparser'
require 'open-uri'

feed = FeedParser::parse(open('http://feeds.feedburner.com/engadget'))

feed.entries.each do |entry|
  puts entry.title
end

Upvotes: 3

James Mead

Reputation: 3502

I haven't tried it, but I read about Feedzirra recently (it claims to be built for performance):

Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb gem for faster http gets, and libxml through nokogiri and sax-machine for faster parsing.
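For the "thousands of feeds" use case, the batch API is the relevant part: fetch_and_parse also accepts an array of URLs and fetches them in parallel through libcurl-multi. A minimal sketch, assuming the array form described in the Feedzirra README (the second URL is just a placeholder):

require 'feedzirra'

feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing",
             "http://feeds.feedburner.com/engadget"]

# parallel HTTP gets via libcurl-multi; returns a hash keyed by URL
feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)

feeds.each do |url, feed|
  puts "#{url}: #{feed.entries.size} entries"
end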

Upvotes: 10

Jeremy Weiskotten

Reputation: 978

Not sure about the performance, but a similar question was answered at Parsing Atom & RSS in Ruby/Rails?

You might also look into Hpricot, which parses XML but assumes that it's well-formed and doesn't do any validation.

http://wiki.github.com/why/hpricot
http://wiki.github.com/why/hpricot/hpricot-xml
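For reference, a minimal Hpricot sketch (this assumes a well-formed RSS 2.0 feed, since Hpricot.XML does no validation; an Atom feed would use entry elements instead of item):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hpricot.XML skips the lenient HTML fixups, so parsing is fast,
# but the input must be well-formed
doc = Hpricot.XML(open('http://feeds.feedburner.com/engadget'))

(doc/:item).each do |item|
  puts item.at(:title).inner_text
end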

Upvotes: 0
