histelheim

Reputation: 5088

How to consume streaming XML (RSS feeds) with R?

I understand somewhat how to use the XML package to read and parse an XML file, such as a piece of an RSS feed. However, what is the basic setup for continuously reading an RSS feed?

For example, imagine that I want to set up a facility that continuously reads the feed from http://evemaps.dotlan.net/feed/sovereignty and stores the data in some kind of R data structure (say, a data.frame). I imagine that I would need to do something like the following:

  1. Set up R on a server (e.g. RStudio Server on an AWS instance)
  2. Open an HTTP connection to the RSS feed
  3. Continuously read and parse distinct bits of the feed and add them to a data.frame which grows by each entry added

However, this is still a rather vague picture. What are the basic packages and functions that I would need to string together to make this work? That is, what are the basic steps that I would need to put in place to create such a facility? I'm not looking for anyone to write this facility for me (even though that would be nice!). Rather, I'm trying to understand which overall steps are involved.

Upvotes: 0

Views: 924

Answers (1)

nootrope

Reputation: 887

I think you're looking for .

With an RSS client (i.e., your R application on AWS) you have two choices: polling or PubSubHubbub (aka webhooks, PuSH, and other names). As mentioned here, with polling you may be throttled if you request the feed more often than the publisher's policy allows. With PuSH, your R application subscribes to the feed, and the publisher's server notifies it in real time whenever there is a new update.
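For the polling option, here is a minimal sketch of what the loop could look like in R with the XML package. It is not a finished facility. The function names (`item_field`, `parse_feed`, `add_new_items`, `poll_feed`) are my own, and I'm assuming each `<item>` in the feed carries a `<guid>` that can be used to spot entries already stored; check the actual feed and adjust the fields.

```r
library(XML)

# Return the text of a named child element, or NA if the item lacks it.
item_field <- function(item, name) {
  child <- xmlChildren(item)[[name]]
  if (is.null(child)) NA_character_ else xmlValue(child)
}

# Parse RSS text into a data.frame with one row per <item>.
# The guid/title/pubDate fields are an assumption about the feed's layout.
parse_feed <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  items <- getNodeSet(doc, "//item")
  data.frame(
    guid    = vapply(items, item_field, character(1), name = "guid"),
    title   = vapply(items, item_field, character(1), name = "title"),
    pubDate = vapply(items, item_field, character(1), name = "pubDate"),
    stringsAsFactors = FALSE
  )
}

# Append only the rows whose guid is not already in the store.
add_new_items <- function(store, fresh) {
  rbind(store, fresh[!(fresh$guid %in% store$guid), , drop = FALSE])
}

# Fetch the feed n_polls times, sleeping interval_sec between requests.
# Keep the interval long enough to respect the publisher's polling policy.
poll_feed <- function(url, interval_sec = 600, n_polls = 3) {
  store <- NULL
  for (i in seq_len(n_polls)) {
    xml_text <- paste(readLines(url, warn = FALSE), collapse = "\n")
    fresh <- parse_feed(xml_text)
    store <- if (is.null(store)) fresh else add_new_items(store, fresh)
    if (i < n_polls) Sys.sleep(interval_sec)
  }
  store
}
```

You would call it as `poll_feed("http://evemaps.dotlan.net/feed/sovereignty")` and get back a data.frame that grows only when new items appear. For something that runs indefinitely you would more likely schedule the fetch-and-merge step (e.g., with cron) and persist the data.frame between runs than keep one R session sleeping in a loop.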

The SO answer linked above leads to the blog of Superfeedr, a popular pay-as-you-go hub provider, and to a post that describes the PuSH protocol's workflow and shows a command-line implementation.

You can hear more about the protocol in this Google I/O 2010 presentation by one of the engineers who crafted PuSH.

Upvotes: 1
