user14834

Reputation: 1521

How Do I Fetch All Old Items on an RSS Feed?

I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"

Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?

The only solution I could find was using the "unofficial" Google Reader API, which would be something like

http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000

I don't want to make my application dependent on Google Reader.

Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?
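
For example, the WordPress-style paging I have in mind would look something like this (just a sketch; the parameter name and the empty-page stop condition are guesses that vary by platform):

    # Sketch of paged fetching via a WordPress-style "?paged=N" parameter.
    # The parameter and stop condition are assumptions that vary by
    # platform; Blogger uses start-index/max-results instead.
    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_all_items(feed_url, max_pages=50):
        items = []
        for page in range(1, max_pages + 1):
            with urllib.request.urlopen(f"{feed_url}?paged={page}") as resp:
                page_items = ET.parse(resp).getroot().findall("./channel/item")
            if not page_items:
                break  # ran past the last page
            items.extend(page_items)
        return items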

Upvotes: 139

Views: 87112

Answers (8)

At0micMutex

Reputation: 148

I just wanted to share my list of (concrete) steps:

  1. As Alex suggests, I used waybackpack to download all of the archived RSS feeds:

    waybackpack --uniques-only --progress $URL -d dir/

  2. Then I used cat to combine all the feeds:

    cat dir/*/website/path/to/rss.xml >> rssfeed.xml

  3. I then opened the concatenated file in TextWrangler and used find-and-replace in grep mode to replace all the text between one feed's closing </item> and the start of the next feed with a newline. You may need to play around with this depending on how your RSS feed is formatted, but I came up with <\/channel>.*(\n*|.*|\s*)*.*<\/description>. I pasted a part of the RSS feed containing the end of one feed and the beginning of the next into https://regex101.com/ to help me get the syntax right. If you've done it correctly, the number of matches for your regex should equal the number of occurrences of <?xml, minus one.

I uploaded my RSS feed to a simple local server and NetNewsWire was able to pick it up, and ignored any duplicate entries in the feed, so I didn't bother trying to remove duplicate entries myself.
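
If you'd rather skip the hand-written regex (and the duplicate handling), the snapshots can also be merged programmatically. A minimal sketch, assuming plain RSS 2.0 files and the same dir/ layout as step 1 (the glob path is a placeholder); it dedupes by <guid>, falling back to <link>:

    # Merge every downloaded snapshot into one feed, deduplicating items.
    # Paths below mirror the waybackpack output from step 1 (placeholders).
    import glob
    import xml.etree.ElementTree as ET

    paths = sorted(glob.glob("dir/*/website/path/to/rss.xml"))
    merged = ET.parse(paths[0])
    channel = merged.getroot().find("channel")

    # Drop the first snapshot's own items; they are re-added below.
    for item in channel.findall("item"):
        channel.remove(item)

    seen = set()
    for path in paths:
        for item in ET.parse(path).getroot().findall("./channel/item"):
            key = item.findtext("guid") or item.findtext("link")
            if key and key not in seen:
                seen.add(key)
                channel.append(item)

    merged.write("rssfeed.xml", encoding="utf-8", xml_declaration=True)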

Upvotes: 1

David Dean

Reputation: 7701

RSS/Atom feeds do not allow historic information to be retrieved. It is up to the publisher of the feed to provide it if they want to, as in the Blogger and WordPress examples you gave above.

The only reason Google Reader has more information is that it remembered each item from the first time it appeared in the feed.

There has been some talk of something like this as an extension to the Atom protocol, but I don't know if it is actually implemented anywhere.
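
(If the extension meant here is RFC 5005, Feed Paging and Archiving, a client would follow rel="prev-archive" links backwards through the archives. A hedged sketch, assuming the feed actually publishes such links, which most don't:)

    # Walk rel="prev-archive" links (RFC 5005) to collect archived entries.
    # Assumes the feed publishes such links; most feeds do not.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def fetch_archived_entries(feed_url):
        entries, url = [], feed_url
        while url:
            with urllib.request.urlopen(url) as resp:
                root = ET.parse(resp).getroot()
            entries.extend(root.findall(ATOM + "entry"))
            url = None
            for link in root.findall(ATOM + "link"):
                if link.get("rel") == "prev-archive":
                    url = link.get("href")
                    break
        return entries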

Upvotes: 73

Pranav Kasetti

Reputation: 9935

Why does this problem exist?

Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are unindexed on the Wayback Machine.

The reason Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to the feed's defined TTL configuration. The reader compares the current datetime against the posts' pubDate or lastBuildDate keys in the XML response. We can't fake the machine's datetime to work around this, because the current datetime is fetched live.

I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.

Alternative Solution(s)

In my experience, not all feeds are partial, though. The XML doesn't have to specify a datetime for each post, in which case the RSS reader has no datetime to filter the feed by. An example of this feed type can be found here.

This kind of reading experience is useful when chronological order is irrelevant and the content doesn't need to be sorted. The approach suits sites where ALL the content is valuable; the linked Essays of Paul Graham is a good example.

  1. If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
  2. Download the linked timestamped .rss file, strip the datetimes, and host the file on your own server. Note: we could implement this via an AWS Lambda (see the sketch after this list).
    1. Set up a server that fetches the RSS from live.
    2. Strip the pubDate tags from the XML file on fetch.
    3. Host the modified RSS on your own server.
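
A minimal sketch of step 2, assuming a plain RSS 2.0 feed; the source URL and output filename are placeholders, and serving the result (step 3) is left to whatever hosting you use:

    # Fetch a live feed and strip its datetime tags so the reader can't
    # filter by date. FEED_URL and the output name are placeholders.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.rss"  # hypothetical source feed

    with urllib.request.urlopen(FEED_URL) as resp:
        tree = ET.parse(resp)

    channel = tree.getroot().find("channel")
    # Remove the channel-level lastBuildDate and each item's pubDate.
    for parent in [channel] + channel.findall("item"):
        for tag in ("pubDate", "lastBuildDate"):
            for el in parent.findall(tag):
                parent.remove(el)

    tree.write("stripped.rss", encoding="utf-8", xml_declaration=True)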

Note

These are suboptimal solutions due to the loss of ordering; however, I wanted to provide a potential alternative to the Wayback Machine.

In addition, some existing answers require advanced SysDesign workarounds and more prework, and in some cases are outdated (Google Reader has been shut down). I hope this is helpful for those who really need a complete feed list. Constructing a new RSS feed from the original RSS file is not too hard.

Upvotes: 0

Axel Beckert

Reputation: 7655

All previous answers more or less relied on existing services still having a copy of the feed, or on the feed engine being able to provide older items dynamically.

There is, though, another, admittedly pro-active and rather theoretical way to do it: let your feed reader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item basis, up to as many items as you configure.

If the feed reader doesn't poll feeds regularly, the proxy could fetch known feeds on its own time-based schedule so as not to miss an item in highly volatile feeds like the one from User Friendly, which has only one item and changes every day (or at least used to). If the feed reader e.g. crashed or lost its network connection while you were away for a few days, you might otherwise lose items from its cache. Having the proxy fetch those feeds regularly (e.g. from a data center instead of from home, or on a server instead of a laptop) lets you run the feed reader only now and then, without losing items which were posted after the feed reader last fetched the feed but rotated out again before it fetched the next time.

I call that concept a Semantic Feed Proxy, and I've implemented a proof of concept called sfp. It's not much more than a proof of concept, though, and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)
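
To make the per-item caching idea concrete, here is a rough illustration (this is not sfp's actual code; the store format and names are made up):

    # Illustrative per-item feed cache, not sfp's implementation. Each poll
    # merges newly seen items (keyed by guid/link) into a local store, so
    # items survive after they rotate out of the live feed.
    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    STORE = "cache.json"  # hypothetical on-disk item store

    def poll(feed_url, max_items=1000):
        try:
            with open(STORE) as f:
                cache = json.load(f)
        except FileNotFoundError:
            cache = {}
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.parse(resp).getroot()
        for item in root.findall("./channel/item"):
            key = item.findtext("guid") or item.findtext("link")
            if key:
                cache[key] = ET.tostring(item, encoding="unicode")
        # Keep only the newest max_items (dicts preserve insertion order).
        cache = dict(list(cache.items())[-max_items:])
        with open(STORE, "w") as f:
            json.dump(cache, f)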

Upvotes: 2

Alex Klibisz

Reputation: 1323

Here's another potential solution that might not have been available when the question was originally asked, and it shouldn't require any specific service.

  1. Find the URL of the RSS feed you want and use waybackpack to get the archived URLs for that feed.
  2. Use FeedReader or a similar library to pull down each archived RSS feed (see the sketch after this list).
  3. Take the URLs from each feed and scrape them as you wish. If you're going way back in time, it's possible there might be some dead links.
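
A sketch of the idea, assuming waybackpack has already downloaded the snapshots into snapshots/ and using Python's feedparser as the "similar library"; paths are placeholders:

    # Collect every unique article URL across all archived snapshots.
    import glob
    import os
    import feedparser  # pip install feedparser

    urls = set()
    for path in glob.glob("snapshots/**/*", recursive=True):
        if not os.path.isfile(path):
            continue
        for entry in feedparser.parse(path).entries:
            if "link" in entry:
                urls.add(entry.link)

    for url in sorted(urls):
        print(url)  # step 3: scrape these; some may be dead links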

Upvotes: 6

Quinn Comendant

Reputation: 10576

As the other replies here mention, a feed may not provide archival data, but historical items may be available from another source.

Archive.org’s Wayback Machine has an API for accessing historical content, including RSS feeds (if their bots have downloaded them). I’ve created the web tool Backfeed, which uses this API to regenerate a feed containing the concatenated historical items. If you'd like to discuss the implementation in detail, please get in touch.
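
(For the curious, the kind of API call a tool like this builds on looks roughly like the following. This is a hedged sketch of the Wayback Machine CDX API, not Backfeed's actual implementation:)

    # List archived snapshots of a feed URL via the Wayback CDX API; each
    # snapshot URL can then be fetched and its items merged. The example
    # feed is the one from the question.
    import json
    import urllib.parse
    import urllib.request

    feed_url = "http://fskrealityguide.blogspot.com/feeds/posts/default"
    cdx = ("http://web.archive.org/cdx/search/cdx?output=json&fl=timestamp"
           "&filter=statuscode:200&url=" + urllib.parse.quote(feed_url, safe=""))

    with urllib.request.urlopen(cdx) as resp:
        rows = json.load(resp)

    for timestamp, in rows[1:]:  # first row is the header
        print(f"http://web.archive.org/web/{timestamp}/{feed_url}")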

Upvotes: 15

Further to what David Dean said, RSS/Atom feeds will only contain what the publisher of the feed has up at that moment, and someone would need to be actively collecting that information in order to have any historical record. Basically, Google Reader was doing this for free, and when you interacted with it you could retrieve the stored information from Google's database servers.

Now that they have retired the service, to my knowledge you have two choices. You either have to start collecting this information from your feeds of interest yourself and store the data as XML or some such (a minimal sketch of that follows), or you could pay for this data from one of the companies who sell this type of archived feed information.
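
A minimal sketch of the do-it-yourself option, meant to run on a schedule (cron or similar); the feed URL and archive filename are placeholders:

    # Poll a feed and append any unseen items to a growing XML archive.
    import os
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.rss"  # hypothetical feed of interest
    ARCHIVE = "archive.xml"

    with urllib.request.urlopen(FEED_URL) as resp:
        live = ET.parse(resp)

    if os.path.exists(ARCHIVE):
        archive = ET.parse(ARCHIVE)
    else:
        archive = live  # first run: seed the archive with the live feed

    chan = archive.getroot().find("channel")
    seen = {i.findtext("guid") or i.findtext("link") for i in chan.findall("item")}

    for item in live.getroot().findall("./channel/item"):
        key = item.findtext("guid") or item.findtext("link")
        if key not in seen:
            seen.add(key)
            chan.append(item)

    archive.write(ARCHIVE, encoding="utf-8", xml_declaration=True)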

I hope this information helps somebody.

Seán

Upvotes: 9

Rob Haupt

Reputation: 2154

In my experience with RSS, the feed is compiled from the last X items, where X is a variable set by the publisher. Certain feeds may have the full list, but for bandwidth's sake most places are likely limiting it to just the last few items.

The likely explanation for Google Reader having the old info is that it stores it on its side for users to read later.

Upvotes: 9
