Reputation: 7007
As the title says, I am trying to build a small application that aggregates RSS from different blogs. I am testing out feedparser for this, but I am stuck trying to write a piece of code that would detect the RSS feed.
Most people would just enter www.mysite.com/blog, which is not the URL of the RSS feed itself. Is there a way for me to detect the RSS feed? I am trying to replicate the browser behavior, where the browser can see the RSS URL for a page.
Any ideas?
Upvotes: 0
Views: 792
Reputation: 501
There is a great app exactly for this, called Feedjack.
But you will find yourself banging your head against the wall when the RSS feed contains fewer than 100 chars.
For full control (aggregating exactly what you need) and for websites without any RSS feeds, I would recommend Scrapy.
Upvotes: 0
Reputation: 239470
Use something like BeautifulSoup to parse the HTML document and look for the RSS feeds. The following is a basic example and not necessarily the most efficient:
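from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML source of the page the user entered
soup = BeautifulSoup(html_doc, "html.parser")
# <link> tags that advertise an RSS feed
rss_links = soup.select('link[type="application/rss+xml"]')
for link in rss_links:
    rss_url = link.get('href')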
See the full BeautifulSoup documentation.
Upvotes: 1
Reputation: 1125398
Browsers use RSS feed auto-discovery and Atom feed auto-discovery to find feeds on a given web page.
For example, the django question lists are available via an Atom feed which is linked in the HTML header of the associated pages with:
<link rel="alternate" type="application/atom+xml" title="Feed of questions tagged python" href="/feeds/tag/python" />
You'll need to parse out the <link rel="alternate"> tags in a given page to discover these; anything with an application/atom+xml or application/rss+xml type fits.
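The discovery step described above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a definitive implementation: the names FeedLinkParser and discover_feeds, the example.com URL, and the choice to resolve relative href values against the page URL are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that mark a <link> tag as a feed (from the answer above)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Hypothetical helper: collects feed URLs from <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # href is often relative, so resolve it against the page URL
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html_doc, base_url):
    """Return the feed URLs advertised in html_doc, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_doc)
    return parser.feeds


# Usage with the <link> tag quoted above; the page URL is made up
sample = (
    '<html><head><link rel="alternate" type="application/atom+xml" '
    'title="Feed of questions tagged python" href="/feeds/tag/python" />'
    "</head></html>"
)
print(discover_feeds(sample, "https://example.com/questions/tagged/python"))
```

In the aggregator, html_doc would be the downloaded source of whatever URL the user typed in, and each discovered URL could then be handed to feedparser. Pages that advertise no feed simply yield an empty list.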
Upvotes: 1