Reputation: 467
I am working on a Python script to parse RSS links.
I use the Universal Feed Parser and I am encountering issues with some links, for example while trying to parse the FreeBSD Security Advisories feed. Here is the sample code:
feed = feedparser.parse(url)
items = feed["items"]
Basically, feed["items"] should return all the entries in the feed (the fields that start with item), but it always comes back empty.
I can also confirm that the following links are parsed as expected:
Is this an issue with the feeds, in that the ones from FreeBSD do not respect the standard?
EDIT:
I am using Python 2.7. I ended up using feedparser in combination with BeautifulSoup, as Hai Vu proposed. Here is the sample code I ended up with, slightly changed:
def rss_get_items_feedparser(self, webData):
    feed = feedparser.parse(webData)
    items = feed["items"]
    return items

def rss_get_items_beautifulSoup(self, webData):
    soup = BeautifulSoup(webData)
    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            if subitem_node.name is not None:
                item[str(subitem_node.name)] = str(subitem_node.contents[0])
        yield item

def rss_get_items(self, webData):
    items = self.rss_get_items_feedparser(webData)
    if len(items) > 0:
        return items
    return self.rss_get_items_beautifulSoup(webData)

def parse(self, url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    webData = response.read()
    for item in self.rss_get_items(webData):
        # parse items
I also tried passing the response directly to rss_get_items, without reading it, but it throws an exception when BeautifulSoup tries to read:
File "bs4/__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
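The try-feedparser-then-fall-back-to-BeautifulSoup pattern used in rss_get_items above can be sketched in a library-free form. The parser callables below are toy stand-ins (not the real feedparser/BeautifulSoup code), just to show the fallback mechanics:

```python
def get_items_with_fallback(data, parsers):
    """Try each parser in turn and return the first non-empty item list.

    `parsers` is an ordered list of callables (e.g. a feedparser-based
    one first, a BeautifulSoup-based one second); a parser that raises
    or yields nothing simply hands off to the next one.
    """
    for parse in parsers:
        try:
            items = list(parse(data))
        except Exception:
            continue
        if items:
            return items
    return []

# Toy stand-ins: the "strict" parser rejects everything,
# the "lenient" one extracts <item> bodies with a crude split.
def strict_parser(data):
    raise ValueError("unrecognized namespace")

def lenient_parser(data):
    return [chunk.split('</item>')[0]
            for chunk in data.split('<item>')[1:]]
```

For example, `get_items_with_fallback('<rss><item>a</item></rss>', [strict_parser, lenient_parser])` returns `['a']` because the strict parser raises and the lenient one takes over.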
Upvotes: 0
Views: 1301
Reputation: 40688
I found out the problem was with the use of namespaces.
For FreeBSD's RSS feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/1999/xhtml"
version="2.0">
For Ubuntu's feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
version="2.0">
When I remove the extra namespace declaration from FreeBSD's feed, everything works as expected.
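The effect of that default namespace can be demonstrated with the standard library alone. The two header snippets below are trimmed, illustrative reconstructions of the FreeBSD-style and Ubuntu-style feeds, not the live documents:

```python
import xml.etree.ElementTree as ET

# Trimmed reconstruction of the FreeBSD-style header (extra default xmlns).
FREEBSD_STYLE = """<rss xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns="http://www.w3.org/1999/xhtml" version="2.0">
  <channel><item><title>FreeBSD-SA-14:04.bind</title></item></channel>
</rss>"""

# Trimmed reconstruction of the Ubuntu-style header (no default xmlns).
UBUNTU_STYLE = """<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel><item><title>USN-0000-1</title></item></channel>
</rss>"""

# Without a default namespace, tags keep their plain names ...
assert ET.fromstring(UBUNTU_STYLE).tag == 'rss'

# ... but the default xmlns renames every element, so a parser looking
# for plain 'rss'/'channel'/'item' finds nothing.
root = ET.fromstring(FREEBSD_STYLE)
assert root.tag == '{http://www.w3.org/1999/xhtml}rss'
assert root.find('channel') is None
assert root.find('{http://www.w3.org/1999/xhtml}channel') is not None
```

This is exactly why feedparser returns an empty item list for the FreeBSD feed: the elements it looks for are no longer named what it expects.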
So what does it mean for you? I can think of a couple of different approaches:
1. Download the feed, strip out the extra namespace declaration, and call feedparser.parse() on the result afterward. This approach is a big hack; I would not use it myself.
2. Switch to BeautifulSoup. Below is sample code for rss_get_items(), which returns a list of items from an RSS feed. Each item is a dictionary with some standard keys such as title, pubdate, link, and guid.
from bs4 import BeautifulSoup
import urllib2

def rss_get_items(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response)
    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            key = subitem_node.name
            value = subitem_node.text
            item[key] = value
        yield item

if __name__ == '__main__':
    url = 'http://www.freebsd.org/security/rss.xml'
    for item in rss_get_items(url):
        print item['title']
        print item['pubdate']
        print item['link']
        print item['guid']
        print '---'
Output:
FreeBSD-SA-14:04.bind
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
---
FreeBSD-SA-14:03.openssl
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
---
...
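A similar namespace-agnostic extraction can also be sketched with only the standard library's xml.etree.ElementTree, matching on local tag names so the default namespace no longer matters. This is an alternative sketch, with a hypothetical function name rss_get_items_etree:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # ElementTree reports namespaced tags as '{uri}name'; keep only 'name'.
    return tag.rsplit('}', 1)[-1]

def rss_get_items_etree(xml_text):
    """Yield one dict per <item> element, ignoring any default namespace."""
    for node in ET.fromstring(xml_text).iter():
        if local_name(node.tag) == 'item':
            yield {local_name(child.tag): (child.text or '')
                   for child in node}
```

For example, feeding it a FreeBSD-style document such as `<rss xmlns="http://www.w3.org/1999/xhtml" version="2.0"><channel><item><title>T</title><link>L</link></item></channel></rss>` yields `{'title': 'T', 'link': 'L'}` despite the default namespace.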
Notes:
I recommend only falling back to the BeautifulSoup API when feedparser fails, because feedparser is the right tool for the job. Hopefully it will be updated to be more forgiving in the future.
Upvotes: 1