hDan

Reputation: 467

Universal Feed Parser issue

I am working on a python script to parse RSS links.

I use the Universal Feed Parser and I am encountering issues with some links, for example while trying to parse the FreeBSD Security Advisories feed. Here is the sample code:

    feed = feedparser.parse(url)
    items = feed["items"]

Basically, feed["items"] should return all the entries in the feed (the fields that start with item), but it always comes back empty.

I can also confirm that the following links are parsed as expected:

Is this an issue with the feeds, in that the ones from FreeBSD do not respect the standard?

EDIT:

I am using Python 2.7. I ended up using feedparser in combination with BeautifulSoup, as Hai Vu proposed. Here is the sample code I ended up with, slightly changed:

    def rss_get_items_feedparser(self, webData):
        feed = feedparser.parse(webData)
        items = feed["items"]
        return items

    def rss_get_items_beautifulSoup(self, webData):
        soup = BeautifulSoup(webData)
        for item_node in soup.find_all('item'):
            item = {}
            for subitem_node in item_node.findChildren():
                if subitem_node.name is not None:
                    item[str(subitem_node.name)] = str(subitem_node.contents[0])
            yield item

    def rss_get_items(self, webData):
        items = self.rss_get_items_feedparser(webData)
        if len(items) > 0:
            return items
        return self.rss_get_items_beautifulSoup(webData)

    def parse(self, url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        webData = response.read()
        for item in self.rss_get_items(webData):
            pass  # parse each item here

I also tried passing the response directly to rss_get_items, without reading it, but it throws an exception when BeautifulSoup tries to read it:

  File "bs4/__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable        

Upvotes: 0

Views: 1301

Answers (1)

Hai Vu

Reputation: 40688

I found out the problem was with the use of namespace.

For FreeBSD's RSS feed:

    <rss xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns="http://www.w3.org/1999/xhtml"
         version="2.0">

For Ubuntu's feed:

    <rss xmlns:atom="http://www.w3.org/2005/Atom"
         version="2.0">

When I remove the extra namespace declaration from FreeBSD's feed, everything works as expected.
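You can see the effect of that default namespace with just the standard library: once a default xmlns is declared, every element's tag becomes namespace-qualified, so a search for a plain item element finds nothing. A small sketch (the tiny XML snippets are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Without a default namespace, <item> is simply tagged 'item'.
plain = ET.fromstring(
    "<rss version='2.0'><channel><item/></channel></rss>")
print([el.tag for el in plain.iter('item')])
# ['item']

# With xmlns="http://www.w3.org/1999/xhtml", every tag becomes
# '{http://www.w3.org/1999/xhtml}item', so the same search for a
# plain 'item' comes back empty -- the same thing that trips up
# feedparser on FreeBSD's feed.
namespaced = ET.fromstring(
    "<rss xmlns='http://www.w3.org/1999/xhtml' version='2.0'>"
    "<channel><item/></channel></rss>")
print([el.tag for el in namespaced.iter('item')])
# []
print(namespaced[0][0].tag)
# {http://www.w3.org/1999/xhtml}item
```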

So what does it mean for you? I can think of a couple of different approaches:

  1. Use something else, such as BeautifulSoup. I tried it and it seems to work.
  2. Download the whole RSS feed, apply some search/replace to fix up the namespaces, then use feedparser.parse() afterward. This approach is a big hack; I would not use it myself.
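For completeness, the search/replace in option 2 could be a single substitution applied before calling feedparser.parse(). The fix_default_namespace() helper below is only an illustration of the hack, and shares its fragility:

```python
import re

def fix_default_namespace(xml_text):
    # Drop only the default xmlns="..." declaration from the <rss>
    # start tag; prefixed declarations like xmlns:atom= are kept.
    return re.sub(r'(<rss\b[^>]*?)\s+xmlns="[^"]*"', r'\1',
                  xml_text, count=1)

feed_xml = ('<rss xmlns:atom="http://www.w3.org/2005/Atom" '
            'xmlns="http://www.w3.org/1999/xhtml" version="2.0">'
            '<channel><item><title>t</title></item></channel></rss>')
print(fix_default_namespace(feed_xml))
# The cleaned text can then be handed to feedparser.parse() as usual.
```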

Update

Here is sample code for rss_get_items(), which yields the items from an RSS feed. Each item is a dictionary with standard keys such as title, pubdate, link, and guid.

    from bs4 import BeautifulSoup
    import urllib2

    def rss_get_items(url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        soup = BeautifulSoup(response)

        for item_node in soup.find_all('item'):
            item = {}
            for subitem_node in item_node.findChildren():
                key = subitem_node.name
                value = subitem_node.text
                item[key] = value
            yield item

    if __name__ == '__main__':
        url = 'http://www.freebsd.org/security/rss.xml'
        for item in rss_get_items(url):
            print item['title']
            print item['pubdate']
            print item['link']
            print item['guid']
            print '---'

Output:

FreeBSD-SA-14:04.bind
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
---
FreeBSD-SA-14:03.openssl
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
---
...

Notes:

  • I omitted error checking for the sake of brevity.
  • I recommend using the BeautifulSoup API only when feedparser fails, because feedparser is the right tool for the job. Hopefully it will be updated to be more forgiving in the future.

Upvotes: 1
