Parsing XML RSS feed byte stream for <item> tag

Question

I'm attempting to parse an RSS feed for the first instance of an <item> element.

import urllib2

def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    except urllib2.URLError, e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    else:
        outputPage = readPage.read()
    return outputPage

Assume the arguments being passed are correct. The function returns a str object whose value is simply the entire RSS feed. I've confirmed the type with:

a = isinstance(value, str)
if not a:
   return -1

So an entire RSS feed has been returned from the function call, and this is the point where I hit a brick wall. I've tried parsing the feed with BeautifulSoup, lxml and various other libs, but with no success. (I had some success with BeautifulSoup, but it wasn't able to pull certain child elements from the parent, for example <pubDate>.) I'm just about ready to resort to writing my own parser, but I'd like to know if anybody has any suggestions.

To recreate my error, simply call the above function with an argument similar to:

http://www.cert.org/nav/cert_announcements.rss

You'll see I'm trying to return the first <item> child:


<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

As I've said, BeautifulSoup fails to find both pubDate and link, which are crucial to my app.

Any advice would be greatly appreciated.

That1Guy · Accepted Answer

I had some success using BeautifulStoneSoup and passing lowercase tags like so:

from BeautifulSoup import BeautifulStoneSoup
xml = '<item><title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title><link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link><description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description><pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate></item>'


soup = BeautifulStoneSoup(xml)
item = soup('item')[0]
print item('pubdate'), item('link')
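
If pulling in BeautifulSoup isn't a requirement, the same lookup can probably be done with the standard library's xml.etree.ElementTree. This is a different technique from the snippet above, not a drop-in for it. It's only a minimal sketch: the FEED string is a hypothetical, trimmed-down stand-in for the real feed, and first_item is a helper name I made up for illustration. ElementTree is case-sensitive, so the tag is spelled pubDate here rather than pubdate.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a cut-down feed holding only the <item> from the question.
FEED = (
    '<rss version="2.0"><channel>'
    '<title>Example channel</title>'
    '<item>'
    '<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats'
    ' - Best Practice 16 (of 19)</title>'
    '<link>http://www.cert.org/blogs/insider_threat/2013/02/'
    'common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>'
    '<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>'
    '</item>'
    '</channel></rss>'
)


def first_item(feed_text):
    """Return (link, pubDate) for the first <item>, or None if the feed has none."""
    root = ET.fromstring(feed_text)
    # iter() walks the tree in document order, so the first hit is the first <item>.
    item = next(root.iter('item'), None)
    if item is None:
        return None
    # findtext() returns None for a missing child instead of raising.
    return item.findtext('link'), item.findtext('pubDate')


print(first_item(FEED))
```

The string returned by pageReader in the question should be usable as feed_text as-is, provided the call actually succeeded and didn't return 404.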
