Reputation: 179
I have the following code to parse an XML document, but it just won't let me iterate through the children:
import urllib, urllib2, re, time, os
import xml.etree.ElementTree as ET

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://feeds.rasset.ie/rteavgen/player/playlist?showId=10056467'
data = wgetUrl(newUrl)
tree = ET.fromstring(data)
#tree = ET.parse(data)
for elem in tree.iter('entry'):
    print elem.tag, elem.attrib
Now, if I remove 'entry' from the iter call, I get output like this (why the URL?):
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}published {}
{http://www.w3.org/2005/Atom}updated {}
{http://www.w3.org/2005/Atom}title {'type': 'text'}
But if I write the iter statement like this, it still does not find the children of entry:
for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    print elem.tag, elem.attrib
I still only get the entry element on its own, not the children:
{http://www.w3.org/2005/Atom}entry {}
Any idea what I am doing wrong?
I have searched everywhere but can't figure this out. I am new to all this, so sorry if it is something stupid.
Upvotes: 0
Views: 3485
Reputation: 1121266
If you are parsing an Atom feed, you really want to use the feedparser library instead, which takes care of all these details for you and many more.
The {http://www.w3.org/2005/Atom} part is a namespace. You need to specify that namespace to select the entry tags:
for elem in tree.iterfind('ns:entry', {'ns': 'http://www.w3.org/2005/Atom'}):
where I used a dictionary to map the ns: prefix to the namespace, or you can use the same curly braces syntax:
for elem in tree.iterfind('{http://www.w3.org/2005/Atom}entry'):
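To see the two spellings side by side, here is a minimal self-contained sketch. The inline SAMPLE feed is made up for illustration (it is not the real response from the URL in the question), and it assumes the root element is the Atom feed, so that entry is a direct child of the root.

```python
import xml.etree.ElementTree as ET

ATOM = 'http://www.w3.org/2005/Atom'

# Hypothetical sample feed, standing in for the downloaded data.
SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><id>1</id><title type="text">First</title></entry>'
    '<entry><id>2</id><title type="text">Second</title></entry>'
    '</feed>'
)

root = ET.fromstring(SAMPLE)

# A bare 'entry' matches nothing, because every tag carries the namespace.
plain = list(root.iter('entry'))

# Prefix form: 'ns' is an arbitrary label mapped to the namespace URI.
by_prefix = list(root.iterfind('ns:entry', {'ns': ATOM}))

# Curly braces form: the URI is written directly in the path.
by_braces = list(root.iterfind('{%s}entry' % ATOM))

print('%d %d %d' % (len(plain), len(by_prefix), len(by_braces)))  # prints "0 2 2"
```

Both namespaced forms select the same elements; the prefix form is mostly a readability choice when a path mentions the namespace several times.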
Once you have the element, you still need to explicitly find its children:
for elem in tree.iterfind('{http://www.w3.org/2005/Atom}entry'):
    for child in elem:
        print child
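A slightly fuller sketch of the same loop, collecting each child's tag (with the namespace stripped), attributes, and text. The sample data and the helper names local_name and entry_children are made up for illustration. It uses iter() rather than iterfind() so that it also matches when the document root is itself an entry, which I can't tell for sure from the output in the question.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical sample feed, standing in for the downloaded data.
SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><id>1</id><title type="text">First</title></entry>'
    '<entry><id>2</id><title type="text">Second</title></entry>'
    '</feed>'
)

def local_name(tag):
    """Strip a leading '{namespace}' from a tag, if present."""
    return tag.split('}', 1)[-1]

def entry_children(xml_text):
    """Return one list of (name, attrib, text) tuples per entry element."""
    root = ET.fromstring(xml_text)
    result = []
    # iter() walks the whole tree and includes the root element itself.
    for entry in root.iter(ATOM + 'entry'):
        # Iterating an element yields its direct children.
        result.append([(local_name(c.tag), c.attrib, c.text) for c in entry])
    return result

for children in entry_children(SAMPLE):
    for name, attrib, text in children:
        print('%s %r %r' % (name, attrib, text))
```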
Upvotes: 1