Parsing XML with Python ElementTree with incorrect tags

Question

I am trying to use Python to parse an XML file to get the title, author, URL, and summary out of the XML feed. The XML we are gathering the data from looks like this:




Our Site RSS

2013-08-14T20:05:08-04:00
urn:uuid:c60d7202-9a58-46a6-9fca-f804s879f5ebc

    Original content available for non-commercial use under a Creative
    Commons license (Attribution-NonCommercial-NoDerivs 3.0 Unported),
    except where noted.



    Headline #1
    
        John Smith
    
    
    1234
    2013-08-13T23:45:43-04:00

    
        Here is a summary of our story
    


    Headline #2
    
        John Smith
    
    
    1235
    2013-08-13T23:45:43-04:00

    
        Here is a summary of our second story

My code is:

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

for child in root:
    print(child.tag)

Instead of the tag being "entry", the tag is "{http://www.w3.org/2005/Atom}entry" when Python prints child.tag. I tried to use:

for entry in root.findall('entry'):

But that doesn't work, since the tag for entry includes the w3 URL that is part of the root tag. Getting the grandchildren of root also shows their tag with the same prefix, e.g. "{http://www.w3.org/2005/Atom}author".

I can't change how the XML is produced, but how can I either modify it (removing the namespace from the root tag) and re-save it, or alter my code so that root.findall('entry') works?

Joseph Dunn · Accepted Answer

This is standard ElementTree behavior. If the tags you're searching for are declared within a namespace, you have to specify that namespace when you search for those tags. To avoid typing the namespace every time, you can do something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def prepend_ns(s):
    return '{http://www.w3.org/2005/Atom}' + s

for entry in root.findall(prepend_ns('entry')):
    print('Entry:')
    print('    Title: '   + entry.find(prepend_ns('title')).text)
    print('    Author: '  + entry.find(prepend_ns('author')).find(prepend_ns('name')).text)
    print('    URL: '     + entry.find(prepend_ns('link')).attrib['href'])
    print('    Summary: ' + entry.find(prepend_ns('summary')).text)
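
As an alternative to building each tag string by hand, ElementTree's find, findall and findtext accept a prefix-to-URI mapping as their namespaces argument. The following is only a minimal sketch of that approach. The SAMPLE document is a hypothetical stand-in for data.xml: its element names are inferred from the code above, the href value is made up, and the "atom" prefix and the extract_entries helper are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for data.xml. Element names are inferred from the
# answer's code; the href value is invented for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Our Site RSS</title>
  <entry>
    <title>Headline #1</title>
    <author><name>John Smith</name></author>
    <link href="http://example.com/1234"/>
    <summary>Here is a summary of our story</summary>
  </entry>
</feed>"""

# "atom" is an arbitrary local alias; only the URI has to match the document.
NS = {'atom': 'http://www.w3.org/2005/Atom'}


def extract_entries(root):
    """Return one dict per entry element, looked up via the namespace map."""
    entries = []
    for entry in root.findall('atom:entry', NS):
        entries.append({
            'title': entry.findtext('atom:title', namespaces=NS),
            'author': entry.findtext('atom:author/atom:name', namespaces=NS),
            'url': entry.find('atom:link', NS).get('href'),
            'summary': entry.findtext('atom:summary', namespaces=NS),
        })
    return entries


root = ET.fromstring(SAMPLE)
for item in extract_entries(root):
    print(item)
```

With a real file, ET.parse('data.xml').getroot() would take the place of ET.fromstring(SAMPLE). On Python 3.8 and later, a wildcard path such as root.findall('{*}entry') should also match the element regardless of its namespace, which may be the closest equivalent to the original root.findall('entry') call.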
