Tomas Bruckner

Reputation: 728

XML parse into array in python

I have XML like this:

<?xml version="1.0" ?>
<iq id="123" to="test" type="result">
    <query xmlns="jabber:iq:roster">
        <item jid="foo" subscription="both"/>
        <item jid="bar" subscription="both"/>
    </query>
</iq>

I would like to collect the jid attribute of each item into an array. I thought something like this would work:

import xml.etree.ElementTree as ET

myarr = []

xml = '<?xml version="1.0" ?><iq id="123" to="test" type="result"><query xmlns="jabber:iq:roster"><item jid="foo" subscription="both"/><item jid="bar" subscription="both"/></query></iq>'

root = ET.fromstring(xml)

for item in root.findall('query'):
    t = item.get('jid')
    myarr.append(t)
    print (t)

Upvotes: 0

Views: 2711

Answers (2)

Jonathan Eunice

Reputation: 22453

I endorse @alecxe's approach, which I will label "handle the namespaces." That is the most general and correct approach. Unfortunately, namespaces are often ugly and wordy, and they needlessly complicate XPath expressions.

For the many simple cases where namespaces are an artifact of the XML world's desire for über-precision and not truly necessary to identify the nodes in a document, a simpler "eliminate the namespaces" alternative allows more concise searches. The key routine is:

def strip_namespaces(tree):
    """
    Strip the namespaces from an ElementTree in order to make
    processing easier. Adapted from @nonagon's answer
    at http://stackoverflow.com/a/25920989/240490
    """
    for el in tree.iter():
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip namespaces
        for k, v in list(el.attrib.items()):  # copy: attrib is modified below
            if '}' in k:
                newkey = k.split('}', 1)[1]
                el.attrib[newkey] = v
                del el.attrib[k]  # drop only the namespaced original
    return tree

Then the program continues much as before, but without worrying about those pesky namespaces:

root = ET.fromstring(xml)
strip_namespaces(root)

for item in root.findall('.//item'):
    t = item.attrib.get('jid')
    myarr.append(t)
    print (t)

This approach is not suitable if you need to modify the ElementTree and re-emit XML, because the namespace information is discarded. If you just want to deconstruct the tree and grab data from it, it works well.
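If mutating the tree is a concern, one alternative is to leave it untouched and match item in any namespace. This is only a sketch, and it assumes Python 3.8 or later, where ElementTree's path syntax accepts the {*} namespace wildcard. The jids name is just an illustrative variable:

```python
import xml.etree.ElementTree as ET

xml = ('<?xml version="1.0" ?>'
       '<iq id="123" to="test" type="result">'
       '<query xmlns="jabber:iq:roster">'
       '<item jid="foo" subscription="both"/>'
       '<item jid="bar" subscription="both"/>'
       '</query></iq>')

root = ET.fromstring(xml)

# "{*}item" matches <item> elements in any namespace (or none).
# The tree itself is not modified, so it can still be re-emitted as-is.
jids = [item.get('jid') for item in root.findall('.//{*}item')]
print(jids)
```

On older Python versions the wildcard is not available, so stripping the namespaces or spelling them out remain the options.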

Upvotes: 1

alecxe

Reputation: 473873

You need to handle namespaces. One option would be to paste the namespace into the XPath expression:

for item in root.findall('.//{%(ns)s}query/{%(ns)s}item' % {'ns': 'jabber:iq:roster'}):
    t = item.attrib.get('jid')
    myarr.append(t)
    print (t)

Prints:

foo
bar
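Another way to write the same search, as a rough sketch, is to pass a prefix-to-URI mapping as the second argument of findall() instead of formatting the URI into the path. The prefix r and the name ns are arbitrary choices made for this example. Only the URI has to match the document:

```python
import xml.etree.ElementTree as ET

xml = ('<?xml version="1.0" ?>'
       '<iq id="123" to="test" type="result">'
       '<query xmlns="jabber:iq:roster">'
       '<item jid="foo" subscription="both"/>'
       '<item jid="bar" subscription="both"/>'
       '</query></iq>')

root = ET.fromstring(xml)

# The parser stores namespaced tags as "{uri}localname",
# which is why a plain findall('query') matches nothing.
query_tag = root.find('{jabber:iq:roster}query').tag
print(query_tag)

# Map an arbitrary prefix to the namespace URI and use it in the path.
ns = {'r': 'jabber:iq:roster'}
myarr = [item.get('jid') for item in root.findall('r:query/r:item', ns)]
print(myarr)
```

The mapping keeps the path readable when several elements share one namespace.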

Upvotes: 1
