Gattster

Reputation: 4781

Extract items list from XML in Python

In Python, what is the best way to extract the list of item elements from the following XML?

<iq xmlns="jabber:client" to="__anonymous__admin@localhost/8978528613056092673206" 
 from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="[email protected]" name="pgatt (1)"/>
        <item jid="[email protected]" name="pgatt (1)"/>
    </query>
</iq>

I usually use lxml with XPath, but it's not working in this case. I think my problems are due to namespaces. I'm not set on lxml and am open to using any library.

I would like a solution that is strict enough to fail if the general structure of the XML changes.

Upvotes: 1

Views: 1008

Answers (3)

MattH

Reputation: 38247

I've missed the boat, but here's how you do it while caring about namespaces.

You can either spell the namespaces out in full on every element name (the {uri}tag form that lxml's find and findall accept), or build a namespace map of prefixes and pass it to the xpath query.

from lxml import etree

data = """<iq xmlns="jabber:client" to="__anonymous__admin@localhost/8978528613056092673206"
 from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="[email protected]" name="pgatt (1)"/>
        <item jid="[email protected]" name="pgatt (1)"/>
    </query>
</iq>"""

nsmap = {
    'jc': "jabber:client",
    'di': "http://jabber.org/protocol/disco#items",
}

doc = etree.XML(data)

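# Python 3 syntax; encoding="unicode" makes tostring() return str instead of bytes.
for item in doc.xpath('//jc:iq/di:query/di:item', namespaces=nsmap):
    print(etree.tostring(item, encoding="unicode").strip())
    print("Name: %s\nJabberID: %s\n" % (item.attrib.get('name'), item.attrib.get('jid')))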

Produces:

<item xmlns="http://jabber.org/protocol/disco#items" jid="[email protected]" name="pgatt (1)"/>
Name: pgatt (1)
JabberID: [email protected]

<item xmlns="http://jabber.org/protocol/disco#items" jid="[email protected]" name="pgatt (1)"/>
Name: pgatt (1)
JabberID: [email protected]
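
Since the question asks for something that fails when the structure changes, here is a rough sketch of a stricter variant using the same prefix map with the standard library's xml.etree.ElementTree instead of lxml. The function name extract_items and the jid values in the sample are made up for illustration; the checks are one possible reading of "fail if the structure changes", not the only one.

```python
import xml.etree.ElementTree as ET

NS = {
    "jc": "jabber:client",
    "di": "http://jabber.org/protocol/disco#items",
}

def extract_items(xml_text):
    """Return (jid, name) pairs; raise ValueError if the layout is unexpected."""
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{namespace-uri}local-name".
    if root.tag != "{%s}iq" % NS["jc"]:
        raise ValueError("unexpected root element: %s" % root.tag)
    # Only direct children are searched, so a moved <query> is not found.
    queries = root.findall("di:query", NS)
    if len(queries) != 1:
        raise ValueError("expected exactly one query element, found %d" % len(queries))
    pairs = []
    for item in queries[0].findall("di:item", NS):
        jid, name = item.get("jid"), item.get("name")
        if jid is None or name is None:
            raise ValueError("item is missing its jid or name attribute")
        pairs.append((jid, name))
    return pairs

# Sample input; the jid values are placeholders, not the ones from the question.
sample = """<iq xmlns="jabber:client" from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="pgatt (1)"/>
        <item jid="room-b@conference.localhost" name="pgatt (1)"/>
    </query>
</iq>"""

for jid, name in extract_items(sample):
    print(jid, name)
```

Because the paths are relative to the root and one level deep at a time, a query element in the wrong namespace or nested somewhere else raises an error instead of silently returning an empty list.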

Upvotes: 0

D.Shawley

Reputation: 59563

I'm not sure about lxml but you can use an expression like //*[local-name()="item"] to pull out the item elements regardless of their namespace.
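
For what it's worth, local-name() is plain XPath 1.0, so the expression should work with lxml's xpath() as well. A minimal sketch, assuming lxml is installed; the XML is a trimmed copy of the question's document and the jid values are placeholders:

```python
from lxml import etree

# Trimmed sample; the jid values are made up for illustration.
data = """<iq xmlns="jabber:client" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="pgatt (1)"/>
        <item jid="room-b@conference.localhost" name="pgatt (1)"/>
    </query>
</iq>"""

doc = etree.XML(data)

# Match every element whose local name is "item", whatever its namespace.
items = doc.xpath('//*[local-name()="item"]')

for item in items:
    print(item.get("jid"), item.get("name"))
```

Note the trade-off: this matches item elements in any namespace and at any depth, so it will not fail if the surrounding structure changes.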

You might also want to take a look at Amara for XML processing.

>>> import amara.bindery
>>> doc = amara.bindery.parse(
...     '''<iq xmlns="jabber:client" 
...          to="__anonymous__admin@localhost/8978528613056092673206"
...          from="conference.localhost" id="disco" type="result">
...          <query xmlns="http://jabber.org/protocol/disco#items">
...            <item jid="[email protected]" name="pgatt (1)"/>
...            <item jid="[email protected]" name="pgatt (1)"/>
...          </query>
...        </iq>''')
>>> for item in doc.iq.query.item:
...   print item.jid, item.name
...
[email protected] pgatt (1)
[email protected] pgatt (1)
>>>

Once I discovered Amara, I would never consider processing XML any other way.

Upvotes: 1

Bryce Siedschlaw

Reputation: 4226

I answered a similar question earlier about how to parse and search through xml data.

Full text searching XML data with Python: best practices, pros & cons

You'll want to look at the xml2json function from that answer. It expects a minidom object. This is how I load my XML from a URL; yours may come from somewhere else.

from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
from xml.dom import minidom

x = minidom.parse(urlopen(url))
data = xml2json(x)  # xml2json is defined in the linked answer

Or if you have a string rather than a URL:

x = minidom.parseString(xml_string)
data = xml2json(x)

The xml2json function will then return a dictionary with all values found in the xml. You may have to try it out and print the output to see what the layout looks like.
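
For the XML in the question specifically, minidom can also pull the items out directly, without the dictionary conversion. This is only a sketch and is not the xml2json function from the linked answer; the variable names and the jid values in the sample are placeholders.

```python
from xml.dom import minidom

DISCO_ITEMS_NS = "http://jabber.org/protocol/disco#items"

# Trimmed sample; the jid values are made up for illustration.
xml_string = """<iq xmlns="jabber:client" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="pgatt (1)"/>
        <item jid="room-b@conference.localhost" name="pgatt (1)"/>
    </query>
</iq>"""

doc = minidom.parseString(xml_string)

# Namespace-aware lookup: only <item> elements in the disco#items namespace.
nodes = doc.getElementsByTagNameNS(DISCO_ITEMS_NS, "item")

# getAttribute returns an empty string when the attribute is absent.
pairs = [(node.getAttribute("jid"), node.getAttribute("name")) for node in nodes]

for jid, name in pairs:
    print(jid, name)
```

Like any search by tag name, this finds matching items anywhere in the document, so add your own checks if the position of the query element matters.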

Upvotes: 1
