Tristan Tran

Reputation: 1513

Find element that has unknown namespace in lxml

I have an XML document with many levels. Each level may have a namespace attached to it. I want to find a specific element whose name I know but whose namespace I do not. For example:

my_file.xml

<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <continent>Asia</continent>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>

import lxml.etree as etree

tree = etree.parse('my_file.xml')
root = tree.getroot()

cntry_node = root.find('.//country')

The find above returns None instead of a country element. In my real data, the levels are deeper than in this example. The lxml documentation talks about namespaces. When I do this:

root.nsmap

I see this:

{None: 'aaa:bbb:ccc:ddd:eee'}

Could someone explain how to access the full nsmap and/or how to use it to find a specific element? Thanks very much.

Upvotes: 5

Views: 1930

Answers (3)

Mathias Müller

Reputation: 22617

nsmap is not a global collection of all namespaces of an XML document

I believe your impression was that nsmap is a collection of all namespaces that are present in an XML document. And that this collection would be available after parsing the document. That is not the case.

nsmap gives you access to the namespace definitions of one element only. So this:

root = tree.getroot()
root.nsmap

Gives you the namespace definitions known in the context of the root element. Keep in mind that "root" is just the name of a Python variable and in fact contains the outermost element of your XML document (I know this because you called getroot()). The outermost element of your document is:

<data xmlns="aaa:bbb:ccc:ddd:eee">

so it is expected that its nsmap would contain

{None: 'aaa:bbb:ccc:ddd:eee'}

(The nsmap has None in it because this is a default namespace without a namespace prefix that would go where the None is.)
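Since nsmap is per element, one way to see every namespace in a document is to walk the tree and merge the nsmap of each element. The following is only a minimal sketch, not something lxml provides as a single call; the shortened sample document and the name all_uris are assumptions made for illustration.

```python
from lxml import etree

# Shortened version of the question's document (two countries only).
xml = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(xml)

# Visit every element and collect the namespace URIs visible in its nsmap.
all_uris = set()
for element in root.iter(etree.Element):  # elements only
    all_uris.update(element.nsmap.values())

print(sorted(all_uris))
```

The collected URIs can then be used to build your own prefix mapping, or simply to learn which namespaces the file actually contains.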

The XML document has a terrible structure

Usually, the best way to deal with namespaces is to define them yourself (without taking them from the input document). Suppose we would like to find the following element:

<country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">

This country element is in the default namespace with the namespace URI "aaa:bbb:ccc:liechtenstein:eee". To find it with lxml, define a mapping:

my_own_namespace_mapping = {'prefix': 'aaa:bbb:ccc:liechtenstein:eee'}

and then use it when retrieving nodes:

root.xpath('.//prefix:country', namespaces=my_own_namespace_mapping)
[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x7fea87f363f8>]

However, in the case of your input document it appears you would need to do that separately for each country element because they are each in their own default namespace:

root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:singapore:eee'})
[<Element {aaa:bbb:ccc:singapore:eee}country at 0x7fea879cfd40>]

and so on. That is very impractical, not because lxml or namespaces are complicated, but because someone designed this XML format badly.
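If the set of namespace URIs is known in advance, the separate lookups can be folded into one XPath union expression. This is a sketch under that assumption; the prefixes li and sg are hypothetical names chosen here, and the sample document is shortened to two countries.

```python
from lxml import etree

# Shortened version of the question's document (two countries only).
xml = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(xml)

# Hypothetical prefixes: any names work, only the URIs must match the document.
ns = {
    'li': 'aaa:bbb:ccc:liechtenstein:eee',
    'sg': 'aaa:bbb:ccc:singapore:eee',
}

# One union expression instead of one xpath() call per namespace.
countries = root.xpath('.//li:country | .//sg:country', namespaces=ns)
names = [country.get('name') for country in countries]
print(names)
```

This still requires listing every namespace by hand, so it only helps when the URIs are few and stable.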


By the way, once you have found one of those elements, you can use nsmap again to check that what I say above is true:

root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:liechtenstein:eee'})[0].nsmap
{None: 'aaa:bbb:ccc:liechtenstein:eee'}

Upvotes: 3

Daniel Haley

Reputation: 52858

Another option is to use {*} as the namespace wildcard...

cntry_node = root.find('.//{*}country')

Note: This only works with find(), findall(), iter(), etc.; not xpath().

See here for more details.
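As a small illustration of the wildcard with find(), findall() and iter(), here is a sketch that uses a shortened two-country version of the question's document (the shortening is an assumption for brevity):

```python
from lxml import etree

# Shortened version of the question's document (two countries only).
xml = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(xml)

# {*} matches the tag name "country" in any namespace.
first = root.find('.//{*}country')
every = root.findall('.//{*}country')
via_iter = list(root.iter('{*}country'))

print(first.tag)
print([country.get('name') for country in every])
```

The matched elements keep their fully qualified tags, so the namespace of each hit can still be read from element.tag afterwards.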

Upvotes: 5

Jack Fleeting

Reputation: 24930

You could declare all the namespaces, but given the structure of your sample XML, I would argue you are better off disregarding namespaces altogether and just using local-name(); so

cntry_node = root.xpath('.//*[local-name()="country"]')
cntry_node

returns

[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x1cddf1d4680>,
 <Element {aaa:bbb:ccc:singapore:eee}country at 0x1cddf1d47c0>,
 <Element {aaa:bbb:ccc:panama:eee}country at 0x1cddf1d45c0>]
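The local-name() test can be combined with other predicates to pick out one specific element, and etree.QName can then report which namespace that element turned out to be in. A minimal sketch, assuming a shortened two-country version of the sample document; the XPath variable name wanted is made up for this example.

```python
from lxml import etree

# Shortened version of the question's document (two countries only).
xml = b"""<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(xml)

# Match on the local name only, then narrow down by attribute value.
# $wanted is an XPath variable supplied as a keyword argument.
matches = root.xpath(
    './/*[local-name()="country"][@name=$wanted]',
    wanted='Singapore',
)

# QName splits the qualified tag into namespace URI and local name.
qname = etree.QName(matches[0])
print(qname.namespace, qname.localname)
```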

Upvotes: 6
