Reputation: 1513
I have an XML document with many levels. Each level may have a namespace attached to it. I want to find
a specific element whose name I know, but not its namespace. For example:
my_file.xml
<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
<country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
<rank updated="yes">2</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
<continent>Asia</continent>
<holidays>
<christmas>Yes</christmas>
</holidays>
<rank updated="yes">5</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
<rank updated="yes">69</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
import lxml.etree as etree
tree = etree.parse('my_file.xml')
root = tree.getroot()
cntry_node = root.find('.//country')
The find above does not return anything (cntry_node is None). In my real data, the levels are deeper than in this example. The lxml documentation talks about namespaces. When I do this:
root.nsmap
I see this:
{None: 'aaa:bbb:ccc:ddd:eee'}
Could someone explain how to access the full nsmap and/or how to use it to find a specific element? Thanks very much.
Upvotes: 5
Views: 1930
Reputation: 22617
nsmap is not a global collection of all namespaces of an XML document
I believe your impression was that nsmap is a collection of all namespaces present in an XML document, and that this collection would be available after parsing the document. That is not the case.
nsmap gives you access to the namespace definitions of one element only. So this:
root = tree.getroot()
root.nsmap
gives you the namespace definitions known in the context of the root element. Keep in mind that "root" is just the name of a Python variable; it holds the outermost element of your XML document (I know this because you called getroot()). The outermost element of your document is:
<data xmlns="aaa:bbb:ccc:ddd:eee">
so it is expected that its nsmap would contain
{None: 'aaa:bbb:ccc:ddd:eee'}
(The nsmap has None as its key because this is a default namespace, declared without a prefix; a prefix would otherwise appear where the None is.)
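If you really do want every namespace that occurs in the document, one possible approach is to walk all elements and merge their individual nsmap values. The following is only a minimal sketch: the XML is a trimmed-down copy of the sample from the question, and collect_namespace_uris is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Trimmed-down version of the XML from the question (shortened for brevity).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <rank updated="yes">5</rank>
  </country>
</data>"""

root = etree.fromstring(XML)


def collect_namespace_uris(element):
    """Hypothetical helper: gather the namespace URIs in scope on
    `element` and on every element below it."""
    uris = set()
    # iter('*') visits elements only (no comments or processing instructions).
    for el in element.iter('*'):
        uris.update(el.nsmap.values())
    return uris


all_uris = collect_namespace_uris(root)
print(sorted(all_uris))
# ['aaa:bbb:ccc:ddd:eee', 'aaa:bbb:ccc:liechtenstein:eee', 'aaa:bbb:ccc:singapore:eee']
```

Note that this collects URIs rather than a prefix mapping: every namespace here is a default namespace, so all of them would compete for the same None key.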
The XML document has a terrible structure
Usually, the best way to deal with namespaces is to define them yourself (without taking them from the input document). Suppose we would like to find the following element:
<country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
This country element is in the default namespace with the namespace URI "aaa:bbb:ccc:liechtenstein:eee". To find it with lxml, define a mapping:
my_own_namespace_mapping = {'prefix': 'aaa:bbb:ccc:liechtenstein:eee'}
and then use it when retrieving nodes:
root.xpath('.//prefix:country', namespaces=my_own_namespace_mapping)
[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x7fea87f363f8>]
However, in the case of your input document it appears you would need to do that separately for each country element, because they are each in their own default namespace:
root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:singapore:eee'})
[<Element {aaa:bbb:ccc:singapore:eee}country at 0x7fea879cfd40>]
and so on. That is very impractical, not because lxml or namespaces are complicated, but because someone designed this XML format badly.
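To make the "separately for each country" point concrete, here is a rough sketch that repeats the same query once per namespace URI. It assumes you already know the URIs in advance; the list country_uris and the prefix c are made up for this illustration, and the XML is a shortened copy of the sample.

```python
from lxml import etree

# Shortened copy of the sample document.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee"/>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee"/>
</data>"""

root = etree.fromstring(XML)

# Assumption: the namespace URIs are known up front.
country_uris = [
    'aaa:bbb:ccc:liechtenstein:eee',
    'aaa:bbb:ccc:singapore:eee',
]

countries = []
for uri in country_uris:
    # The prefix 'c' is arbitrary; only the URI it is bound to matters.
    countries.extend(root.xpath('.//c:country', namespaces={'c': uri}))

print([c.get('name') for c in countries])
# ['Liechtenstein', 'Singapore']
```

Every new country namespace means another entry in the list, which is exactly why this document format is awkward to work with.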
By the way, once you have found one of those elements, you can use nsmap again to check whether what I say above is true:
root.xpath('.//prefix:country', namespaces={'prefix': 'aaa:bbb:ccc:liechtenstein:eee'})[0].nsmap
{None: 'aaa:bbb:ccc:liechtenstein:eee'}
Upvotes: 3
Reputation: 52858
Another option is to use {*} as the namespace wildcard:
cntry_node = root.find('.//{*}country')
Note: This only works with find(), findall(), iter(), etc.; not xpath().
See here for more details.
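As a slightly fuller sketch of the wildcard approach, the snippet below uses findall() to get every country regardless of namespace and then etree.QName to split a matched tag back into namespace and local name. The XML is a shortened copy of the sample from the question.

```python
from lxml import etree

# Shortened copy of the sample document.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee"/>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee"/>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee"/>
</data>"""

root = etree.fromstring(XML)

# '{*}country' matches an element named 'country' in any namespace.
countries = root.findall('.//{*}country')
names = [c.get('name') for c in countries]
print(names)
# ['Liechtenstein', 'Singapore', 'Panama']

# QName separates the '{uri}local' tag into its two parts.
first = etree.QName(countries[0])
print(first.namespace, first.localname)
# aaa:bbb:ccc:liechtenstein:eee country
```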
Upvotes: 5
Reputation: 24930
You could declare all namespaces, but given the structure of your sample XML, I would argue you are better off disregarding namespaces altogether and just using local-name(), like so:
cntry_node = root.xpath('.//*[local-name()="country"]')
Inspecting cntry_node returns:
[<Element {aaa:bbb:ccc:liechtenstein:eee}country at 0x1cddf1d4680>,
<Element {aaa:bbb:ccc:singapore:eee}country at 0x1cddf1d47c0>,
<Element {aaa:bbb:ccc:panama:eee}country at 0x1cddf1d45c0>]
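If you need this for several element names, the same expression can be wrapped in a small function. This is only a sketch: find_by_local_name is a hypothetical helper, and it passes the name as an XPath variable ($name) through a keyword argument to xpath() instead of formatting it into the expression string. The XML is a shortened copy of the sample.

```python
from lxml import etree

# Shortened copy of the sample document, keeping one deeper level.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<data xmlns="aaa:bbb:ccc:ddd:eee">
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
  </country>
</data>"""

root = etree.fromstring(XML)


def find_by_local_name(element, name):
    """Hypothetical helper: return all descendants of `element` whose
    local name equals `name`, whatever namespace they are in."""
    # $name is an XPath variable, filled from the keyword argument.
    return element.xpath('.//*[local-name() = $name]', name=name)


countries = find_by_local_name(root, 'country')
christmas = find_by_local_name(root, 'christmas')

print([c.get('name') for c in countries])
# ['Liechtenstein', 'Singapore']
print(christmas[0].text)
# Yes
```

Because the match is on the local name only, this works at any depth, which should also cover the deeper levels you mention in your real data.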
Upvotes: 6