Reputation: 125524
With this XML:
<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
    <subject>Reference rates</subject>
    <Sender>
        <name>European Central Bank</name>
    </Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</Envelope>
I can get the inner Cube tags like this:
from xml.etree.ElementTree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print 'Day:', day.attrib['time']
for currency in day:
    print currency.items()
Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]
The problem is that the XML above is a cleaned-up version of the original file, which defines namespaces:
<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>
When I try to get the first Cube tag from the original file, I get None:
t = ElementTree()
t.parse('eurofxref-daily.xml')
print t.find('Cube')
None
The root tag includes the namespace:
root = t.getroot()
print 'root.tag:', root.tag
root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope
So do its children:
for e in root:
    print 'e.tag:', e.tag
e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube
I can get the Cube tags if I include the namespace in the tag:
day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print 'Day:', day.attrib['time']
Day: 2013-12-20
But that is really ugly. Apart from cleaning the file before processing or doing string manipulation, is there an elegant way to handle it?
Reputation: 151531
There's a more elegant way than including the whole namespace URI in the text of the query. For a Python version that does not support the namespaces argument on ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:
from lxml.etree import ElementTree
t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print day
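With the namespaces mapping, you define each prefix once and then use only prefixes in your queries. The prefix is your own choice and does not have to match any prefix used in the file; only the URI has to match.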
Here is the output:
$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>
If you find prefixes inelegant, then you have to work on a file without namespaces. There may also be other tools out there that "cheat" and match on local-name() even when namespaces are in effect, but I don't use them.
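As one illustration of matching on the local name only, here is a minimal sketch that assumes Python 3.8 or later, where xml.etree.ElementTree accepts {*} as a wildcard for any namespace. The inline XML string is a stand-in for eurofxref-daily.xml (the XML declaration is omitted for brevity), and the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of eurofxref-daily.xml (assumption: same structure).
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
        xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

root = ET.fromstring(XML)

# {*} matches the tag in any namespace (supported from Python 3.8).
wild_day = root.find('{*}Cube/{*}Cube')

# Collect the attributes of each currency element under the day element.
wild_rates = {c.get('currency'): c.get('rate') for c in wild_day}

print('Day:', wild_day.get('time'))
print(wild_rates)
```

The trade-off is that a wildcard also matches a same-named element from an unrelated namespace, so it is best kept for documents whose vocabulary you know.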
In Python 2.7, or Python 3.3 or higher, you can use the same code as above with xml.etree instead of lxml, because namespace support was added in those versions.
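For reference, a minimal sketch of the xml.etree variant, written for Python 3 and assuming the file has the structure shown in the question. The inline XML string stands in for eurofxref-daily.xml (the XML declaration is omitted for brevity), and the NS mapping and variable names are illustrative choices, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of eurofxref-daily.xml (assumption: same structure).
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
        xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

# Prefix-to-URI mapping, defined once and passed to every query.
NS = {
    'gesmes': 'http://www.gesmes.org/xml/2002-08-01',
    'exr': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref',
}

root = ET.fromstring(XML)

# The default namespace of the document needs a prefix in the query too.
subject = root.find('gesmes:subject', NS)
day = root.find('exr:Cube/exr:Cube', NS)

# Map each currency code to its rate as a float.
rates = {c.get('currency'): float(c.get('rate'))
         for c in day.findall('exr:Cube', NS)}

print('Subject:', subject.text)
print('Day:', day.get('time'))
for code, rate in rates.items():
    print(code, rate)
```

To read from disk instead, ET.parse('eurofxref-daily.xml').getroot() would replace the ET.fromstring call.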