rpb

Reputation: 3299

Getting empty list when accessing element and tag in xml file using ElementTree

The goal is to get the value of the endTime tag from the following XML:

<epochs xmlns="http://www.egi.com/epochs_mff" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <epoch>
    <beginTime>0</beginTime>
    <endTime>3586221000</endTime>
    <firstBlock>1</firstBlock>
    <lastBlock>897</lastBlock>
  </epoch>
  <epoch>
    <beginTime>3750143000</beginTime>
    <endTime>5549485000</endTime>
    <firstBlock>898</firstBlock>
    <lastBlock>1347</lastBlock>
  </epoch>
</epochs>

Yet accessing the tag directly returns an empty list:

import xml.etree.ElementTree as ET
tree = ET.parse(r'epochs.xml')
epoch_list = tree.findall("epoch")  # returns []

However, looping through the tree does print the endTime values, along with the text of the other child elements.

import xml.etree.ElementTree as ET
tree = ET.parse(r'epochs.xml')

for elem in tree.getroot():  # iterate over the root's children; an ElementTree object itself is not iterable
    for subelem in elem:
        print(subelem.text)

How can I retrieve the endTime value (for example 3586221000) directly?

Upvotes: 1

Views: 800

Answers (1)

Valdi_Bo

Reputation: 30991

The reason your code failed is that your XML uses a default namespace (xmlns="http://...").

But your call to findall passes epoch without any namespace, so it matches only elements that are in no namespace and therefore finds nothing.
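A minimal sketch of what the parser actually sees may make this clearer. It assumes a shortened inline copy of the XML from the question in place of epochs.xml, and the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the question's XML, standing in for epochs.xml (assumption).
XML = """<epochs xmlns="http://www.egi.com/epochs_mff" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <epoch>
    <beginTime>0</beginTime>
    <endTime>3586221000</endTime>
  </epoch>
</epochs>"""

root = ET.fromstring(XML)

# ElementTree stores each tag as "{namespace-uri}localname".
root_tag = root.tag
child_tags = [child.tag for child in root]

# A bare "epoch" only matches elements in no namespace, so nothing is found.
unqualified = root.findall("epoch")

print(root_tag)     # {http://www.egi.com/epochs_mff}epochs
print(child_tags)   # ['{http://www.egi.com/epochs_mff}epoch']
print(unqualified)  # []
```

The tag names carry the namespace URI in braces, which is why the plain name epoch never matches.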

To process namespaced XML, you have to:

  • create a dictionary of used namespaces ({prefix: namespace}),
  • include the prefix of the relevant namespace in the XPath expression,
  • pass the above dictionary as the second argument of findall.

Something like:

ns = {'ep': 'http://www.egi.com/epochs_mff'}
epoch_list = tree.findall('ep:epoch', ns)

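Then the result, for the two epoch elements in your sample, is:

[<Element '{http://www.egi.com/epochs_mff}epoch' at 0x...>, <Element '{http://www.egi.com/epochs_mff}epoch' at 0x...>]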

To get the content of your endTime element, if you don't care about any intermediate elements in the XML tree, run:

tree.findtext('.//ep:endTime', namespaces=ns)

Another option is to pass the full path, starting from the children of the root element. Remember to include the namespace prefix at each step:

tree.findtext('ep:epoch/ep:endTime', namespaces=ns)

If you have multiple endTime elements, as in your sample, one possible solution is to process them in a loop.

findtext does not help here, because it returns the text of only the first matching element. Instead, loop over the result of findall and, within the loop, read the text of the current element and use it as needed, e.g.:

for it in tree.findall('ep:epoch/ep:endTime', namespaces=ns):
    print(it.text)

Of course, replace print with whatever you need to consume the text found.
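Putting the pieces together, here is a self-contained sketch that collects every endTime as an integer. It assumes an inline copy of the question's XML instead of epochs.xml, and end_times is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's XML, standing in for epochs.xml (assumption).
XML = """<epochs xmlns="http://www.egi.com/epochs_mff" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <epoch>
    <beginTime>0</beginTime>
    <endTime>3586221000</endTime>
    <firstBlock>1</firstBlock>
    <lastBlock>897</lastBlock>
  </epoch>
  <epoch>
    <beginTime>3750143000</beginTime>
    <endTime>5549485000</endTime>
    <firstBlock>898</firstBlock>
    <lastBlock>1347</lastBlock>
  </epoch>
</epochs>"""

# Map an arbitrary prefix to the document's default namespace.
NS = {"ep": "http://www.egi.com/epochs_mff"}


def end_times(root, ns=NS):
    """Return every epoch's endTime as an int (hypothetical helper)."""
    return [int(el.text) for el in root.findall("ep:epoch/ep:endTime", ns)]


root = ET.fromstring(XML)
times = end_times(root)
print(times)  # [3586221000, 5549485000]
```

With a file on disk, ET.parse('epochs.xml').getroot() can be passed to the helper in place of the parsed string.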

Upvotes: 1
