Reputation: 5940

How to parse .xml file with multiple nested children in python?

I am using python to parse a .xml file which is quite complicated since it has a lot of nested children; accessing some of the values contained in it is quite annoying since the code starts to become pretty bad looking.

Let me first present you the .xml file:

<?xml version="1.0" encoding="utf-8"?>
<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBB>
        <stepBBB stepBBB1="pinco">1</stepBBB>
      </stepBB>
      <stepBC>
        <stepBCA>
          <GOAL2>22222</GOAL2>
        </stepBCA>
      </stepBC>
      <stepBD>-NO WOMAN NO CRY                                            
              -I SHOT THE SHERIF                                                           
              -WHO LET THE DOGS OUT
      </stepBD>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepB stepB1="12" stepB2="13" />
      <stepC>XXX</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="saf12">33333</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepB stepB1="14" stepB2="15" />
      <stepC>YYY</stepC>
      <stepC>
        <stepCC>
          <stepCC GOAL4="fwe34">44444</stepCC>
        </stepCC>
      </stepC>
    </GOAL3>
  </step3>
</Start>

My goal would be to access the values contained inside of the children named "GOAL" in a nicer way then the one I wrote in my sample code below. Furthermore I would like to find an automated way to find the values of GOALS having the same type of tag belonging to different children having the same name:

Example: GIOVANNI and ANDREA are both under the same kind of tag (GOAL3_NAME) and belong to different children having the same name (<step3>) though.

Here is the code that I wrote:

import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()

GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)

GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)

GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)

GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)

GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)

and the output that I get is the following:

The output that I would like should be like this:

11111 
22222
GIOVANNI
33333
ANDREA
44444

As you can see I am able to read GOAL1 and GOAL2 easily but I am looking for a nicer code practice to access those values since it seems to me too long and hard to read/understand.

The second thing I would like to do is getting GOAL3 and GOAL4 in a automated way so that I do not have to repeat similar lines of codes and make it more readable and understandable.

Note: as you can see I was not able to read GOAL3. If possible I would like to get both the GOAL3_NAME and GOAL3_ID

In order to make the .xml file structure more understandable I post an image of what it looks like:

The highlighted elements are what I am looking for.

Upvotes: 3

Answers (2)

nick_gabpe

Reputation: 5815

import xml.etree.cElementTree as ET
data = ET.parse('test.xml')    
for d in data.iter():
       if d.tag in ["GOAL1", "GOAL2", "stepCC", "stepCC"]:
          print d.text
       elif d.tag in ["GOAL3", "GOAL4"]:
          print d.attrib.values()[0]

Upvotes: 1

Ari Gold

Reputation: 1548

here is simple example for iterating from head to tail with a recursive method and cElementTree(15-20x faster), you can than collect the needed information from that

import xml.etree.cElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def get_tail(root):
    for child in root:
        print child.text
        get_tail(child)
get_tail(root)

Upvotes: 1

How to parse .xml file with multiple nested children in python?

Answers (2)

Related Questions