Maikiii

Reputation: 451

Python: Build the different paths/trees from an XML file

Here is an example of an XML file:

<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative>
              </designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I would like to store in a list all the different paths that lead to text in my XML file. So I would like something like this:

['Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', ...]

I tried a little code, but it does not work. I don't see how to handle the last elements of a branch separately, or how to rebuild the whole path from the root when I switch to another node in the middle (e.g. Envelope/Body/ADD_LandIndex_001/DATAAREA...).

import xml.etree.ElementTree as et
import os
import pandas as pd
from re import search

filename = 'file_try.xml'
element_tree = et.parse(filename)
root = element_tree.getroot()
namespace = "{http://schemas.xmlsoap.org/soap/envelope/}"


def remove_namespace(string, namespace):
    if search(namespace, string):
        new_string = string.replace(namespace, '')
    else:
        new_string = string
    return new_string

dico = {}
title = root.tag
print(title)

for element in root.findall('.//'):
    # print(element)
    if len(list(element)) > 0:
        # print('True')
        title = remove_namespace(title + '/' + element.tag, namespace)
        print(title + '\n')
    else:
        title = root.tag

Can anyone help me?

Thank you

Upvotes: 1

Views: 45

Answers (1)

Jack Fleeting

Reputation: 24940

You can adapt this to your actual code, but basically it should look like this:

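from lxml import etree

soap = """[your xml above]"""
# Pass bytes: lxml does not accept a str that carries an encoding declaration
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)

# Visit every text node and skip those that are only whitespace
for target in root.xpath('//text()'):
    if target.strip():
        # getpath() gives the XPath of the element that owns the text
        print(tree.getpath(target.getparent()).replace('SOAP-ENV:', ''))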

Output:

/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate
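
If lxml is not available, a similar result can probably be had with the standard library alone. The following is only a minimal sketch, not part of the answer above: `local_name` and `text_paths` are hypothetical helper names, the sample is a shortened copy of the XML from the question, and only `element.text` is checked (tail text is ignored). It collects the paths in a list, without a leading slash, which is the shape the question asked for.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (assumption: same structure)
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Drop the '{namespace-uri}' prefix from an ElementTree tag, if any."""
    return tag.rsplit("}", 1)[-1]


def text_paths(element, parent_path=""):
    """Return the paths of all elements at or below `element` with non-blank text."""
    name = local_name(element.tag)
    path = f"{parent_path}/{name}" if parent_path else name
    paths = []
    if element.text and element.text.strip():
        paths.append(path)
    # Recurse with the current path, so every branch starts again from the root
    for child in element:
        paths.extend(text_paths(child, path))
    return paths


root = ET.fromstring(SAMPLE.encode("utf-8"))
paths = text_paths(root)
for p in paths:
    print(p)
```

Passing the parent's path down through the recursion is what avoids the problem in the question's loop, where a single `title` variable had to be reset by hand whenever the traversal moved to another branch.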

Upvotes: 1
