user3319356

Reputation: 173

Extract part of xml file with python etree

I have a big XML file that looks like the one below. I have included only part of it to show the structure, since the full file is larger than 2 GB. Basically, all SubNetwork parents have the same structure as the one shown below. I want to extract only the parts of this XML file that contain <ManagedElementId string="xxxx" /> (where xxxx is the input variable). Here are my XML and code:

<Create> 
<SubNetwork networkType="GSM" userLabel="BSC">
.
.
</SubNetwork>
<SubNetwork networkType="WCDMA" userLabel="RNC01">
.
.
</SubNetwork>
<SubNetwork networkType="IPRAN" userLabel="IPRAN">
.
.
</SubNetwork>
<SubNetwork networkType="WCDMA" userLabel="RNC02">
                  <ManagedElement sourceType="CELLO">
                     <ManagedElementId string="3GALPAS" />
                     <primaryType type="RBS" />
                   .
                   .
                  </ManagedElement>
                  <ManagedElement sourceType="CELLO">
                     <ManagedElementId string="3GTUTI" />
                     <primaryType type="RBS" />
                   .
                   .
                  </ManagedElement>
                    <ManagedElement sourceType="CELLO">
                     <ManagedElementId string="3GHHH" />
                     <primaryType type="RBS" />
                   .
                   .
                  </ManagedElement>
</SubNetwork>
</Create> 

And the code:

import xml.etree.ElementTree as ET


with open(r"C:\Users\etihkru\Desktop\h4.xml", 'rt') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    with open(r"C:\Users\etihkru\Desktop\list_of_xxx", 'r') as f2:
        for line in f2:
            line = line.rstrip()
            # path: the matching ManagedElementId, then two levels up
            str_elem = './/ManagedElementId[@string="' + line + '"]/../../'
            for item in root.findall(str_elem):
                print ET.tostring(item)

The file list_of_xxx looks like this:

3GALPAS
3GTUTI

As I said, there are numerous <ManagedElementId string="..." /> elements, and I want to extract only the ones that are listed in list_of_xxx.

So I want output as below:

<SubNetwork networkType="WCDMA" userLabel="RNC02">
                  <ManagedElement sourceType="CELLO">
                     <ManagedElementId string="3GALPAS" />
                     <primaryType type="RBS" />
                   .
                   .
                  </ManagedElement>
</SubNetwork>
<SubNetwork networkType="WCDMA" userLabel="RNC02">
                  <ManagedElement sourceType="CELLO">
                     <ManagedElementId string="3GTUTI" />
                     <primaryType type="RBS" />
                   .
                   .
                  </ManagedElement>
</SubNetwork>

So I want to find all ManagedElementId elements listed in list_of_xxx, together with their parent ManagedElement and SubNetwork elements, and write them out as shown above. Every ManagedElementId should be enclosed in its parents as mentioned. I'm using Python 2.6 without lxml, as I don't have the rights to install it.

Upvotes: 1

Views: 4913

Answers (1)

har07

Reputation: 89285

Extracting a part of the XML that exists as-is in the source should be trivial. For example, getting the ManagedElement nodes that contain a particular ManagedElementId is easy. But here you seem to want them wrapped in their SubNetwork parent node.

In the source XML, each SubNetwork contains a mix of elements you want to get and other elements you want to strip from the result, so there is no SubNetwork node that contains only the ManagedElement nodes you want.
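To illustrate the point, here is a small sketch run against a cut-down copy of the sample XML (the elided children are left out). It assumes an ElementTree version that supports attribute predicates and the ".." parent step, which as far as I know means ElementTree 1.3, bundled from Python 2.7 onwards. Selecting the SubNetwork ancestor of a single ID still returns the whole SubNetwork, siblings included:

```python
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the question's XML (elided children omitted).
SAMPLE = """<Create>
  <SubNetwork networkType="WCDMA" userLabel="RNC02">
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GALPAS" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GTUTI" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GHHH" />
    </ManagedElement>
  </SubNetwork>
</Create>"""

root = ET.fromstring(SAMPLE)

# Walk from the matching ManagedElementId up two levels to its SubNetwork.
subnets = root.findall(".//ManagedElementId[@string='3GALPAS']/../..")

# The SubNetwork that comes back still holds every ManagedElement,
# not only the one whose ID was asked for.
ids = [e.get("string") for e in subnets[0].findall(".//ManagedElementId")]
print(ids)
```

The list printed at the end contains all three IDs, which is why the wrapper node has to be rebuilt rather than selected.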

We can approach this by extracting the ManagedElement nodes from the source XML and adding them to a reconstructed SubNetwork parent node:

.....
.....
for line in f2:
    line = line.rstrip()
    #get all subnet nodes containing certain ManagedElementId
    subnet_path = ".//ManagedElementId[@string='{0}']/../.."
    subnet_path = subnet_path.format(line)
    for subnet in tree.findall(subnet_path):
        #reconstruct subnet node:
        parent = ET.Element(subnet.tag, attrib=subnet.attrib)
        #path to find all ManagedElement containing certain ManagedElementId
        content_path = ".//ManagedElementId[@string='{0}']/..".format(line)
        #append all ManagedElement found to the new subnet:
        for content in subnet.findall(content_path):
            parent.append(content)
        #print new subnet:
        print ET.tostring(parent)
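One caveat: the snippet above relies on attribute predicates and the ".." parent step, which as far as I know only arrived with ElementTree 1.3 (bundled from Python 2.7), so it may not run on Python 2.6. A plain loop avoids both. The following is only a minimal sketch, assuming ManagedElement is a direct child of SubNetwork and ManagedElementId a direct child of ManagedElement, as in the sample; extract_subnetworks is a hypothetical helper name, not part of any library:

```python
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the question's XML (elided children omitted).
SAMPLE = """<Create>
  <SubNetwork networkType="WCDMA" userLabel="RNC02">
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GALPAS" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GTUTI" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GHHH" />
    </ManagedElement>
  </SubNetwork>
</Create>"""


def extract_subnetworks(root, wanted_ids):
    """Return one rebuilt SubNetwork per ManagedElement whose ID is wanted.

    Hypothetical helper: uses only tag-name paths, no predicates and no "..".
    """
    wanted = set(wanted_ids)
    results = []
    for subnet in root.findall(".//SubNetwork"):
        # Only direct ManagedElement children are considered (assumption).
        for managed in subnet.findall("ManagedElement"):
            id_elem = managed.find("ManagedElementId")
            if id_elem is not None and id_elem.get("string") in wanted:
                # Rebuild an empty SubNetwork with the same tag and attributes,
                # then attach only the matching ManagedElement.
                wrapper = ET.Element(subnet.tag, dict(subnet.attrib))
                wrapper.append(managed)
                results.append(wrapper)
    return results


root = ET.fromstring(SAMPLE)
results = extract_subnetworks(root, ["3GALPAS", "3GTUTI"])
for wrapper in results:
    print(ET.tostring(wrapper))
```

Results come back in document order rather than in the order of list_of_xxx. Collecting the IDs into a set first also means the tree is walked once instead of once per line of the list, which may matter for a file of this size.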

Upvotes: 3
