Reputation: 173
I have a big XML file that looks like the one below. I have included only part of it, since the file is larger than 2 GB, just so that you can see the structure. Basically, all SubNetwork parents
have the same structure as the one shown below. What I want to do is extract only the part of this XML file with a given <ManagedElementId string="xxxx" />
(where xxxx is the input variable). Here are my XML and code:
<Create>
  <SubNetwork networkType="GSM" userLabel="BSC">
    .
    .
  </SubNetwork>
  <SubNetwork networkType="WCDMA" userLabel="RNC01">
    .
    .
  </SubNetwork>
  <SubNetwork networkType="IPRAN" userLabel="IPRAN">
    .
    .
  </SubNetwork>
  <SubNetwork networkType="WCDMA" userLabel="RNC02">
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GALPAS" />
      <primaryType type="RBS" />
      .
      .
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GTUTI" />
      <primaryType type="RBS" />
      .
      .
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GHHH" />
      <primaryType type="RBS" />
      .
      .
    </ManagedElement>
  </SubNetwork>
</Create>
and the code
from xml.etree import ElementTree
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import XML, fromstring, tostring
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
from xml.etree.ElementTree import Element, SubElement, Comment

with open(r"C:\\Users\\etihkru\\Desktop\\h4.xml", 'rt') as f:
    root = ET.parse(f)
    tree=root.getroot()

with open(r"C:\\Users\\etihkru\\Desktop\\list_of_xxx", 'r') as f2:
    for line in f2:
        line=line.rstrip()
        line1='"' + line + '"'
        xp_str1 = str(('.//ManagedElementId[@string='))
        xp_str2 = str("]/../../")
        str_elem = xp_str1 + line1 + xp_str2
        for item in tree.findall(str_elem):
            print ET.tostring(item)
and the file list_of_xxx is as below:
3GALPAS
3GTUTI
As I said, there are numerous <ManagedElementId string="xxxx" /> elements, and I just want to extract the ones that are in list_of_xxx.
So I want output as below:
<SubNetwork networkType="WCDMA" userLabel="RNC02">
  <ManagedElement sourceType="CELLO">
    <ManagedElementId string="3GALPAS" />
    <primaryType type="RBS" />
    .
    .
  </ManagedElement>
</SubNetwork>
<SubNetwork networkType="WCDMA" userLabel="RNC02">
  <ManagedElement sourceType="CELLO">
    <ManagedElementId string="3GTUTI" />
    <primaryType type="RBS" />
    .
    .
  </ManagedElement>
</SubNetwork>
So, I want to find all ManagedElementId elements listed in list_of_xxx, together with their parents ManagedElement and SubNetwork, and write them out as shown above. Every ManagedElementId should be enclosed in its parents as mentioned. I'm using Python 2.6 without lxml, as I don't have the rights to install it.
Upvotes: 1
Views: 4913
Reputation: 89285
Extracting a part of the XML that exists as-is in the source should be trivial. For example, getting the ManagedElement nodes that contain a particular ManagedElementId you're interested in is easy. But here you seem to want them wrapped in their SubNetwork parent node.
In the source XML, SubNetwork contains a mix of elements you want to get and other elements you want to strip from the result, so there is actually no SubNetwork node containing only the ManagedElement nodes you want.
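To see the problem concretely, here is a minimal sketch on a trimmed-down sample. The sample XML and the variable names are made up for illustration, and it assumes an ElementTree version that supports attribute predicates and `..` steps (ElementTree 1.3, as shipped with Python 2.7 and later). Going two levels up from a matching ManagedElementId returns the original SubNetwork, which still holds every ManagedElement, not just the one that matched:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample with the same shape as the question's XML.
SAMPLE = """
<Create>
  <SubNetwork networkType="WCDMA" userLabel="RNC02">
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GALPAS" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GTUTI" />
    </ManagedElement>
  </SubNetwork>
</Create>
"""

root = ET.fromstring(SAMPLE)

# Two levels up from the matching ManagedElementId is the original SubNetwork.
subnets = root.findall(".//ManagedElementId[@string='3GALPAS']/../..")

# Collect the ids of every ManagedElement inside the SubNetwork that was found.
ids_in_subnet = [
    me_id.get("string")
    for subnet in subnets
    for me_id in subnet.findall("./ManagedElement/ManagedElementId")
]
print(ids_in_subnet)
```

The list contains both 3GALPAS and 3GTUTI, so printing the SubNetwork as found would also print the unwanted sibling.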
We can approach this by extracting the ManagedElement nodes from the source XML and adding them to a reconstructed parent SubNetwork node:
.....
.....
for line in f2:
    line = line.rstrip()
    # get all subnet nodes containing certain ManagedElementId
    subnet_path = ".//ManagedElementId[@string='{0}']/../.."
    subnet_path = subnet_path.format(line)
    for subnet in tree.findall(subnet_path):
        # reconstruct subnet node:
        parent = ET.Element(subnet.tag, attrib=subnet.attrib)
        # path to find all ManagedElement containing certain ManagedElementId
        content_path = ".//ManagedElementId[@string='{0}']/..".format(line)
        # append all ManagedElement found to the new subnet:
        for content in subnet.findall(content_path):
            parent.append(content)
        # print new subnet:
        print ET.tostring(parent)
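For trying this out without the 2 GB file, the same idea can be wrapped in a function and run against a small inline sample. This is only a sketch: `extract_wrapped`, `SAMPLE` and `wanted_ids` are hypothetical names, the sample data is invented, and it again assumes an ElementTree version with predicate and `..` support (Python 2.7 or later), so it may not run unchanged on 2.6.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the question's XML.
SAMPLE = """
<Create>
  <SubNetwork networkType="GSM" userLabel="BSC" />
  <SubNetwork networkType="WCDMA" userLabel="RNC02">
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GALPAS" />
      <primaryType type="RBS" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GTUTI" />
      <primaryType type="RBS" />
    </ManagedElement>
    <ManagedElement sourceType="CELLO">
      <ManagedElementId string="3GHHH" />
      <primaryType type="RBS" />
    </ManagedElement>
  </SubNetwork>
</Create>
"""


def extract_wrapped(root, wanted_ids):
    """Return one rebuilt SubNetwork per wanted id and matching SubNetwork."""
    results = []
    for wanted in wanted_ids:
        # SubNetwork nodes that contain the wanted ManagedElementId.
        subnet_path = ".//ManagedElementId[@string='{0}']/../..".format(wanted)
        # ManagedElement nodes (inside one SubNetwork) that contain that id.
        content_path = ".//ManagedElementId[@string='{0}']/..".format(wanted)
        for subnet in root.findall(subnet_path):
            # New, empty SubNetwork carrying a copy of the original attributes.
            wrapper = ET.Element(subnet.tag, attrib=dict(subnet.attrib))
            for managed in subnet.findall(content_path):
                # The ManagedElement object is shared with the source tree, not copied.
                wrapper.append(managed)
            results.append(wrapper)
    return results


root = ET.fromstring(SAMPLE)
wanted_ids = ["3GALPAS", "3GTUTI"]
results = extract_wrapped(root, wanted_ids)
for wrapper in results:
    print(ET.tostring(wrapper).decode("utf-8"))
```

Because the appended ManagedElement objects are shared with the source tree rather than copied, modifying them in the result would also modify the original; for read-only printing as done here that does not matter.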
Upvotes: 3