Reputation: 36
I've been using Python and ElementTree to manipulate rather large XML files, with mixed success. I have difficulty removing multiple elements, especially when they are children of the root. If I have 4 elements numbered 1-4, only 1 and 3 are removed when I use a "for elem in root" loop.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CrossDEV culture-info="en-US" platform-version="2.40.8" product-version="2.40.8">
    <MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
        <TargetObjectKey>FOOSTUFF1</TargetObjectKey>
    </MyStuff.Interface.Common.Objects.ActionItem>
    <MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
        <TargetObjectKey>FOOSTUFF2</TargetObjectKey>
    </MyStuff.Interface.Common.Objects.ActionItem>
    <MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
        <TargetObjectKey>FOOSTUFF3</TargetObjectKey>
    </MyStuff.Interface.Common.Objects.ActionItem>
    <MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
        <TargetObjectKey>FOOSTUFF4</TargetObjectKey>
    </MyStuff.Interface.Common.Objects.ActionItem>
</CrossDEV>
Code:
def RemoveElementActionItem():
    sTag = 'SourceObjectKey'
    sTag2 = 'TargetObjectKey'
    sPattern = 'CHE-ZUG'
    r = 0  # number of removed elements
    e = 0  # number of visited elements
    global myroot
    if myroot is not None:
        print('Root:', myroot)
        for elem in myroot:
            e += 1
            print('Elem:', e, elem)
            aRemove = True
            bRemove = True
            # Keep the element if either key contains the pattern
            o = elem.find(sTag)
            if o is not None and o.text.find(sPattern, 0) > -1:
                aRemove = False
            p = elem.find(sTag2)
            if p is not None and p.text.find(sPattern, 0) > -1:
                bRemove = False
            if bRemove and aRemove:
                myroot.remove(elem)
                r += 1
                print('Removed:', myroot, elem)
            else:
                print(' Keep:', myroot, elem, o, p, aRemove, bRemove)
    return r
In the code above I am searching the grandchildren for specific text values. I've cobbled together a simple XML file in which every ActionItem fails its test and should therefore be removed. Instead, only 2 of the 4 get removed.
My guess is that when the first from the list is removed, the addresses change so that the second is skipped. Next the 3rd one is removed and the list shifts forward again.
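That guess can be checked with a minimal sketch. The four-item document below is a hypothetical stand-in for the file above, and every element is removed unconditionally, so it only illustrates the skipping and not the real test:

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the real file: four children numbered 1-4.
root = ET.fromstring(
    "<root><item n='1'/><item n='2'/><item n='3'/><item n='4'/></root>"
)

visited = []
for elem in root:          # iterates over the live list of children
    visited.append(elem.get("n"))
    root.remove(elem)      # the remaining children shift left by one

remaining = [elem.get("n") for elem in root]
print("visited:", visited)       # visited: ['1', '3']
print("remaining:", remaining)   # remaining: ['2', '4']
```

The loop only ever sees items 1 and 3, which matches what I observe with the real file.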
Since in this simple case all 4 elements should be removed, what is a better way to construct my code? I'd prefer to stick to the same library if I can since I've invested lots of time in it and haven't explored lxml or other libraries yet.
Note, I've been playing with different ways to scope the root object (myroot). I've had it as a parameter, a return value and here as a global. I've had the same results each way.
Upvotes: 0
Views: 1000
Reputation: 1
While the other answer here is very useful, I personally was not able to make use of it, because my children do not all have the same name. Instead, I ended up iterating through my ElementTree with a while loop, in which I decrement the end variable (instead of incrementing the counter) whenever I have to remove a child.
By reducing the end target, you avoid "out of bounds" errors.
Here's an example of what this looks like just iterating through a string for simplicity:
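i = 0
word = 'Python'
end = len(word)
while i < end:
    letter = word[i]
    if letter == 'h':
        # Drop the letter and shrink the bound; do not advance i,
        # because the next letter has moved into position i
        word = word.replace(letter, '')
        end -= 1
        continue
    print('Current Letter :' + letter)
    i += 1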
If you apply this to an element tree, the code looks more or less the same, except instead of using replace on a char in a string, you'll use root.remove(child) where child = root[i].
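A minimal sketch of what that might look like with ElementTree. The helper name remove_children_while, the predicate argument, and the tiny sample document are made up for illustration; they are not part of my original code:

```python
from xml.etree import ElementTree as ET

def remove_children_while(root, should_remove):
    """Remove matching children in place; returns how many were removed.

    Hypothetical helper: `should_remove` is any function taking a child
    element and returning True when that child has to go.
    """
    i = 0
    end = len(root)
    removed = 0
    while i < end:
        child = root[i]
        if should_remove(child):
            root.remove(child)
            end -= 1       # one child fewer, so shrink the bound
            removed += 1
            continue       # do not advance: the next child is now at index i
        i += 1
    return removed

# Sample document with children that do not share a tag name.
root = ET.fromstring("<r><a/><b/><a/><a/><c/></r>")
removed = remove_children_while(root, lambda child: child.tag == "a")
print(removed, [child.tag for child in root])   # 3 ['b', 'c']
```

Because the index only advances when nothing was removed, consecutive matches (the two adjacent a elements here) are not skipped.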
I hope this is able to help someone. Thanks for reading.
Upvotes: 0
Reputation: 41137
code.py:
import sys
from xml.etree import ElementTree as ET
XML_STR = """\
<?xml version="1.0" encoding="utf-8"?>
<RootNode>
  <ChildNode DummyIndex="0">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="1">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="2">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="3">
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
    <GrandChildNode_ToRemove DummyIndex="1">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="4">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
    <GrandChildNode_ToDelete DummyIndex="1">GrandChildText</GrandChildNode_ToDelete>
  </ChildNode>
  <ChildNode DummyIndex="5">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
  </ChildNode>
  <ChildNode DummyIndex="6">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="7">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">____OTHERTEXT____</GrandChildNode_ToRemove>
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
  </ChildNode>
  <ChildNode DummyIndex="8"/>
</RootNode>
"""
REMOVE_GRANDCHILD_TAGS = ["GrandChildNode_ToDelete", "GrandChildNode_ToRemove"]
REMOVE_GRANDCHILD_TEXT = "Child"


def is_node_subject_to_delete(node):
    removable_child_nodes_count = 0
    for remove_tag in REMOVE_GRANDCHILD_TAGS:
        for child_node in node.findall(remove_tag):
            if REMOVE_GRANDCHILD_TEXT in child_node.text:
                removable_child_nodes_count += 1
                break
    return removable_child_nodes_count == len(REMOVE_GRANDCHILD_TAGS)


def main():
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    #print(XML_STR)
    root_node = ET.fromstring(XML_STR)
    print("Root node has {:d} children\n".format(len(root_node.findall("ChildNode"))))
    to_remove_child_nodes = list()
    for child_node in root_node:
        if is_node_subject_to_delete(child_node):
            to_remove_child_nodes.append(child_node)
    print("Removing nodes:")
    for to_remove_child_node in to_remove_child_nodes:
        print("\n Tag: {}\n Text: {}\n Attrs: {}".format(to_remove_child_node.tag, to_remove_child_node.text.strip(), to_remove_child_node.items()))
        root_node.remove(to_remove_child_node)
    print("\nRoot node has {:d} children\n".format(len(root_node.findall("ChildNode"))))


if __name__ == "__main__":
    main()
Notes:
XML_STR: example XML (it could also be placed in a separate file)
    [...] NULL, empty or simply consisting of non-printable chars)
REMOVE_GRANDCHILD_TAGS - a list of tag names: if a (root child) node has children matching all tags in the list, it can be removed. It replaces sTag and sTag2 (check the is_node_subject_to_delete notes below)
    If another tag (e.g. GrandChildNode_ToErase) is needed, it can just be added to the list (no other copy / paste operations needed)
REMOVE_GRANDCHILD_TEXT - a 2nd condition on top of the previous item: the text of the matching node must contain that text ("Child"). If both conditions are met, the node is deletable
is_node_subject_to_delete(node) - checks whether the argument (node - a root child) can be deleted:
    [...] REMOVE_GRANDCHILD_TAGS) - instead of duplicating code, it's a for (outermost) loop
main - a general wrapper function
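For comparison, the same collect-then-remove idea can be sketched more compactly by iterating over a snapshot of the children. This is an alternative sketch, not the code above: the tag names, the MARKER text, the should_remove helper, and the sample document are hypothetical, and the `or ""` guard for empty nodes is an addition that code.py does not have.

```python
from xml.etree import ElementTree as ET

# Hypothetical tag names and marker text, chosen only for this sketch.
REQUIRED_TAGS = ["ToDelete", "ToRemove"]
MARKER = "Child"

def should_remove(node):
    """True if, for every required tag, some child with that tag
    has text containing MARKER (empty text counts as no match)."""
    return all(
        any(MARKER in (child.text or "") for child in node.findall(tag))
        for tag in REQUIRED_TAGS
    )

root = ET.fromstring(
    "<Root>"
    "<Node id='0'><ToDelete>ChildA</ToDelete><ToRemove>ChildB</ToRemove></Node>"
    "<Node id='1'><ToDelete>ChildA</ToDelete></Node>"
    "<Node id='2'><ToDelete>ChildA</ToDelete><ToRemove/></Node>"
    "<Node id='3'><ToRemove>ChildB</ToRemove><ToDelete>ChildA</ToDelete></Node>"
    "</Root>"
)

# list(root) copies the child references first, so removing from root
# during the loop cannot make the iteration skip a sibling.
for node in list(root):
    if should_remove(node):
        root.remove(node)

print([node.get("id") for node in root])   # ['1', '2']
```

Nodes 0 and 3 satisfy both conditions and are removed; node 1 lacks one of the tags and node 2 has an empty ToRemove child, so both stay.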
Output:
(py35x64_test) e:\Work\Dev\StackOverflow\q049667831>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

Root node has 9 children

Removing nodes:

 Tag: ChildNode
 Text:
 Attrs: [('DummyIndex', '1')]

 Tag: ChildNode
 Text:
 Attrs: [('DummyIndex', '2')]

 Tag: ChildNode
 Text:
 Attrs: [('DummyIndex', '4')]

Root node has 6 children
Upvotes: 0