Dragons Taco

Reputation: 36

What is the correct way to remove multiple elements using ElementTree?

I've been using Python and ElementTree to manipulate rather large XML files with mixed success. I find that I have difficulty removing multiple elements, especially when they are children of the root. If I have 4 elements numbered 1-4, only 1 and 3 will be removed using a "for elem in root" loop.

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CrossDEV culture-info="en-US" platform-version="2.40.8" product-version="2.40.8">
<MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
    <TargetObjectKey>FOOSTUFF1</TargetObjectKey>
</MyStuff.Interface.Common.Objects.ActionItem>
<MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
    <TargetObjectKey>FOOSTUFF2</TargetObjectKey>
</MyStuff.Interface.Common.Objects.ActionItem>
<MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
    <TargetObjectKey>FOOSTUFF3</TargetObjectKey>
</MyStuff.Interface.Common.Objects.ActionItem>
<MyStuff.Interface.Common.Objects.ActionItem ImportMode="Default">
    <TargetObjectKey>FOOSTUFF4</TargetObjectKey>
</MyStuff.Interface.Common.Objects.ActionItem>
</CrossDEV>

Code:

def RemoveElementActionItem():
        sTag = 'SourceObjectKey'
        sTag2 = 'TargetObjectKey'
        sPattern = 'CHE-ZUG'
        r=0
        e=0
        global myroot
        if myroot is not None:
                print ('Root:', myroot)
                for elem in myroot:
                        e+=1
                        print ('Elem:',e, elem)
                        aRemove = True
                        bRemove = True
                        o = elem.find(sTag)
                        if o is not None and o.text.find(sPattern,0) > -1:
                                aRemove = False

                        p = elem.find(sTag2)
                        if p is not None and p.text.find(sPattern,0) > -1:
                                bRemove = False

                        if bRemove and aRemove:
                                myroot.remove(elem)
                                r+=1
                                print ('Removed:', myroot, elem)
                        else:
                                print ('   Keep:', myroot, elem, o , p, aRemove, bRemove)
        return r

In the code above I am searching the grandchildren for specific text values. I've cobbled together a simple XML file in which each ActionItem fails its test, and therefore should be removed. Instead only 2 of the 4 get removed.

My guess is that when the first from the list is removed, the addresses change so that the second is skipped. Next the 3rd one is removed and the list shifts forward again.

Since in this simple case all 4 elements should be removed, what is a better way to construct my code? I'd prefer to stick to the same library if I can since I've invested lots of time in it and haven't explored lxml or other libraries yet.

Note, I've been playing with different ways to scope the root object (myroot). I've had it as a parameter, a return value and here as a global. I've had the same results each way.

Upvotes: 0

Views: 1000

Answers (2)

Kevin

Reputation: 1

While the other answer here is very useful, I personally was not able to make use of it, as my child elements did not all have the same name. Instead, I ended up iterating through my ElementTree with a while loop, where I decrement the end variable (instead of incrementing the counter) whenever I have to remove a child.

By reducing the end target, you avoid "out of bounds" errors.

Here's an example of what this looks like just iterating through a string for simplicity:

i = 0
word = 'Python'
end = len(word)

while i < end:
    letter = word[i]
    if letter == 'h':
        word = word.replace(letter, '')
        end-=1
        continue
    print('Current Letter :' + letter)
    i+=1

If you apply this to an element tree, the code looks more or less the same, except instead of using replace on a char in a string, you'll use root.remove(child) where child = root[i]
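As a rough sketch of how that might look with ElementTree (the helper name remove_matching_children, the should_remove predicate and the sample XML are made up for illustration, they are not from the question):

```python
from xml.etree import ElementTree as ET


def remove_matching_children(root, should_remove):
    """Remove direct children of root for which should_remove(child) is true.

    Hypothetical helper: returns the number of children removed.
    """
    i = 0
    end = len(root)
    removed = 0
    while i < end:
        child = root[i]
        if should_remove(child):
            root.remove(child)  # the next child shifts into index i
            end -= 1            # one fewer child left to visit
            removed += 1
            continue            # do not advance i
        i += 1
    return removed


# Made-up sample data: remove every child whose text is "x".
root = ET.fromstring("<root><a>x</a><b>keep</b><c>x</c><d>x</d></root>")
count = remove_matching_children(root, lambda el: el.text == "x")
remaining = [el.tag for el in root]
print(count, remaining)
```

The index only advances when the current child is kept, so no child is skipped after a removal.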

I hope this is able to help someone. Thanks for reading.

Upvotes: 0

CristiFati

Reputation: 41137

code.py:

import sys
from xml.etree import ElementTree as ET


XML_STR = """\
<?xml version="1.0" encoding="utf-8"?>
<RootNode>
  <ChildNode DummyIndex="0">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="1">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="2">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="3">
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
    <GrandChildNode_ToRemove DummyIndex="1">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="4">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
    <GrandChildNode_ToDelete DummyIndex="1">GrandChildText</GrandChildNode_ToDelete>
  </ChildNode>
  <ChildNode DummyIndex="5">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
  </ChildNode>
  <ChildNode DummyIndex="6">
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
    <GrandChildNode_ToRemove DummyIndex="0">GrandChildText</GrandChildNode_ToRemove>
  </ChildNode>
  <ChildNode DummyIndex="7">
    <GrandChildNode_ToDelete DummyIndex="0">GrandChildText</GrandChildNode_ToDelete>
    <GrandChildNode_ToRemove DummyIndex="0">____OTHERTEXT____</GrandChildNode_ToRemove>
    <GrandChildNode DummyIndex="0">GrandChildText</GrandChildNode>
  </ChildNode>
  <ChildNode DummyIndex="8"/>
</RootNode>
"""

REMOVE_GRANDCHILD_TAGS = ["GrandChildNode_ToDelete", "GrandChildNode_ToRemove"]
REMOVE_GRANDCHILD_TEXT = "Child"


def is_node_subject_to_delete(node):
    removable_child_nodes_count = 0
    for remove_tag in REMOVE_GRANDCHILD_TAGS:
        for child_node in node.findall(remove_tag):
            if child_node.text and REMOVE_GRANDCHILD_TEXT in child_node.text:
                removable_child_nodes_count += 1
                break
    return removable_child_nodes_count == len(REMOVE_GRANDCHILD_TAGS)


def main():
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    #print(XML_STR)
    root_node = ET.fromstring(XML_STR)
    print("Root node has {:d} children\n".format(len(root_node.findall("ChildNode"))))
    to_remove_child_nodes = list()
    for child_node in root_node:
        if is_node_subject_to_delete(child_node):
            to_remove_child_nodes.append(child_node)
    print("Removing nodes:")
    for to_remove_child_node in to_remove_child_nodes:
        print("\n  Tag: {}\n  Text: {}\n  Attrs: {}".format(to_remove_child_node.tag, to_remove_child_node.text.strip(), to_remove_child_node.items()))
        root_node.remove(to_remove_child_node)
    print("\nRoot node has {:d} children\n".format(len(root_node.findall("ChildNode"))))


if __name__ == "__main__":
    main()

Notes:

  • XML_STR: example xml (could be also placed in a separate file)

    • Consists of a root node ("RootNode") that has a number (>= 0) of child nodes ("ChildNode"):
    • Each child node:
      • Is named "ChildNode"
      • Has a number (>= 0) of child nodes of their own ("GrandChildNode*")
      • Each child (root grand child) node:
        • Has one of the following names (I guess the name endings are more than self explanatory):
          1. "GrandChildNode"
          2. "GrandChildNode_ToDelete"
          3. "GrandChildNode_ToRemove"
        • Has text (which might be None, empty, or consist only of non-printable chars)
  • REMOVE_GRANDCHILD_TAGS - List of tag names (the replacement for sTag and sTag2): a (root child) node can be removed only if it has children matching all tags in the list (see the is_node_subject_to_delete notes below).
    If another tag (e.g. GrandChildNode_ToErase) is needed, it can simply be added to the list (no copy / paste of code needed)

  • REMOVE_GRANDCHILD_TEXT - A 2nd condition on top of the previous item: the text of the matching grandchild node must contain this string ("Child"). A node is deletable only if both conditions are met
  • is_node_subject_to_delete(node) - Checks whether the argument (node - a root child) can be deleted:
    • Rules, as above (both have to be true (&&)) for every tag in the blacklist (REMOVE_GRANDCHILD_TAGS):
      1. The node has a child with that tag - instead of duplicating code, the tags are handled by the outermost for loop
      2. The text of that child contains the "doomed" text
  • main - A general wrapper function
    • Interface to the user
    • As seen, only the nodes with index 1, 2 and 4 (0 based) are suitable for deletion
    • I propose iterating over the root child nodes once (the golden rule: "never modify the iterable you're iterating over" - which is what happened in your case) and, for each node that is deletable, saving it to a list (since Python works with references, that's not costly). At the end, delete all the elements in that list (if any)
    • It's definitely more efficient than the suggested "break out of the loop" approach (especially when dealing with a huge number of root child nodes)
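
A shorter variant of the same idea, as a minimal sketch only: iterate over a snapshot of the children (list(root)) while removing from the live element. The XML below is made-up sample data, not the file from the question, and the first loop is there just to reproduce the skipping behaviour described in the question:

```python
from xml.etree import ElementTree as ET

# Made-up sample data with four children, all of which should go.
XML = "<root><item>1</item><item>2</item><item>3</item><item>4</item></root>"

# Problem: removing while iterating the live element. After each removal
# the remaining children shift left, so the next one is skipped.
broken_root = ET.fromstring(XML)
for elem in broken_root:
    broken_root.remove(elem)
left_after_broken = [e.text for e in broken_root]

# Fix: iterate over a snapshot (a plain list of references) and remove
# from the real element. The snapshot is not affected by the removals.
fixed_root = ET.fromstring(XML)
for elem in list(fixed_root):
    fixed_root.remove(elem)
left_after_fixed = [e.text for e in fixed_root]

print(left_after_broken)  # items 2 and 4 survive, as in the question
print(left_after_fixed)   # nothing left
```

In real code the removal would sit behind a condition such as is_node_subject_to_delete(elem); the unconditional remove here just mirrors the "all 4 should be removed" case.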

Output:

(py35x64_test) e:\Work\Dev\StackOverflow\q049667831>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

Root node has 9 children

Removing nodes:

  Tag: ChildNode
  Text:
  Attrs: [('DummyIndex', '1')]

  Tag: ChildNode
  Text:
  Attrs: [('DummyIndex', '2')]

  Tag: ChildNode
  Text:
  Attrs: [('DummyIndex', '4')]

Root node has 6 children

Upvotes: 0
