Anna Semjén

Reputation: 787

How can I remove XML parts with iterparse with parents included using ElementTree in Python?

I have multiple large files that I need to import and iterate through. All of them are XML and share the same tree structure, something like the sample below, with some extra text apart from the Id (there are more child elements under Start). What I would like to do is supply a list of Ids that I know are wrong and remove the corresponding reports from the XML file. One report is everything between an opening and a closing "T" tag.

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>abcd</Id>
              </Start>
           </T>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

What I have so far:

from xml.etree import cElementTree as ET

file_path = '/path/to/my_xml.xml'
to_remove = []
root = None
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'end':
        if elem.tag == 'Id':
            new_root = elem
            #print([elem.tag for elem in new_root.iter()])
            for elem2 in new_root.iter('Id'):
                id = elem2.text
                if id =='abcd':
                    print(id)
                    to_remove.append(new_root)
root = elem
for item in to_remove:
    root.remove(item)

The code above doesn't work: root ends up being the top-level Header element of the whole file, so remove can't find the subelement I want to delete, because its parent is Header3, not Header.

So the desired output would be:

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

Going forward, I will need to remove not a single value but thousands of values, supplied as a list; I just thought the problem was easier to show this way. Any help is appreciated.

Upvotes: 0

Views: 1108

Answers (2)

Martin Honnen

Reputation: 167716

I think you can use

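import xml.etree.ElementTree as ET

ids_to_remove = {'abcd'}  # a set keeps lookups fast for thousands of IDs

elements_to_remove = []

# Without an events argument, iterparse reports only "end" events.
for event, element in ET.iterparse('file.xml'):
    # findtext returns None (instead of raising) when Start/Id is missing
    if element.tag == 'T' and element.findtext('Start/Id') in ids_to_remove:
        elements_to_remove.append(element)
    if element.tag == 'Header3':
        for el in elements_to_remove:
            element.remove(el)
            el.clear()
        elements_to_remove.clear()  # start fresh if another Header3 follows
    if element.tag == 'Header':
        root = element

ET.dump(root)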

I haven't tested how that works with huge files. It collects all elements to be removed first and only removes them once Header3 is complete. I am not sure the ElementTree API offers a way to detach an element directly in the element.tag == 'T' branch, but perhaps the following frees each element earlier:

import xml.etree.ElementTree as ET

ids_to_remove = {'abcd', 'baz', 'bar'}

for event, element in ET.iterparse('file.xml', events=('start', 'end')):
    if event == 'start' and element.tag == 'Header3':
        header3 = element  # remember the parent of the T elements that follow
    elif event == 'end' and element.tag == 'T' and element.findtext('Start/Id') in ids_to_remove:
        header3.remove(element)  # detach the report as soon as it is complete
        element.clear()
    elif event == 'start' and element.tag == 'Header':
        root = element

ET.dump(root)
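
As a rough sketch of the same idea that does not hard-code Header3, the loop can keep a stack of the currently open elements, so the parent of each T is always on top of the stack when the T ends. The helper name strip_reports, its parameters, and the inline sample data are made up for illustration; only iterparse, findtext, remove, and clear are ElementTree calls.

```python
import io
import xml.etree.ElementTree as ET

# Sample data mirroring the structure in the question (illustrative only).
SAMPLE = b"""<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""


def strip_reports(source, ids_to_remove, report_tag="T", id_path="Start/Id"):
    """Hypothetical helper: stream-parse source and drop matching reports.

    Returns the root element of the pruned tree.
    """
    ids_to_remove = set(ids_to_remove)  # fast membership tests
    ancestors = []  # stack of currently open elements
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = element  # the first start event is the document root
            ancestors.append(element)
            continue
        ancestors.pop()  # pop the element that just ended
        if element.tag == report_tag and element.findtext(id_path) in ids_to_remove:
            if ancestors:  # the top of the stack is now the direct parent
                ancestors[-1].remove(element)
            element.clear()  # release the children of the removed report
    return root


root = strip_reports(io.BytesIO(SAMPLE), ["abcd"])
remaining = [e.text for e in root.iter("Id")]
print(remaining)  # ['qrlf']
```

For a real file, pass the path instead of the BytesIO object and save the result with ET.ElementTree(root).write(...). Note that the reports that are kept still stay in memory; only the removed ones are freed.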

Upvotes: 1

user1459519

Reputation: 720

Since your XML structure is simple, it's probably easier to use XPath (about a third of the way down https://docs.python.org/3/library/xml.etree.elementtree.html). The following usage examples come from that section of the documentation page:

import xml.etree.ElementTree as ET

root = ET.fromstring(countrydata)

# Top-level elements
root.findall(".")

# All 'neighbor' grand-children of 'country' children of the top-level
# elements
root.findall("./country/neighbor")

# Nodes with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")

# 'year' nodes that are children of nodes with name='Singapore'
root.findall(".//*[@name='Singapore']/year")

# All 'neighbor' nodes that are the second child of their parent
root.findall(".//neighbor[2]")

The XML structure used for the examples can be found at the top of the doc page.

The second example shows an easy way to select the subelements you want removed ("T" in your case), but the second-to-last example may be more useful for you. Also see the [tag='text'] predicate in the XPath syntax section that appears just below the examples.
Pass the results of that selection to the remove operation (about 3/4 of the way down the page), then use the ElementTree write operation (about 4/5 of the way down) to get the cleaned-up XML.
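
A minimal, untested-at-scale sketch of that idea for a single ID, assuming the structure from the question. ElementTree's limited XPath support applies the [tag='text'] predicate to a direct child tag, so the predicate is placed on Start and ".." steps up to the enclosing T. Since remove only works on an element's direct parent, the sketch also builds a child-to-parent map. The sample string and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample data mirroring the structure in the question (illustrative only).
xml_text = """<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""

root = ET.fromstring(xml_text)

# Start elements that have an Id child with text 'abcd', then one level up to T.
bad_reports = root.findall(".//Start[Id='abcd']/..")

# remove() must be called on the direct parent, so map each child to its parent.
parent_of = {child: parent for parent in root.iter() for child in parent}
for report in bad_reports:
    parent_of[report].remove(report)

result = ET.tostring(root, encoding="unicode")
print(result)
```

Running one XPath query per ID would get slow with thousands of IDs; in that case a single pass over the T elements with a set lookup is likely the better fit.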

The above assumes you are passing a string. To read from a file, use parse instead, e.g.:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
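
Putting parse, remove, and write together for a whole list of IDs might look like the sketch below. The helper name remove_reports_from_file and the file names are hypothetical, and the demo writes a small throwaway file so the example runs on its own. It loads the full tree into memory, so it suits files that fit in RAM.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def remove_reports_from_file(in_path, out_path, bad_ids):
    """Hypothetical helper: copy in_path to out_path without the listed reports."""
    bad_ids = set(bad_ids)  # fast membership tests for thousands of IDs
    tree = ET.parse(in_path)
    root = tree.getroot()
    removed = 0
    # './/T/..' yields each element that has at least one T child.
    for parent in root.findall(".//T/.."):
        # findall returns a list, so removing while looping is safe here.
        for report in parent.findall("T"):
            if report.findtext("Start/Id") in bad_ids:
                parent.remove(report)
                removed += 1
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return removed


# Demo on a throwaway file that mirrors the structure in the question.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.xml")
dst = os.path.join(workdir, "out.xml")
with open(src, "w", encoding="utf-8") as fh:
    fh.write("<Header><Header2><Header3>"
             "<T><Start><Id>abcd</Id></Start></T>"
             "<T><Start><Id>qrlf</Id></Start></T>"
             "</Header3></Header2></Header>")

removed_count = remove_reports_from_file(src, dst, ["abcd", "zzzz"])
kept_ids = [e.text for e in ET.parse(dst).getroot().iter("Id")]
print(removed_count, kept_ids)  # 1 ['qrlf']
```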

DISCLAIMER: I'm doing similar work, but I haven't actually tried this. So think of it as inspiration, not as a complete solution.

BTW, I'm using Python 3.7.4. For those who don't already know, you can use the version selector at the top left of the doc page to select the version you are using.

Upvotes: 1
