Reputation: 787
I have multiple large XML files that I need to import and iterate through; they all share the same tree structure. What I would like to do is pass in a list of Ids that I know are wrong and remove the corresponding reports from the XML file. One report is everything between a pair of "T" tags. The structure is something like this, except that the Start element has more child elements with additional text besides the Id:
<Header>
  <Header2>
    <Header3>
      <T>
        <Start>
          <Id>abcd</Id>
        </Start>
      </T>
      <T>
        <Start>
          <Id>qrlf</Id>
        </Start>
      </T>
    </Header3>
  </Header2>
</Header>
What I have so far:
# cElementTree was removed in Python 3.9; ElementTree is the drop-in replacement
from xml.etree import ElementTree as ET

file_path = '/path/to/my_xml.xml'
to_remove = []
root = None
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'end':
        if elem.tag == 'Id':
            new_root = elem
            #print([elem.tag for elem in new_root.iter()])
            for elem2 in new_root.iter('Id'):
                id = elem2.text
                if id == 'abcd':
                    print(id)
                    to_remove.append(new_root)
    root = elem
for item in to_remove:
    root.remove(item)
The code above obviously doesn't work: root ends up being the whole XML file starting at Header, and root.remove(item) can't find the subelement I want to remove, because its parent is Header3, not Header.
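To make that failure concrete, here is a minimal sketch on a cut-down copy of the structure. Element.remove() only looks at direct children, and ElementTree keeps no parent pointers, so the sketch builds a child-to-parent map; the names report and parent_of are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Header><Header2><Header3>"
    "<T><Start><Id>abcd</Id></Start></T>"
    "</Header3></Header2></Header>"
)
report = root.find(".//T")

# <T> is a descendant of Header, not a direct child, so this raises ValueError.
try:
    root.remove(report)
    failed = False
except ValueError:
    failed = True

# Build a child -> parent lookup, then remove <T> from its real parent.
parent_of = {child: parent for parent in root.iter() for child in parent}
parent_of[report].remove(report)
remaining = root.findall(".//T")
```

The map has to be built once per tree, which is fine for a fully parsed document but costs a full traversal.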
So the desired output would be:
<Header>
  <Header2>
    <Header3>
      <T>
        <Start>
          <Id>qrlf</Id>
        </Start>
      </T>
    </Header3>
  </Header2>
</Header>
Going forward I won't be removing a single value but thousands of them, passed in as a list; I just thought the problem was easier to present this way. Any help is appreciated.
Upvotes: 0
Views: 1108
Reputation: 167716
I think you can use
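import xml.etree.ElementTree as ET

ids_to_remove = ['abcd']
elements_to_remove = []
# With the default events, iterparse reports each element when its end tag is read
for event, element in ET.iterparse('file.xml'):
    # Collect every report whose Id is on the removal list
    if element.tag == 'T' and element.find('Start/Id').text in ids_to_remove:
        elements_to_remove.append(element)
    # Header3 is the parent of the T elements, so remove them once it is complete
    if element.tag == 'Header3':
        for el in elements_to_remove:
            element.remove(el)
            el.clear()
    if element.tag == 'Header':
        root = element
ET.dump(root)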
I haven't tested how this behaves with huge files. It obviously collects all the elements to be removed first and only removes them at the end. I am not sure whether the ElementTree API offers a way to detach the element directly in the if element.tag == 'T' and element.find('Start/Id').text in ids_to_remove: branch; perhaps the following frees the element earlier:
import xml.etree.ElementTree as ET

ids_to_remove = ['abcd', 'baz', 'bar']
for event, element in ET.iterparse('file.xml', events=['start', 'end']):
    # A report is complete at its end tag; drop it from its parent right away
    if event == 'end' and element.tag == 'T' and element.find('Start/Id').text in ids_to_remove:
        header3.remove(element)
        element.clear()
    # Remember the parent as soon as its start tag is seen
    if event == 'start' and element.tag == 'Header3':
        header3 = element
    if element.tag == 'Header':
        root = element
ET.dump(root)
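Since the question mentions thousands of Ids, a set lookup and a guard for reports without an Id may be worth adding. The following is only a sketch under the same layout assumption as the question; the helper name strip_reports and the in-memory SAMPLE document are made up for illustration, and I have not timed it on large files.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = (
    b"<Header><Header2><Header3>"
    b"<T><Start><Id>abcd</Id></Start></T>"
    b"<T><Start><Id>qrlf</Id></Start></T>"
    b"</Header3></Header2></Header>"
)

def strip_reports(source, ids_to_remove):
    """Parse `source` and return its root with the unwanted <T> reports removed."""
    ids = set(ids_to_remove)   # constant-time membership tests
    open_elements = []         # stack of elements whose end tag is still pending
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = element
            open_elements.append(element)
        else:
            open_elements.pop()
            # findtext returns None instead of failing when Start/Id is missing
            if element.tag == "T" and element.findtext("Start/Id") in ids:
                if open_elements:
                    open_elements[-1].remove(element)  # top of stack is the parent
    return root

cleaned = strip_reports(io.BytesIO(SAMPLE), ["abcd", "zzzz"])
print(ET.tostring(cleaned, encoding="unicode"))
```

The stack avoids hard-coding Header3 as the parent. To save the result, ET.ElementTree(cleaned).write(...) with an output path of your choice should do.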
Upvotes: 1
Reputation: 720
Since your XML structure is simple, it's probably easier to use XPath (about a third of the way down https://docs.python.org/3/library/xml.etree.elementtree.html). The following usage examples come from that section of the documentation page:
import xml.etree.ElementTree as ET
root = ET.fromstring(countrydata)
# Top-level elements
root.findall(".")
# All 'neighbor' grand-children of 'country' children of the top-level
# elements
root.findall("./country/neighbor")
# Nodes with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")
# 'year' nodes that are children of nodes with name='Singapore'
root.findall(".//*[@name='Singapore']/year")
# All 'neighbor' nodes that are the second child of their parent
root.findall(".//neighbor[2]")
The XML structure used for the examples can be found at the top of the doc page.
The second example shows an easy way to select the subelements you want removed ("T" in your case), but the second-to-last example may be more useful here. Also see the [tag='text'] predicate in the XPath syntax section that appears just below the examples.
Pass the results of that selection to the remove operation (about 3/4 of the way down the page), then call the ElementTree write operation (about 4/5 of the way down) to get the cleaned-up XML.
The above assumes you are passing a string; to read from a file you have to use parse, e.g.:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
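Applied to the structure in the question, the select-then-remove idea might look roughly like the sketch below. It is untested on real data and rests on assumptions: the helper name remove_reports and the inline SAMPLE string are invented for illustration, and the ".//T/.." path is used to reach each parent because remove has to be called on the direct parent of a "T".

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""

def remove_reports(root, ids_to_remove):
    """Remove every <T> whose Start/Id text is listed; return how many were removed."""
    ids = set(ids_to_remove)
    removed = 0
    # ".//T/.." selects each element that has at least one <T> child.
    for parent in root.findall(".//T/.."):
        # findall returns a list, so removing children while looping is safe.
        for report in parent.findall("T"):
            if report.findtext("Start/Id") in ids:
                parent.remove(report)
                removed += 1
    return removed

root = ET.fromstring(SAMPLE)
removed_count = remove_reports(root, ["abcd"])
remaining_ids = [e.text for e in root.iter("Id")]
```

For a file, the same function can be given tree.getroot(), followed by tree.write(...) to save the result. This loads the whole document into memory, so it may not suit very large files.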
*** DISCLAIMER *** I'm doing similar work, but I haven't actually tried this, so think of it as inspiration, not as a complete solution.
BTW, I'm using Python 3.7.4. For those who don't already know, you can use the version selector at the top left of the doc page to select the version you are using.
Upvotes: 1