Python remove duplicate elements from xml tree

Question

I have an XML structure in which some elements are not unique. I managed to sort the subtrees, and I can correctly detect the elements that occur more than once, but the remove function does not seem to take effect.

My XML structure, simplified, looks like this:


  
    blabla blub unique
    blabla blub not unique
    blabla blub not unique
    blabla blub not unique
    blabla blub not unique
    blabla blub again unique
  
  
    2nd blabla blub unique
    2nd blabla blub not unique
    2nd blabla blub not unique
    2nd blabla blub again unique

I want to remove duplicate strings on each page, so I iterate over the pages and over the elements of each page in two nested for loops (an extract of the important lines; I hope I didn't forget anything):

import xml.etree.ElementTree as ET
self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:                     # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            page.remove(elem) # <---- removes just one line
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml") # 2 duplicate lines still there....

Update: the code seems to work, but it removes just one duplicate, not all of them.

xnx · Accepted Answer

I don't know how you've defined elements_equal, but (shamelessly adapted from Testing Equivalence of xml.etree.ElementTree) this works for me:

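EDIT: Store each element to be removed in a list whilst iterating over page, and remove them afterwards rather than inside the same loop. Removing a child while iterating over its parent shifts the remaining children down by one position, so the element that follows each removed one is skipped.

EDIT: Noticed a small typo in the comparison of the element tags and corrected it.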

import xml.etree.ElementTree as ET

path = 'in.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

Gives:

$ python undupe.py
found duplicate: blabla blub not unique
found duplicate: 2nd blabla blub not unique
$ cat out.xml

  
    blabla blub unique
    blabla blub not unique
    blabla blub again unique
  
  
    2nd blabla blub unique
    2nd blabla blub not unique
    2nd blabla blub again unique
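
For anyone wondering why the original loop leaves duplicates behind, here is a minimal sketch that reproduces the effect and shows an alternative to the separate removal list: iterating over a snapshot made with list(page). The tag names root, page and line, the sample text and the shallow helper same() are assumptions for illustration only, since the real markup is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names <root>, <page> and <line>
# are assumptions, not taken from the original file.
SAMPLE = """<root>
  <page>
    <line>unique</line>
    <line>dup</line>
    <line>dup</line>
    <line>dup</line>
    <line>again unique</line>
  </page>
</root>"""


def same(e1, e2):
    """Shallow comparison (tag and text only), enough for these leaf elements."""
    return e2 is not None and e1.tag == e2.tag and e1.text == e2.text


def texts(page):
    """Return the text of every child, in document order."""
    return [elem.text for elem in page]


# 1) Removing while iterating over the live element: after a removal the
#    next sibling slides into the freed position and is never visited,
#    so one of the duplicates survives.
page = ET.fromstring(SAMPLE)[0]
prev = None
for elem in page:
    if same(elem, prev):
        page.remove(elem)
        continue
    prev = elem
naive_result = texts(page)

# 2) Iterating over a snapshot: list(page) copies the child list first,
#    so removals from page do not disturb the loop.
page = ET.fromstring(SAMPLE)[0]
prev = None
for elem in list(page):
    if same(elem, prev):
        page.remove(elem)
        continue
    prev = elem
snapshot_result = texts(page)

print(naive_result)
print(snapshot_result)
```

With this sample, the first loop should still contain two "dup" entries while the second keeps only one. In real code you would probably also want to reset prev to None at the start of each page, so that the last element of one page is not compared with the first element of the next.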
