christianjthomas

Reputation: 11

How do you remove duplicate XML nodes throughout an XML file in Python?

I am aware this is an almost duplicate of this solution

However, I cannot work out why, for my example (similar to a real data issue), it does not remove all of the duplicates.

The code I am using removes 2 of the 4 identical <body> elements, not 3.

I am attempting to create a Python script that cleans duplicates from XML files.

Code:

import xml.etree.ElementTree as etree  # assumed; the post does not show its import

tree = etree.parse(path)
root = tree.getroot()

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag:
        return False
    if e1.text != e2.text:
        return False
    if e1.tail != e2.tail:
        return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

prev = ""
for page in root:
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
tree.write("clean.xml")

XML:

<?xml version="1.0" encoding="UTF-8"?>  
<emails>  
<email>  
  <to>Vimal</to>  
  <from>Sonoo</from>  
  <heading>Hello</heading>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
</email>  
<email>  
  <to>Peter</to>  
  <from>Jack</from>  
  <heading>Birth day wish</heading>  
  <body>Happy birth day Tom!</body>  
</email>  
<email>  
  <to>James</to>  
  <from>Jaclin</from>  
  <heading>Morning walk</heading>  
  <body></body>  
</email>  
<email>  
  <to>Kartik</to>  
  <from>Kumar</from>  
  <heading>Health Tips</heading>  
  <body>Smoking is injurious to health!</body>  
</email>  
</emails>

Hopefully this is just a situation of me missing something obvious and I can learn what that is and move on happy.

Upvotes: 1

Views: 237

Answers (1)

Jack Fleeting

Reputation: 24928

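The reason you're getting this outcome is that there is a difference between the 3rd and 4th <body> elements: their tail properties, i.e. the whitespace between each element's closing tag and the next tag, have different lengths (7 and 3 characters, respectively). The first three <body> elements are each followed by trailing spaces, a newline and the next line's indentation, while the last one is followed only by trailing spaces and a newline before </email>. Consequently,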

if e1.tail != e2.tail:
    return False

makes elements_equal return False for the 4th <body>, so it is not treated as a duplicate.
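To see the difference for yourself, you can print each tail. This is only a small sketch: it uses the standard library's xml.etree.ElementTree (an assumption, since the question does not show which etree it imports), and SAMPLE is a trimmed copy of the first <email> from the question that keeps the two trailing spaces after each closing tag.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first <email> from the question, keeping the
# two trailing spaces that follow every closing tag.
SAMPLE = (
    "<email>  \n"
    "  <heading>Hello</heading>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "</email>"
)

email = ET.fromstring(SAMPLE)

# tail is the text between an element's closing tag and the next tag.
tails = [body.tail for body in email.findall("body")]
for tail in tails:
    print(len(tail), repr(tail))
```

The first three tails include the indentation of the following line; the last one does not, because the next tag is the unindented </email>.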

You can handle it either by dropping the tail comparison from elements_equal or by normalizing the whitespace in the XML itself.
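A middle ground between those two options is to keep the tail check but compare tails with whitespace stripped. The following is a rough sketch of that idea, not a drop-in replacement: _norm and remove_consecutive_duplicates are names made up for this example, it assumes the standard library's xml.etree.ElementTree, and unlike the code in the question it also strips whitespace around text and resets prev for every page.

```python
import xml.etree.ElementTree as ET


def _norm(text):
    """Treat None and whitespace-only strings as equal (hypothetical helper)."""
    return (text or "").strip()


def elements_equal(e1, e2):
    """Compare two elements recursively, ignoring whitespace around text and tail."""
    if e1.tag != e2.tag or e1.attrib != e2.attrib:
        return False
    if _norm(e1.text) != _norm(e2.text) or _norm(e1.tail) != _norm(e2.tail):
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))


def remove_consecutive_duplicates(root):
    """Remove each child of each page that equals the previously kept sibling.

    Returns the number of elements removed.
    """
    removed = 0
    for page in root:
        prev = None
        for elem in list(page):  # iterate over a copy while removing
            if prev is not None and elements_equal(elem, prev):
                page.remove(elem)
                removed += 1
            else:
                prev = elem
    return removed
```

With the XML from the question, calling remove_consecutive_duplicates(tree.getroot()) before tree.write("clean.xml") should leave a single <body> in the first <email>. The indentation left around the surviving element may look slightly off, since it keeps its original tail.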

Upvotes: 1
