Reputation: 11
I am aware this is almost a duplicate of this solution.
However, I cannot work out why, for my example (similar to a real data issue), it is not fully removing the duplicates.
The code I am using removes 2 of the 4 identical <body> elements instead of the expected 3.
I am attempting to create a Python script that cleans duplicates from XML files.
Code:
# Assumes etree is lxml.etree or xml.etree.ElementTree, and path is the input file.
tree = etree.parse(path)
root = tree.getroot()


def elements_equal(e1, e2):
    # Recursively compare type, tag, text, tail, attributes and children.
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag:
        return False
    if e1.text != e2.text:
        return False
    if e1.tail != e2.tail:
        return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])


prev = ""
for page in root:
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    # Remove after iterating so the loop over page is not disturbed.
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
tree.write("clean.xml")
XML:
<?xml version="1.0" encoding="UTF-8"?>
<emails>
<email>
<to>Vimal</to>
<from>Sonoo</from>
<heading>Hello</heading>
<body>Hello brother, how are you!</body>
<body>Hello brother, how are you!</body>
<body>Hello brother, how are you!</body>
<body>Hello brother, how are you!</body>
</email>
<email>
<to>Peter</to>
<from>Jack</from>
<heading>Birth day wish</heading>
<body>Happy birth day Tom!</body>
</email>
<email>
<to>James</to>
<from>Jaclin</from>
<heading>Morning walk</heading>
<body></body>
</email>
<email>
<to>Kartik</to>
<from>Kumar</from>
<heading>Health Tips</heading>
<body>Smoking is injurious to health!</body>
</email>
</emails>
Hopefully I am just missing something obvious, and I can learn what that is and move on happy.
Upvotes: 1
Views: 237
Reputation: 24928
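The reason you're getting this outcome is that the 3rd and 4th <body> elements differ in their tail properties, which have lengths 7 and 3, respectively. An element's tail is the text between its closing tag and the next tag; here it is presumably only whitespace (a line break plus indentation), and the last <body> is followed by the less-indented </email>. Consequently, the check
    if e1.tail != e2.tail:
        return False
returns False when the 4th <body> is compared, so it is not treated as a duplicate.
You can handle it by either removing the tail equality test or modifying the XML itself.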
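As a rough illustration, here is a minimal sketch of a variant of the first option: instead of dropping the tail check entirely, it compares text and tail after stripping whitespace. It assumes the standard-library xml.etree.ElementTree and that whitespace-only text is not significant in your data. SAMPLE, norm and remove_adjacent_duplicates are hypothetical names, and the indentation in SAMPLE was chosen only to reproduce tail lengths of 7 and 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the indentation is an assumption made to get
# tails of length 7 (newline + 6 spaces) and 3 (newline + 2 spaces).
SAMPLE = (
    "<emails>\n"
    "  <email>\n"
    "      <to>Vimal</to>\n"
    "      <body>Hello brother, how are you!</body>\n"
    "      <body>Hello brother, how are you!</body>\n"
    "      <body>Hello brother, how are you!</body>\n"
    "  </email>\n"
    "</emails>\n"
)


def norm(text):
    """Treat None and whitespace-only text as the same empty value."""
    return (text or "").strip()


def elements_equal(e1, e2):
    """Recursively compare two elements, ignoring surrounding whitespace."""
    if e1.tag != e2.tag or e1.attrib != e2.attrib:
        return False
    if norm(e1.text) != norm(e2.text) or norm(e1.tail) != norm(e2.tail):
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))


def remove_adjacent_duplicates(root):
    """Remove each child that equals the last kept sibling; return the count."""
    removed = 0
    for page in root:
        prev = None
        for elem in list(page):  # iterate over a copy so removal is safe
            if prev is not None and elements_equal(elem, prev):
                page.remove(elem)
                removed += 1
            else:
                prev = elem
    return removed


root = ET.fromstring(SAMPLE)
tail_lengths = [len(b.tail) for b in root[0].findall("body")]
removed = remove_adjacent_duplicates(root)
print(tail_lengths, removed)  # [7, 7, 3] 2
```

Printing repr(elem.tail) for each <body> in your own file is a quick way to confirm that the tails really are the only difference. Note that prev is reset for every <email> here, so duplicates are only detected within one parent.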
Upvotes: 1