Remove duplicate elements from XML

Question

Before posting here, I searched for similar questions, but none of them solved my problem.

I have an XML file that contains information about movies. What I want is to remove duplicated movie entries. Below is part of the XML:

    <imdb>
    <movie>
     <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
     2009
     Moritz, Neal H.
     Columbia Pictures [us]
     Columbia Pictures [us]

     Blood and Donuts
     2000
     Kauffmann, Matthew
     Benedetto, Marisa
     Schwarz, Jeffrey
     Schwarz, Jeffrey

     Alexander and the Terrible, Horrible, No Good, Very Bad Day
     2009
     Moritz, Neal H.
     Columbia Pictures [us]
     Columbia Pictures [us]

As you can see, there are two entries for "Alexander and the Terrible, Horrible, No Good, Very Bad Day" in the sample. I want to remove all duplicated elements (there are others like this one throughout the XML) and, at the end, create a new XML file with no duplicates.

The desired new xml should look like this:


    
     Alexander and the Terrible, Horrible, No Good, Very Bad Day
     2009
     Moritz, Neal H.
     Columbia Pictures [us]
     Columbia Pictures [us]
    
    
     Blood and Donuts
     2000
     Kauffmann, Matthew
     Benedetto, Marisa
     Schwarz, Jeffrey
     Schwarz, Jeffrey

I found the example in "Python remove duplicate elements from xml tree", but it did not work for me. The code I tried is not mine; I changed it to fit what I need, but it still didn't work. Can you help me with how to do this? Thank you for your help.

import xml.etree.ElementTree as ET

path = 'imdb.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("new_imdb.xml")
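The loop above only compares each element with the one immediately before it, and it compares the children inside each entry rather than the entries themselves, so a duplicate movie that appears later in the file is never matched. A minimal sketch of one alternative, assuming the duplicates are whole `<movie>` children of the root and that "duplicate" means same tag, text, attributes and children (ignoring surrounding whitespace). The helper names `element_key` and `remove_duplicate_children` and the `<year>` tag in the sample are made up for illustration:

```python
import xml.etree.ElementTree as ET


def element_key(elem):
    """Build a hashable fingerprint of an element and its subtree.

    Text is stripped so that indentation differences do not matter;
    the tail is ignored for the same reason.
    """
    return (
        elem.tag,
        (elem.text or "").strip(),
        tuple(sorted(elem.attrib.items())),
        tuple(element_key(child) for child in elem),
    )


def remove_duplicate_children(parent):
    """Remove children of `parent` that repeat an earlier child.

    Returns the number of removed children. Only direct children are
    compared, so repeated values inside one movie are left alone.
    """
    seen = set()
    removed = 0
    # Iterate over a copy: removing from the live element while
    # iterating over it would skip items.
    for child in list(parent):
        key = element_key(child)
        if key in seen:
            parent.remove(child)
            removed += 1
        else:
            seen.add(key)
    return removed


# Hypothetical sample; the <year> tag name is an assumption.
SAMPLE = """<imdb>
  <movie><title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title><year>2009</year></movie>
  <movie><title>Blood and Donuts</title><year>2000</year></movie>
  <movie><title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title><year>2009</year></movie>
</imdb>"""

root = ET.fromstring(SAMPLE)
removed_count = remove_duplicate_children(root)
titles = [movie.findtext("title") for movie in root]
print(removed_count, titles)
```

For the real file, the same function could be applied to `ET.parse('imdb.xml').getroot()` before calling `tree.write("new_imdb.xml")`. Because only the root's direct children are compared, repeated values inside a single movie (such as the two "Columbia Pictures [us]" lines in the desired output) are kept.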

UPDATE !

I tried using XSLT. It worked for my small example, but when I run it on the whole imdb.xml it does not remove the duplicate entries. Any help?

That's the code I used:

