Guilherme Schults

Reputation: 53

Remove duplicate elements from XML

Before posting, I searched for similar questions, but none of the answers worked for my problem.

I have an XML file that contains information about movies. I want to remove every <movie> whose <title> duplicates an earlier one. Below is part of the XML:

<imdb>
<movie>
 <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
 <production_year>2009</production_year>
 <producer>Moritz, Neal H.</producer>
 <distributor>Columbia Pictures [us]</distributor>
 <distributor>Columbia Pictures [us]</distributor>
</movie>
<movie>
 <title>Blood and Donuts</title>
 <production_year>2000</production_year>
 <actor>Kauffmann, Matthew</actor>
 <actor>Benedetto, Marisa</actor>
 <producer>Schwarz, Jeffrey</producer>
 <director>Schwarz, Jeffrey</director>
</movie>
<movie>
 <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
 <production_year>2009</production_year>
 <producer>Moritz, Neal H.</producer>
 <distributor>Columbia Pictures [us]</distributor>
 <distributor>Columbia Pictures [us]</distributor>
</movie>
</imdb>

As you can see, there are two movies with the title "Alexander and the Terrible, Horrible, No Good, Very Bad Day". I want to remove all duplicated elements (there are others like this throughout the XML) and then write a new XML file with no duplicates.

The desired new xml should look like this:

<imdb>
    <movie>
     <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
     <production_year>2009</production_year>
     <producer>Moritz, Neal H.</producer>
     <distributor>Columbia Pictures [us]</distributor>
     <distributor>Columbia Pictures [us]</distributor>
    </movie>
    <movie>
     <title>Blood and Donuts</title>
     <production_year>2000</production_year>
     <actor>Kauffmann, Matthew</actor>
     <actor>Benedetto, Marisa</actor>
     <producer>Schwarz, Jeffrey</producer>
     <director>Schwarz, Jeffrey</director>
    </movie>
</imdb>

I found the example in "Python remove duplicate elements from xml tree", but it did not work for me. The code below is not mine; I modified it to fit what I need, but it still doesn't work. Can you help me with how to do this? Thank you for your help.

import xml.etree.ElementTree as ET

path = 'imdb.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("new_imdb.xml")

UPDATE

I tried using XSLT. It worked for my small example, but when I run it on the whole imdb.xml it does not remove the duplicate entries. Any help?

That's the code I used:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:key name="imdbMovie" match="movie" use="concat(title,' ',production_year)"/>

<xsl:template match="imdb">
    <xsl:copy>
        <xsl:for-each select="movie[count(. | key('imdbMovie',concat(title,' ',production_year))[1]) = 1]">
            <xsl:copy-of select="."/>
        </xsl:for-each>
    </xsl:copy>
</xsl:template>
</xsl:stylesheet>

Upvotes: 1

Views: 791

Answers (1)

dabingsou

Reputation: 2469

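Here's a solution using the simplified_scrapy library: keep a dictionary of the titles already seen and remove any movie whose title repeats.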

from simplified_scrapy import SimplifiedDoc
xml = '''
<imdb>
<movie>
 <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
 <production_year>2009</production_year>
 <producer>Moritz, Neal H.</producer>
 <distributor>Columbia Pictures [us]</distributor>
 <distributor>Columbia Pictures [us]</distributor>
</movie>
<movie>
 <title>Blood and Donuts</title>
 <production_year>2000</production_year>
 <actor>Kauffmann, Matthew</actor>
 <actor>Benedetto, Marisa</actor>
 <producer>Schwarz, Jeffrey</producer>
 <director>Schwarz, Jeffrey</director>
</movie>
<movie>
 <title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
 <production_year>2009</production_year>
 <producer>Moritz, Neal H.</producer>
 <distributor>Columbia Pictures [us]</distributor>
 <distributor>Columbia Pictures [us]</distributor>
</movie>
</imdb>
'''
doc = SimplifiedDoc(xml)
movies = doc.selects('movie')
dic = {}
for movie in movies:
  title = movie.title.text
  if dic.get(title): # Use dictionary to remove duplicate
    movie.remove() # Delete duplicate nodes
  else:
    dic[title]=True
print(doc.html)
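
If you would rather stay with the standard library, the same idea can be sketched with xml.etree.ElementTree, which the question already uses. This is only a minimal sketch: the helper name remove_duplicate_movies and the shortened sample titles are made up for illustration, and it assumes every <movie> is a direct child of <imdb> and that the title alone decides what counts as a duplicate.

```python
import xml.etree.ElementTree as ET

def remove_duplicate_movies(root):
    """Remove <movie> children of root whose <title> text was already seen.

    Hypothetical helper for illustration; returns the number of removed movies.
    """
    seen = set()
    removed = 0
    # Iterate over a copy of the child list, because we remove while looping.
    for movie in list(root.findall('movie')):
        # Strip whitespace so " Title " and "Title" count as the same title.
        # Note: movies with no <title> all share the empty key ''.
        title = (movie.findtext('title') or '').strip()
        if title in seen:
            root.remove(movie)
            removed += 1
        else:
            seen.add(title)
    return removed

# Shortened sample data (titles "A" and "B" stand in for the real ones).
sample = """<imdb>
<movie><title>A</title><production_year>2009</production_year></movie>
<movie><title>B</title><production_year>2000</production_year></movie>
<movie><title>A</title><production_year>2009</production_year></movie>
</imdb>"""

root = ET.fromstring(sample)
removed = remove_duplicate_movies(root)
titles = [m.findtext('title') for m in root.findall('movie')]
print(removed, titles)
```

For the real file you would load it with ET.parse('imdb.xml'), call the helper on tree.getroot(), and save the result with tree.write('new_imdb.xml'). To treat two movies as duplicates only when both title and production year match, as the XSLT key in the question does, the set could hold (title, production_year) tuples instead.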

Upvotes: 1
