Reputation: 53
Before posting, I searched for similar questions, but none of the answers worked for my problem.
I have an XML file that contains information about movies. I want to remove the <movie> entries that have a duplicated <title> value. Below is part of the XML:
<imdb>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
<producer>Moritz, Neal H.</producer>
<distributor>Columbia Pictures [us]</distributor>
<distributor>Columbia Pictures [us]</distributor>
</movie>
<movie>
<title>Blood and Donuts</title>
<production_year>2000</production_year>
<actor>Kauffmann, Matthew</actor>
<actor>Benedetto, Marisa</actor>
<producer>Schwarz, Jeffrey</producer>
<director>Schwarz, Jeffrey</director>
</movie>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
<producer>Moritz, Neal H.</producer>
<distributor>Columbia Pictures [us]</distributor>
<distributor>Columbia Pictures [us]</distributor>
</movie>
</imdb>
As you can see, there are two movies titled "Alexander and the Terrible, Horrible, No Good, Very Bad Day" in the file. I want to remove all such duplicated elements (there are others like this example throughout the XML) and, at the end, create a new XML file with no duplicate values.
The desired new xml should look like this:
<imdb>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
<producer>Moritz, Neal H.</producer>
<distributor>Columbia Pictures [us]</distributor>
<distributor>Columbia Pictures [us]</distributor>
</movie>
<movie>
<title>Blood and Donuts</title>
<production_year>2000</production_year>
<actor>Kauffmann, Matthew</actor>
<actor>Benedetto, Marisa</actor>
<producer>Schwarz, Jeffrey</producer>
<director>Schwarz, Jeffrey</director>
</movie>
</imdb>
I found an example in "Python remove duplicate elements from xml tree", but it did not work for me. The code below is not mine; I modified it to fit what I need, but it still does not work. Can you help me with how to do this? Thank you for your help.
import xml.etree.ElementTree as ET

path = 'imdb.xml'
tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:  # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)  # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("new_imdb.xml")
UPDATE:
I tried using XSLT. It worked for my example, but when I run it on the whole imdb.xml it does not work: the duplicate entries are not removed. Any help?
This is the stylesheet I used:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:key name="imdbMovie" match="movie" use="concat(title,' ',production_year)"/>
  <xsl:template match="imdb">
    <xsl:copy>
      <xsl:for-each select="movie[count(. | key('imdbMovie',concat(title,' ',production_year))[1]) = 1]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
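When a key-based approach like this works on a small sample but not on the full file, one thing worth checking is whether the entries that look like duplicates really produce identical keys. The Python sketch below is only a diagnostic and rests on an assumption the post does not confirm: that the "duplicates" in the full file differ in whitespace or letter case. The helper names `normalize` and `near_duplicate_keys` are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def normalize(text):
    """Collapse runs of whitespace and ignore letter case."""
    return " ".join((text or "").split()).casefold()


def near_duplicate_keys(root):
    """Group (title, production_year) pairs that differ only in whitespace or case.

    Returns a dict mapping each normalized key to the sorted list of distinct
    raw keys that collapse onto it. Only groups with more than one raw
    spelling are reported.
    """
    groups = defaultdict(set)
    for movie in root.iter("movie"):
        raw = (
            movie.findtext("title") or "",
            movie.findtext("production_year") or "",
        )
        groups[(normalize(raw[0]), normalize(raw[1]))].add(raw)
    return {norm: sorted(raws) for norm, raws in groups.items() if len(raws) > 1}


def report(path):
    """Print every group of near-duplicate keys found in the file at `path`."""
    root = ET.parse(path).getroot()
    for norm, raws in near_duplicate_keys(root).items():
        print(norm, "->", raws)
```

Calling `report('imdb.xml')` would list such groups. If it prints nothing, the keys are already identical and the cause lies elsewhere, for example in how the full file is structured.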
Upvotes: 1
Views: 791
Reputation: 2469
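Here's a solution using the simplified_scrapy library. It records each title it has seen in a dictionary and deletes any later <movie> node with the same title.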
from simplified_scrapy import SimplifiedDoc
xml = '''
<imdb>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
<producer>Moritz, Neal H.</producer>
<distributor>Columbia Pictures [us]</distributor>
<distributor>Columbia Pictures [us]</distributor>
</movie>
<movie>
<title>Blood and Donuts</title>
<production_year>2000</production_year>
<actor>Kauffmann, Matthew</actor>
<actor>Benedetto, Marisa</actor>
<producer>Schwarz, Jeffrey</producer>
<director>Schwarz, Jeffrey</director>
</movie>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
<producer>Moritz, Neal H.</producer>
<distributor>Columbia Pictures [us]</distributor>
<distributor>Columbia Pictures [us]</distributor>
</movie>
</imdb>
'''
doc = SimplifiedDoc(xml)
movies = doc.selects('movie')
dic = {}
for movie in movies:
    title = movie.title.text
    if dic.get(title):  # Use dictionary to remove duplicate
        movie.remove()  # Delete duplicate nodes
    else:
        dic[title] = True
print(doc.html)
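If you would rather avoid a third-party dependency, the same "remember what you have seen" idea can be sketched with the standard library's xml.etree.ElementTree. This is a minimal sketch, not a drop-in fix: it assumes every <movie> is a direct child of the root element and that a duplicate means the same title and production year (the key used in the question's XSLT). The function name `drop_duplicate_movies` and the shortened `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample from the question (assumption: same layout).
SAMPLE = """<imdb>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
</movie>
<movie>
<title>Blood and Donuts</title>
<production_year>2000</production_year>
</movie>
<movie>
<title>Alexander and the Terrible, Horrible, No Good, Very Bad Day</title>
<production_year>2009</production_year>
</movie>
</imdb>"""


def drop_duplicate_movies(root):
    """Remove <movie> children of `root` whose (title, year) was already seen.

    The first occurrence is kept. Returns the number of removed elements.
    """
    seen = set()
    removed = 0
    # findall() returns a new list, so removing from `root` while looping is safe.
    for movie in root.findall("movie"):
        key = (
            (movie.findtext("title") or "").strip(),
            (movie.findtext("production_year") or "").strip(),
        )
        if key in seen:
            root.remove(movie)
            removed += 1
        else:
            seen.add(key)
    return removed


root = ET.fromstring(SAMPLE)
removed_count = drop_duplicate_movies(root)
print("removed", removed_count, "duplicate movie(s)")
print(ET.tostring(root, encoding="unicode"))
```

For the real file you would parse it with `tree = ET.parse('imdb.xml')`, call `drop_duplicate_movies(tree.getroot())`, and then save the result with `tree.write('new_imdb.xml', encoding='utf-8', xml_declaration=True)`. A set lookup keeps the pass linear in the number of movies, which matters for a large file.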
Upvotes: 1