Removing duplicates from XML with Python

Question

I have an automatically updated XML file with the following format:


  Movie 1
  pics/movies/3a6f22.jpg
  IMDB link
  Fri Dec  3 03:02:05 2018


  Movie 2
  pics/movies/ae4r12.jpg
  IMDB link
  Fri Dec  3 05:34:06 2018


  Movie 1
  pics/movies/3a6f22.jpg
  IMDB link
  Sat Dec  4 12:04:06 2018


  Movie 3
  pics/movies/3f44j2.jpg
  IMDB link
  Sat Dec 4  14:04:07 2018

My desired output would be:


  Movie 1
  pics/movies/3a6f22.jpg
  IMDB link
  Fri Dec  3 03:02:05 2018


  Movie 2
  pics/movies/ae4r12.jpg
  IMDB link
  Fri Dec  3 05:34:06 2018


  Movie 3
  pics/movies/3f44j2.jpg
  IMDB link
  Sat Dec 4  14:04:07 2018

This file is read by JavaScript and PHP to build a list styled with CSS. I am trying to filter out any duplicates (e.g. the last entry titled Movie 1). I've searched and found some XSL/XSLT solutions, but I couldn't get them to work. I would like to remove any entry whose title duplicates an earlier one; the summary, link, and time do not need to match.

I have tried:

from lxml import etree

data = open('xmlparse.xsl')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)
dom = etree.parse("movies.old.xml")
transform = etree.XSLT(xslt_root)
result = transform(dom)
f = open('movies.new.xml', 'w')
f.write(str(result))
f.close()

which pulls from this .xsl

This has no effect, though: my new output file stays empty.

(I have also tried using unique_everseen, but that mercilessly deletes data, moves the time attributes to the end of the file, etc.)

Parfait · Accepted Answer

Consider using XSLT 1.0's grouping technique, the Muenchian Method. The script and demo below assume your root node is named root:

XSLT

XSLT Demo
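
For illustration, here is a minimal sketch of what a Muenchian-style stylesheet for this problem could look like, run inline with lxml so it can be tried without separate files. The element names `movie`, `title`, and `time`, the key name `movie_by_title`, and the sample data are assumptions made for this example; only the root name `root` comes from the answer. Adjust the names to match your real file.

```python
from lxml import etree

# ASSUMPTION: <root> wraps <movie> items, each with a <title> child.
# These element names are hypothetical; rename them to fit your XML.
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Index every movie by the text of its title -->
  <xsl:key name="movie_by_title" match="movie" use="title"/>

  <!-- Identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop any movie that is not the first one in its title group -->
  <xsl:template
      match="movie[generate-id() != generate-id(key('movie_by_title', title)[1])]"/>
</xsl:stylesheet>
"""

# Made-up sample data mirroring the shape described in the question.
SAMPLE_XML = b"""\
<root>
  <movie><title>Movie 1</title><time>Fri Dec  3 03:02:05 2018</time></movie>
  <movie><title>Movie 2</title><time>Fri Dec  3 05:34:06 2018</time></movie>
  <movie><title>Movie 1</title><time>Sat Dec  4 12:04:06 2018</time></movie>
  <movie><title>Movie 3</title><time>Sat Dec  4 14:04:07 2018</time></movie>
</root>
"""

transform = etree.XSLT(etree.XML(XSLT_SOURCE))
result = transform(etree.XML(SAMPLE_XML))

# Collect the titles that survive the transform, in document order.
titles = [t.text for t in result.getroot().iter("title")]
print(titles)
```

The empty template has a higher default priority than the identity template, so later movies with an already-seen title are skipped while everything else is copied through untouched.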

Python

from lxml import etree

# LOAD XML AND XSLT
dom = etree.parse('movies_old.xml')
xsl = etree.parse('xslt_script.xsl')

# TRANSFORM XML
transform = etree.XSLT(xsl)
result = transform(dom)

# SAVE OUTPUT TO FILE
with open('movies_new.xml', 'wb') as f:
    f.write(result)
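
If you would rather avoid XSLT altogether, a plain-Python pass over the parsed tree should also work. This is a different technique from the Muenchian Method above: it keeps a set of titles already seen and removes later entries with the same title. The sketch below uses only the standard library; the helper `drop_duplicate_titles`, the element names `movie`, `title`, and `time`, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def drop_duplicate_titles(root, item_tag="movie", key_tag="title"):
    """Remove each <item_tag> child of root whose <key_tag> text was already seen.

    ASSUMPTION: the tag names are hypothetical defaults; pass your own.
    """
    seen = set()
    # Iterate over a copy of the child list, since we remove while looping.
    for item in list(root.findall(item_tag)):
        title = item.findtext(key_tag)
        if title in seen:
            root.remove(item)  # later duplicate: drop the whole entry
        else:
            seen.add(title)    # first occurrence: keep it
    return root


# Made-up sample data mirroring the shape described in the question.
SAMPLE = """<root>
  <movie><title>Movie 1</title><time>Fri Dec  3 03:02:05 2018</time></movie>
  <movie><title>Movie 2</title><time>Fri Dec  3 05:34:06 2018</time></movie>
  <movie><title>Movie 1</title><time>Sat Dec  4 12:04:06 2018</time></movie>
  <movie><title>Movie 3</title><time>Sat Dec  4 14:04:07 2018</time></movie>
</root>"""

root = drop_duplicate_titles(ET.fromstring(SAMPLE))
titles = [m.findtext("title") for m in root.findall("movie")]
print(titles)  # ['Movie 1', 'Movie 2', 'Movie 3']
```

Because whole elements are removed rather than individual lines, the remaining entries keep their children and their original order. For a real file you would parse it with `ET.parse(...)`, call the helper on `tree.getroot()`, and write the tree back out.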
