Jeanne

Reputation: 1331

How to delete duplicated elements in XML file

Here is my XML file: it contains a duplicated element <houseNum>0</houseNum>.

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfHouse>
<XmlForm>
<houseNum>0</houseNum>
 <plan1> 
  <coord>
    <X> 1.2  </X>
    <Y> 2.1  </Y>
    <Z> 3.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 21.2  </X>
    <Y> 22.1  </Y>
    <Z> 31.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
</color>
 </plan2> 
</XmlForm>
<XmlForm>
<houseNum>0</houseNum>
 <plan1> 
  <coord>
    <X> 1.2  </X>
    <Y> 2.1  </Y>
    <Z> 3.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 21.2  </X>
    <Y> 22.1  </Y>
    <Z> 31.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
</color>
 </plan2> 
</XmlForm>

<XmlForm>
<houseNum>1</houseNum>
 <plan1> 
  <coord>
    <X> 11.2  </X>
    <Y> 12.1  </Y>
    <Z> 13.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 255   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 211.2  </X>
    <Y> 212.1  </Y>
    <Z> 311.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 255   </B>
</color>
 </plan2> 
</XmlForm>
</ArrayOfHouse>

In my case, there are two types of duplication:

1) If the duplicated elements are successive, here is my code to remove them: I compare element[i] and element[i+1], and if element[i].text == element[i+1].text, I delete element[i+1].

from lxml import etree


def Remove_Duplication_XML(xml_file):
    tree = etree.parse(xml_file)
    root = tree.getroot()

    # Collect every houseNum element in document order
    elementlist = list(root.iter('houseNum'))
    numframes = [x.text for x in elementlist]
    print(numframes)

    # Remove a houseNum element when it repeats the value of the one just before it
    for index_element in range(1, len(elementlist)):
        current = elementlist[index_element]
        previous = elementlist[index_element - 1]
        if current.text == previous.text:
            print(current.text)
            current.getparent().remove(current)

    # String xml without duplication
    file = etree.tostring(root).decode("utf-8")
    print(file)
    return file

2) If the duplicated elements are not successive, I am looking for a way to remove them. Any help?

Upvotes: 1

Views: 4920

Answers (1)

Parfait

Reputation: 107687

Consider XSLT, the special-purpose language designed to transform XML files (analogous to SQL, a special-purpose language for querying databases). Because you already use Python's lxml, you can run such a script directly, without a single for loop or if statement, to remove duplicates anywhere in the document.

Specifically, use Muenchian grouping, an XSLT 1.0 method, to index your XML document by houseNum using <xsl:key> and then keep only the first XmlForm for each distinct value. As an added bonus, the XSLT below also trims whitespace in text nodes and pretty-prints the output with indentation:

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="id" match="XmlForm" use="houseNum" />

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>

  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL FILES
xml = et.parse('Source.xml')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(xml)

# PRINT RESULT TO SCREEN
print(result)

# SAVE RESULT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Output (notice that text values are trimmed of surrounding whitespace)

<?xml version="1.0"?>
<ArrayOfHouse>
  <XmlForm>
    <houseNum>0</houseNum>
    <plan1>
      <coord>
        <X>1.2</X>
        <Y>2.1</Y>
        <Z>3.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>21.2</X>
        <Y>22.1</Y>
        <Z>31.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan2>
  </XmlForm>
  <XmlForm>
    <houseNum>1</houseNum>
    <plan1>
      <coord>
        <X>11.2</X>
        <Y>12.1</Y>
        <Z>13.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>255</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>211.2</X>
        <Y>212.1</Y>
        <Z>311.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>255</B>
      </color>
    </plan2>
  </XmlForm>
</ArrayOfHouse>
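
For comparison, the "keep the first XmlForm per houseNum" rule that the stylesheet expresses can also be sketched in plain Python with a set of already-seen keys. This is only a rough equivalent, not the XSLT approach itself: it uses the standard library's xml.etree.ElementTree instead of lxml, it assumes the XmlForm elements are direct children of the root, and it does not trim whitespace in the other text nodes. The function name drop_duplicate_forms and the small sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def drop_duplicate_forms(root, key_tag="houseNum"):
    """Keep the first XmlForm for each key value and remove later repeats."""
    seen = set()
    # Iterate over a copy of the child list so removal is safe.
    for form in list(root.findall("XmlForm")):
        key = form.findtext(key_tag)
        if key is None:
            # Assumption: forms without a key element are left untouched.
            continue
        key = key.strip()
        if key in seen:
            root.remove(form)
        else:
            seen.add(key)
    return root


# Hypothetical sample: the duplicate of houseNum 0 is not adjacent to the first.
sample = """<ArrayOfHouse>
  <XmlForm><houseNum>0</houseNum></XmlForm>
  <XmlForm><houseNum>1</houseNum></XmlForm>
  <XmlForm><houseNum>0</houseNum></XmlForm>
</ArrayOfHouse>"""

root = drop_duplicate_forms(ET.fromstring(sample))
remaining = [form.findtext("houseNum") for form in root.findall("XmlForm")]
print(remaining)  # ['0', '1']
```

Because the set remembers every key seen so far, this handles both successive and non-successive duplicates in a single pass. The XSLT version remains the more declarative option when the output also needs whitespace cleanup and indentation.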

Upvotes: 4
