Reputation: 1331
Here is my XML file. It contains a duplicated element, <houseNum>0</houseNum>:
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfHouse>
<XmlForm>
<houseNum>0</houseNum>
<plan1>
<coord>
<X> 1.2 </X>
<Y> 2.1 </Y>
<Z> 3.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 21.2 </X>
<Y> 22.1 </Y>
<Z> 31.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan2>
</XmlForm>
<XmlForm>
<houseNum>0</houseNum>
<plan1>
<coord>
<X> 1.2 </X>
<Y> 2.1 </Y>
<Z> 3.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 21.2 </X>
<Y> 22.1 </Y>
<Z> 31.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan2>
</XmlForm>
<XmlForm>
<houseNum>1</houseNum>
<plan1>
<coord>
<X> 11.2 </X>
<Y> 12.1 </Y>
<Z> 13.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 255 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 211.2 </X>
<Y> 212.1 </Y>
<Z> 311.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 255 </B>
</color>
</plan2>
</XmlForm>
</ArrayOfHouse>
In my case, there are two types of duplication:
1) If the duplicated elements are successive, here is my code to remove them. I compare element[i] and element[i+1]; if element[i].text == element[i+1].text, I delete element[i+1].
import os
import time

from lxml import etree


def Remove_Duplication_XML(xml_file):
    base_name = os.path.basename(xml_file)
    start_time = time.time()
    tree = etree.parse(xml_file)
    root = tree.getroot()
    # Collect every <houseNum> element in document order
    elementlist = list(root.iter('houseNum'))
    numframes = [x.text for x in elementlist]
    print(numframes)
    # Remove a <houseNum> whose text equals that of the one just before it.
    # Note: this removes only the <houseNum> element, not its parent <XmlForm>.
    for index_element in range(1, len(elementlist)):
        current = elementlist[index_element]
        if current.text == elementlist[index_element - 1].text:
            print(current.text)
            current.getparent().remove(current)
    # Serialize the XML without the successive duplicates
    xml_string = etree.tostring(root).decode("utf-8")
    print(xml_string)
2) If the duplicated elements are not successive, I am looking for a way to remove them. Any help?
Upvotes: 1
Views: 4920
Reputation: 107687
Consider XSLT, the special-purpose language designed to transform XML files (analogous to using SQL, also special-purpose, to query databases). Because you already use Python's lxml, you can run such a script seamlessly, without a single for loop or if logic, to remove duplicates anywhere in the document.
Specifically, use Muenchian Grouping, an XSLT 1.0 method, to index your XML document by houseNum with <xsl:key> and then return only the distinct groupings. As an added bonus, the XSLT below also trims whitespace in text nodes and pretty-prints the output with indentation:
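As a rough mental model of what the grouping achieves: the first XmlForm for each distinct houseNum is kept and later ones are dropped, wherever they appear. The sketch below expresses that idea in plain Python; it is not what the XSLT processor executes. It assumes the standard library's xml.etree.ElementTree, a hypothetical helper name keep_first_per_key, and a made-up three-item sample.

```python
import xml.etree.ElementTree as ET


def keep_first_per_key(root, item_tag="XmlForm", key_tag="houseNum"):
    """Drop every item_tag child of root whose key_tag text was already seen."""
    seen = set()
    # Iterate over a copy because children are removed while looping.
    for item in list(root.findall(item_tag)):
        # strip() is an extra convenience here; the XSLT key uses the raw value.
        key = (item.findtext(key_tag) or "").strip()
        if key in seen:
            root.remove(item)
        else:
            seen.add(key)
    return root


# Hypothetical sample: houseNum 0 appears twice, not consecutively.
sample = ET.fromstring(
    "<ArrayOfHouse>"
    "<XmlForm><houseNum>0</houseNum></XmlForm>"
    "<XmlForm><houseNum>1</houseNum></XmlForm>"
    "<XmlForm><houseNum>0</houseNum></XmlForm>"
    "</ArrayOfHouse>"
)
keep_first_per_key(sample)
remaining = [form.findtext("houseNum") for form in sample.findall("XmlForm")]
print(remaining)
```

The XSLT version reaches the same result declaratively, with the key lookup playing the role of the seen set.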
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="id" match="XmlForm" use="houseNum" />
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et

# LOAD XML AND XSL FILES
xml = et.parse('Source.xml')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(xml)

# PRINT RESULT TO SCREEN
print(result)

# SAVE RESULT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Output (note that text values are trimmed of surrounding whitespace)
<?xml version="1.0"?>
<ArrayOfHouse>
  <XmlForm>
    <houseNum>0</houseNum>
    <plan1>
      <coord>
        <X>1.2</X>
        <Y>2.1</Y>
        <Z>3.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>21.2</X>
        <Y>22.1</Y>
        <Z>31.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan2>
  </XmlForm>
  <XmlForm>
    <houseNum>1</houseNum>
    <plan1>
      <coord>
        <X>11.2</X>
        <Y>12.1</Y>
        <Z>13.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>255</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>211.2</X>
        <Y>212.1</Y>
        <Z>311.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>255</B>
      </color>
    </plan2>
  </XmlForm>
</ArrayOfHouse>
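Because the key indexes the whole document, the same approach should also drop duplicates that are not adjacent, which is the second case in the question. Here is a minimal sketch to check this without any files. It assumes a hypothetical three-item input with a made-up <tag> child to tell the items apart, and it leaves out the text() template for brevity.

```python
import lxml.etree as et

# Hypothetical miniature input: houseNum 0 appears twice, separated by houseNum 1.
XML = b"""<ArrayOfHouse>
  <XmlForm><houseNum>0</houseNum><tag>a</tag></XmlForm>
  <XmlForm><houseNum>1</houseNum><tag>b</tag></XmlForm>
  <XmlForm><houseNum>0</houseNum><tag>c</tag></XmlForm>
</ArrayOfHouse>"""

# Same key and duplicate-suppressing template as the stylesheet above.
XSLT = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="id" match="XmlForm" use="houseNum"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>
</xsl:stylesheet>"""

# Build the transformer from the in-memory stylesheet and apply it.
transform = et.XSLT(et.fromstring(XSLT))
result = transform(et.fromstring(XML))

# Collect the surviving houseNum values and their marker tags.
forms = list(result.getroot().iter('XmlForm'))
kept = [form.findtext('houseNum') for form in forms]
tags = [form.findtext('tag') for form in forms]
print(kept, tags)
```

Only the first XmlForm for each houseNum should remain, so the third item (tag c) is dropped even though it does not directly follow the first.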
Upvotes: 4