Reputation: 1
I have a large file and have found that some elements appear in it twice. I would like to delete the duplicates. Any ideas on what I could do? I would appreciate any help!
The XML looks like this:
<Toptag>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc" time="" id=" 123" name="xyz" >
<div>
This is text
</div>
</text>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc"
time="" id=" 124" name="xyz" >
<div>
This is text
</div>
</text>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc" time="" id=" 123" name="xyz" >
<div>
This is text
</div>
</text>
....
</Toptag>
In the duplicates, everything from the opening <text ...> tag through <div> ... </div> to the closing </text> is exactly the same!
Thank you!!!!!!
Upvotes: 0
Views: 118
Reputation: 163322
If you can define a function f:signature(element(text)) that returns the same value for two elements if and only if they are considered equal, then you can use XSLT 2.0 grouping to eliminate the duplicates:
<xsl:for-each-group select="text" group-by="f:signature(.)">
<xsl:copy-of select="current-group()[1]"/>
</xsl:for-each-group>
If the elements have very different structure then writing a signature function might be difficult. But if they are all very similar, as your example seems to suggest, then you could use, for example
<xsl:function name="f:signature" as="xs:string">
<xsl:param name="e" as="element(text)"/>
<xsl:sequence select="string-join($e!(@coordinates, @country, @date, @lang, @place, string(.)), '|')"/>
</xsl:function>
Note: I used the XSLT 3.0 "!" operator because you don't want the attributes sorted into document order (document order of attributes is unpredictable). In 2.0, where "!" isn't available, you can spell it out as ($e/@coordinates, $e/@country, $e/@date, ...).
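For completeness, here is a minimal sketch of how the grouping and the signature function might fit together in one XSLT 2.0 stylesheet. Some parts are assumptions rather than anything from the question: the namespace URI bound to the f prefix is a made-up placeholder, and the signature here also includes @time, @id and @name, because in the sample the elements with id " 123" and " 124" otherwise differ in nothing else and would be treated as duplicates of each other.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f">
  <!-- The f namespace URI above is a hypothetical placeholder. -->

  <!-- Builds one string per text element; equal strings mean "duplicate".
       Including @time, @id and @name is an assumption, not part of the
       original example. -->
  <xsl:function name="f:signature" as="xs:string">
    <xsl:param name="e" as="element(text)"/>
    <xsl:sequence select="string-join(
        ($e/@coordinates, $e/@country, $e/@date, $e/@lang,
         $e/@place, $e/@time, $e/@id, $e/@name, string($e)), '|')"/>
  </xsl:function>

  <!-- Copies Toptag and keeps only the first text element of each group. -->
  <xsl:template match="Toptag">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="text" group-by="f:signature(.)">
        <xsl:copy-of select="current-group()[1]"/>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that this sketch only copies the text children of Toptag, so any other child elements and the whitespace between elements would be dropped. Also, if some of the attributes can be absent, the joined values shift position, so a more defensive signature might be needed.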
Upvotes: 1
Reputation: 167506
Assuming you use at least XSLT 2, you have access to the deep-equal function (https://www.w3.org/TR/xpath-functions/#func-deep-equal) and can therefore write an empty template
<xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>
together with the identity transformation (e.g. in XSLT 3 using the appropriate xsl:mode declaration, or in XSLT 2 by spelling it out):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>
</xsl:stylesheet>
That way, those text elements that have a preceding sibling text element that is deep-equal are not copied: https://xsltfiddle.liberty-development.net/94hvTzF
Obviously, the condition in the predicate could be adjusted to compare against all preceding text elements (for example via the preceding axis) rather than only the preceding siblings.
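For XSLT 2 processors, where xsl:mode is not available, the "spelled out" variant might look roughly like the following sketch. It uses the conventional identity template in place of the xsl:mode declaration; the empty template is the same one as above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Identity template: copies every node and attribute unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: suppresses a text element if an earlier sibling
       text element is deep-equal to it. -->
  <xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>

</xsl:stylesheet>
```

The pattern with the predicate gets a higher default priority than the identity template, so it wins for the duplicate elements without an explicit priority attribute.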
Upvotes: 1