Reputation: 1147
I have several thousand documents that contain duplicate element nodes. How can I find and remove duplicate title
elements in my XML files?
Using fn:distinct-values() causes performance issues.
For example: 01.xml
<doc>
<pdf>1</pdf>
<title>Head First JavaScript</title>
<title>Head First JavaScript</title>
</doc>
02.xml
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
Desired result: 01.xml
<doc>
<pdf>1</pdf>
<title>Head First JavaScript</title>
</doc>
02.xml
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
Upvotes: 1
Views: 180
Reputation: 239
Please try the following code:
let $doc :=
<doc>
<title>Head First JavaScript</title>
<title>Head First JavaScript</title>
<title>hellao</title>
<title>hello</title>
<title>hello</title>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>ahello</title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
for $data in $doc//title[not(. = preceding-sibling::title)]
return $data
This keeps each title whose value has not already appeared in an earlier title sibling. (Comparing against preceding-sibling::title rather than preceding-sibling::node() avoids pointless comparisons against whitespace text nodes.)
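Note that this returns only the surviving title elements. If the goal is to rebuild the whole document with the duplicates dropped, a minimal sketch along the same lines (using a sample $doc like the question's 01.xml) might be:

```xquery
xquery version "1.0-ml";
let $doc :=
  <doc>
    <pdf>1</pdf>
    <title>Head First JavaScript</title>
    <title>Head First JavaScript</title>
  </doc>
return
  (: Keep every child except a title whose value already
     appeared in an earlier title sibling :)
  element doc {
    $doc/@*,
    $doc/node()[not(self::title and . = preceding-sibling::title)]
  }
```

This preserves document order and leaves non-title children (such as pdf) untouched.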
Upvotes: 1
Reputation: 66783
One of the easiest ways to remove duplicate title
elements would be with an XSLT transformation.
xquery version "1.0-ml";
declare variable $XSLT :=
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<!-- This identity template copies all content by default -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- This template matches (and removes) title elements whose text equals
the text of the immediately preceding title sibling.
You could match more generically on any element (*) or add other match criteria.
-->
<xsl:template match="title[text() = preceding-sibling::title[1]/text()]"/>
</xsl:stylesheet>;
xdmp:xslt-eval($XSLT,
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
)
If it's just several thousand documents, then you might be able to transform and save them all in one module execution.
Otherwise, applying the transform in a CoRB job as @hunterhacker suggests would ensure that you don't need to worry about timeouts by splitting up the work into individual executions.
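A minimal sketch of the one-module approach, combined with the $XSLT variable declared above. The "/docs/" directory is a placeholder assumption, not something from the question:

```xquery
xquery version "1.0-ml";
(: $XSLT is the stylesheet variable declared above;
   "/docs/" is a hypothetical directory holding the affected files :)
for $uri in cts:uris((), (), cts:directory-query("/docs/", "infinity"))
return xdmp:document-insert($uri, xdmp:xslt-eval($XSLT, doc($uri)))
```

For a large dataset this single transaction can hit timeout or memory limits, which is exactly the case where the CoRB approach below is the better fit.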
Upvotes: 0
Reputation: 7132
I suggest you run a CoRB job and have each document processed individually. Then the exact code you run for each document won't matter much, as long as it does the work. It's the kind of job you can let run overnight if you have a massive dataset.
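For reference, a CoRB job is driven by a properties file roughly along these lines. The connection string and module file names here are placeholders, not values from this question:

```
# Hypothetical CoRB properties; replace host, credentials, and module paths
XCC-CONNECTION-URI=xcc://user:password@localhost:8000
URIS-MODULE=select-docs-with-duplicate-titles.xqy|ADHOC
PROCESS-MODULE=remove-duplicate-titles.xqy|ADHOC
THREAD-COUNT=8
```

The URIs module selects the documents to fix, and the process module applies the deduplication (for example, the XSLT from the other answer) to one document per invocation, so no single transaction grows large enough to time out.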
Upvotes: 0