Reputation: 1147
I have several thousand documents that contain duplicate element nodes. How can I find and remove duplicate title
elements in my XML files?
Using fn:distinct-values() causes performance issues.
For example: 01.xml
<doc>
<pdf>1</pdf>
<title>Head First JavaScript</title>
<title>Head First JavaScript</title>
</doc>
02.xml
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
Desired result: 01.xml
<doc>
<pdf>1</pdf>
<title>Head First JavaScript</title>
</doc>
02.xml
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
Upvotes: 1
Views: 180
Reputation: 239
Please try the following code:
let $doc :=
<doc>
<title>Head First JavaScript</title>
<title>Head First JavaScript</title>
<title>hellao</title>
<title>hello</title>
<title>hello</title>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>ahello</title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
for $data in $doc//title[not(. = preceding-sibling::title)]
return $data
This keeps each title whose value has not already appeared in an earlier title sibling. (Comparing against preceding-sibling::title rather than preceding-sibling::node() avoids pointless comparisons against whitespace text nodes.)
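Note that this returns only the surviving title elements. If the goal is to rebuild the whole document with the duplicates dropped, a minimal sketch along the same lines (using a sample $doc like the question's 01.xml) might be:

```xquery
xquery version "1.0-ml";
let $doc :=
  <doc>
    <pdf>1</pdf>
    <title>Head First JavaScript</title>
    <title>Head First JavaScript</title>
  </doc>
return
  (: Keep every child except a title whose value already
     appeared in an earlier title sibling :)
  element doc {
    $doc/@*,
    $doc/node()[not(self::title and . = preceding-sibling::title)]
  }
```

This preserves document order and leaves non-title children (such as pdf) untouched.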
Upvotes: 1
Reputation: 66783
One of the easiest ways to remove duplicate title
elements would be with an XSLT transformation.
xquery version "1.0-ml";
declare variable $XSLT :=
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<!-- This identity template copies all content by default -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- This template matches (and removes) title elements whose text equals
the text of the immediately preceding title sibling.
You could match more generically on any element (*) or add other match criteria.
-->
<xsl:template match="title[text() = preceding-sibling::title[1]/text()]"/>
</xsl:stylesheet>;
xdmp:xslt-eval($XSLT,
<doc>
<pdf>0</pdf>
<title>Python: Programming Basics for Absolute Beginners </title>
<title>Python: Programming Basics for Absolute Beginners </title>
</doc>
)
If it's just several thousand documents, then you might be able to transform and save them all in one module execution.
Otherwise, applying the transform in a CoRB job as @hunterhacker suggests would ensure that you don't need to worry about timeouts by splitting up the work into individual executions.
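A minimal sketch of the one-module approach, combined with the $XSLT variable declared above. The "/docs/" directory is a placeholder assumption, not something from the question:

```xquery
xquery version "1.0-ml";
(: $XSLT is the stylesheet variable declared above;
   "/docs/" is a hypothetical directory holding the affected files :)
for $uri in cts:uris((), (), cts:directory-query("/docs/", "infinity"))
return xdmp:document-insert($uri, xdmp:xslt-eval($XSLT, doc($uri)))
```

For a large dataset this single transaction can hit timeout or memory limits, which is exactly the case where the CoRB approach below is the better fit.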
Upvotes: 0
Reputation: 7132
I suggest you run a CoRB job and have each document processed individually. Then the exact code you run for each document won't matter much, as long as it does the work. It's the kind of job you can let run overnight if you have a massive dataset.
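For reference, a CoRB job is driven by a properties file roughly along these lines. The connection string and module file names here are placeholders, not values from this question:

```
# Hypothetical CoRB properties; replace host, credentials, and module paths
XCC-CONNECTION-URI=xcc://user:password@localhost:8000
URIS-MODULE=select-docs-with-duplicate-titles.xqy|ADHOC
PROCESS-MODULE=remove-duplicate-titles.xqy|ADHOC
THREAD-COUNT=8
```

The URIs module selects the documents to fix, and the process module applies the deduplication (for example, the XSLT from the other answer) to one document per invocation, so no single transaction grows large enough to time out.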
Upvotes: 0