user1765862

Reputation: 14155

What's the fastest way to find and delete duplicate nodes inside XML?

The XML file has a structure like this:

<Nodes>
   <Node> one </Node>
   <Node> two </Node>
   <Node> three </Node>
   <Node> three </Node>
</Nodes>

Since the XML file has more than 30,000 nodes, I'm looking for the fastest way to find and delete the duplicate nodes.

How would you do it?

Upvotes: 0

Views: 1837

Answers (2)

Michael Kay

Reputation: 163458

Try an XSLT 2.0 transformation:

<Nodes xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="2.0">
 <xsl:for-each-group select="/Nodes/Node" group-by=".">
  <xsl:copy-of select="current-group()[1]"/>
 </xsl:for-each-group>
</Nodes>

You can run that from C# using an XSLT 2.0 processor such as Saxon or XmlPrime; the built-in XslCompiledTransform only supports XSLT 1.0.

Upvotes: 1

Selman Genç

Reputation: 101701

You could use a HashSet<string> to track the values you have already seen:

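using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

var values = new HashSet<string>();
var xmlDocument = XDocument.Load("path");

// ToList() takes a snapshot, so nodes can be removed while iterating.
foreach (var node in xmlDocument.Root.Elements("Node").ToList())
{
    // Add returns false when the value has already been seen.
    if (!values.Add((string)node))
        node.Remove();
}

xmlDocument.Save("newpath");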

Another way is to implement an IEqualityComparer<XElement> and then use the Distinct method.
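
A minimal sketch of that comparer approach might look like the following. The names NodeValueComparer and NodeDeduplicator are hypothetical, and the sketch assumes two nodes count as duplicates when their text values match exactly (including surrounding whitespace), as in the HashSet version above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical comparer: two elements are equal when their text values match.
public sealed class NodeValueComparer : IEqualityComparer<XElement>
{
    public bool Equals(XElement x, XElement y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Value == y.Value;
    }

    // Must be consistent with Equals: equal values give equal hash codes.
    public int GetHashCode(XElement element) => element.Value.GetHashCode();
}

// Hypothetical helper that applies the comparer to the <Node> children of the root.
public static class NodeDeduplicator
{
    public static void RemoveDuplicates(XDocument document)
    {
        // Materialize the distinct elements before touching the tree.
        var unique = document.Root
            .Elements("Node")
            .Distinct(new NodeValueComparer())
            .ToList();

        // Replace the root's child nodes with the de-duplicated list.
        document.Root.ReplaceNodes(unique);
    }
}
```

You would call NodeDeduplicator.RemoveDuplicates(xmlDocument) between Load and Save. Note that ReplaceNodes drops every child of the root that is not in the list, so this only fits documents where the root contains nothing but Node elements. In practice it does the same hashing work as the HashSet loop, so I would not expect it to be faster; it mainly helps if you want to reuse the comparison logic.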

Upvotes: 7
