Reputation: 3200

Efficiently removing duplicate XML elements in C#

I have a couple of XML files that contain lots of duplicate entries, such as these:

<annotations>
  <annotation value=",Clear,Outdoors" eventID="2">
    <image location="Location 1" />
    <image location="Location 2" />
    <image location="Location 2" />
  </annotation>

  <annotation value=",Not a problem,Gravel,Shopping" eventID="2">
    <image location="Location 3" />
    <image location="Location 4" />
    <image location="Location 5" />
    <image location="Location 5" />
    <image location="Location 5" />
  </annotation>
</annotations>

I want to remove the duplicate elements within each annotation. The way I approached this was to copy all the elements to a list and then compare them:

foreach (var el in xdoc.Descendants("annotation").ToList())
{
    foreach (var x in el.Elements("image").Attributes("location").ToList())
    {
        // add elements to a list
    }
}

Halfway through, I realized this is very inefficient and time-consuming. I'm fairly new to XML, and I was wondering whether there are any built-in methods in C# that I can use to remove duplicates.

I tried using

if(!x.value.Distinct()) // can't convert collections to bool
    x.Remove();

But that doesn't work, neither does

if(x.value.count() > 1) // value.count returns the number of elements.
   x.Remove()

Upvotes: 1

Answers (3)

Tony Stark

Reputation: 771

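using System.Linq;
using System.Xml.Linq;

XDocument xDoc = XDocument.Parse(xmlString);

// Within each annotation, group the images by location and remove
// every element after the first one in each group.
// Note: g.Attribute("location") is null for an image without that attribute.
xDoc.Root.Elements("annotation")
         .SelectMany(s => s.Elements("image")
                           .GroupBy(g => g.Attribute("location").Value)
                           .SelectMany(m => m.Skip(1))).Remove();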

Upvotes: 6

Steven Evers

Reputation: 17206

There are a couple of things that you could do here. In addition to the other answers so far, note that Distinct() has an overload that takes an IEqualityComparer. You could use something like a ProjectionEqualityComparer (a helper class that is not part of the framework, so you would need to supply it) to do something like this:

var images = xdoc.Descendants("image")
    .Distinct(ProjectionEqualityComparer<XElement>.Create(xe => xe.Attributes("location").First().Value));

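... which would give you one "image" element for each distinct location value. Note that this only returns the elements and does not modify the document, and because it starts from Descendants("image") it compares locations across the whole document rather than within each annotation.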
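To make the comparer idea concrete, here is a minimal sketch of a key-based IEqualityComparer and one way to use it to actually remove duplicates per annotation. KeyComparer and ImageDeduplicator are hypothetical names made up for this illustration (they are not the ProjectionEqualityComparer mentioned above), and the sketch feeds the comparer to a HashSet instead of Distinct() so that the first occurrence is the one that is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: treats two items as equal when their projected keys are equal.
public sealed class KeyComparer<T, TKey> : IEqualityComparer<T>
{
    private readonly Func<T, TKey> _key;
    private readonly IEqualityComparer<TKey> _keys = EqualityComparer<TKey>.Default;

    public KeyComparer(Func<T, TKey> key) { _key = key; }

    public bool Equals(T x, T y) => _keys.Equals(_key(x), _key(y));
    public int GetHashCode(T obj) => _keys.GetHashCode(_key(obj));
}

// Hypothetical helper class for this example.
public static class ImageDeduplicator
{
    // Removes repeated <image> elements inside each <annotation>,
    // keeping the first one per location. Returns the number removed.
    public static int RemoveDuplicateImages(XDocument doc)
    {
        // The cast yields null when the attribute is missing, so images
        // without a location are treated as duplicates of each other.
        var byLocation = new KeyComparer<XElement, string>(
            e => (string)e.Attribute("location"));

        int removed = 0;
        foreach (var annotation in doc.Descendants("annotation"))
        {
            var seen = new HashSet<XElement>(byLocation);
            // ToList() takes a snapshot so removing nodes does not disturb the loop.
            foreach (var image in annotation.Elements("image").ToList())
            {
                if (!seen.Add(image))
                {
                    image.Remove();
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

The same byLocation comparer could also be passed to Distinct() if you only need the unique elements and do not want to change the document.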

Upvotes: 0

Flynn1179

Reputation: 12075

If your duplicates are always in this form, then you could do this with a bit of XSLT to remove duplicate nodes. The XSLT for this is:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="image[@location = preceding-sibling::image/@location]"/>
</xsl:stylesheet>

If this is something you need to do frequently, it might be worth loading that stylesheet into an XslCompiledTransform instance once and reusing it.
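As a rough sketch of that approach, the following loads the stylesheet once into an XslCompiledTransform and reuses it for every document. The class name XsltDeduplicator and the choice to embed the stylesheet as a string constant are assumptions made for this example; loading it from a file would work just as well.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical helper class for this example.
public static class XsltDeduplicator
{
    // Same stylesheet as above, embedded as a string for self-containment.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""node()|@*"">
    <xsl:copy>
      <xsl:apply-templates select=""node()|@*""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""image[@location = preceding-sibling::image/@location]""/>
</xsl:stylesheet>";

    // Compiled once on first use and shared by all calls.
    private static readonly XslCompiledTransform Compiled = Load();

    private static XslCompiledTransform Load()
    {
        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    // Returns a new document with the duplicate <image> elements left out;
    // the input document is not modified.
    public static XDocument RemoveDuplicateImages(XDocument input)
    {
        var output = new XDocument();
        using (var reader = input.CreateReader())
        using (var writer = output.CreateWriter())
        {
            Compiled.Transform(reader, writer);
        }
        return output;
    }
}
```

Unlike the in-place LINQ approaches, this produces a copy, which may or may not be what you want for large files.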

Or you can simply get a list of all duplicate nodes using this XPath:

/annotations/annotation/image[@location = preceding-sibling::image/@location]

and remove them from their parent.
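A minimal sketch of the XPath variant with LINQ to XML might look like the following. XPathDeduplicator is a hypothetical name for this example; XPathSelectElements comes from the System.Xml.XPath namespace.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper class for this example.
public static class XPathDeduplicator
{
    // Selects every <image> whose location already appeared on an earlier sibling.
    private const string DuplicateImages =
        "/annotations/annotation/image[@location = preceding-sibling::image/@location]";

    // Removes the duplicates in place and returns how many were removed.
    public static int RemoveDuplicateImages(XDocument doc)
    {
        // Materialize the matches first so the document is not modified
        // while the XPath query is still being evaluated.
        var duplicates = doc.XPathSelectElements(DuplicateImages).ToList();
        foreach (var duplicate in duplicates)
        {
            duplicate.Remove();
        }
        return duplicates.Count;
    }
}
```

With the sample document from the question, this should remove one image from the first annotation and two from the second.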

Upvotes: 0
