Mark

Reputation: 981

Replace and Merge "Duplicates" From a HashSet

I have a HashSet<T> of a hierarchical object in C# 4.0. The main key is an int, but there are occasionally secondary keys which are duplicated. I would like to merge the entries with duplicated secondary keys. In this example the secondary key is Name:

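struct Element
{
  // Fields are public so the snippets below can read and initialize them
  public int ID;
  public string Name;
  public List<int> Children;
  public List<int> Parents;

  public override int GetHashCode()
  {
    return ID;
  }
}

HashSet<Element> Elements = new HashSet<Element>();

// Example Elements
Elements.Add(new Element { ID = 1, Name = "Apple", Children = new List<int> { 10, 11, 12 }, Parents = new List<int> { 13, 14, 15 } });
Elements.Add(new Element { ID = 2, Name = "Banana", Children = new List<int> { 20, 21, 22 }, Parents = new List<int> { 23, 24, 25 } });
Elements.Add(new Element { ID = 3, Name = "Apple", Children = new List<int> { 30, 31, 32 }, Parents = new List<int> { 33, 34, 35 } });
Elements.Add(new Element { ID = 4, Name = "Food", Children = new List<int> { 1, 2, 3 }, Parents = new List<int>() });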

The goal is to remove the 3rd entry {3, "Apple",...} and then update and merge the Parent and Children references in the other remaining elements; the end result should be this:

{ 1, "Apple", Children = { 10, 11, 12, 30, 31, 32 }, Parents = { 13,14,15, 33, 34, 35 }}
{ 2, "Banana", Children = { 20, 21, 22 }, Parents = { 23,24,25 }}
{ 4, "Food", Children = {1, 2}, Parents = {} }

Here is what I have so far, but I can't figure out the best way to update the HashSet in place. I start by copying the HashSet so that I can delete while iterating. First I find the duplicates. If there are duplicates, I want to update them and then remove them from the copy. That is where I get stuck. Once I've updated the duplicates, I want to remove them and use a skip list to prevent processing them again:

var copy = new HashSet<Element>(Elements);
HashSet<int> skip = new HashSet<int>();
foreach (var e in Elements)
{
  if (!skip.Contains(e.ID))
  {
    var duplicates = Elements.Where(x => e.Name == x.Name && e.ID != x.ID);
    if (duplicates.Any())
    {           
      foreach (var d in duplicates)
      {
        // Iterate copy and update Parent and Children references
        // How do I do this part? 
      }

      // Remove the duplicates from the copied list
      copy.RemoveWhere(x => duplicates.Select(d => d.ID)
                                      .Contains(x.ID));

      // Don't process the duplicates again
      skip.UnionWith(duplicates.Select(d => d.ID));
    } 
  }
}
return copy;

I'm stuck at this point. Also, is there a slick way to do this with LINQ?

Update: The list is already like this; I don't have control over the initial contents. I suppose I could create a new wrapper that has a better Add method to prevent duplication.

Upvotes: 2

Views: 3384

Answers (3)

Fung

Reputation: 3558

If I understand you correctly, you want to:

  1. For each duplicated Name, remove all elements except the one with the lowest ID
  2. Merge the removed elements' Children and Parents lists into the remaining element
  3. In Children and Parents lists, replace references to the removed IDs with the remaining element's ID

These can be accomplished with the following code:

// Find all duplicated elements and remove them
var duplicates = Elements.GroupBy(x => x.Name)
                         .Where(x => x.Count() > 1)
                         .SelectMany(x => x.OrderBy(e => e.ID)
                                           .Skip(1)
                                           .Select(e => new { Element = e, NewID = x.Min(y => y.ID) }))
                         .ToDictionary(x => x.Element.ID, x => new { x.Element, x.NewID });
Elements.ExceptWith(duplicates.Values.Select(x => x.Element));

// Update the Children and Parents of each remaining element
foreach (var element in Elements)
{
    var removed = duplicates.Where(x => x.Value.Element.Name == element.Name);

    var mergedChildren = element.Children.Union(removed.SelectMany(x => x.Value.Element.Children))
                                         .Select(x => duplicates.ContainsKey(x) ? duplicates[x].NewID : x)
                                         .Distinct().ToList();
    element.Children.Clear();
    element.Children.AddRange(mergedChildren);


    var mergedParents = element.Parents.Union(removed.SelectMany(x => x.Value.Element.Parents))
                                       .Select(x => duplicates.ContainsKey(x) ? duplicates[x].NewID : x)
                                       .Distinct().ToList();
    element.Parents.Clear();
    element.Parents.AddRange(mergedParents);
}

Upvotes: 1

horgh

Reputation: 18563

You could try this:

var temp = Elements.GroupBy(e => e.Name)
                   .Select(g => new Element
                   {
                       ID = g.OrderBy(e => e.ID).First().ID,
                       Name = g.Key,
                       Children = g.SelectMany(e => e.Children).ToList(),
                       Parents = g.SelectMany(e => e.Parents).ToList()
                   })
                   .ToList(); // materialize once; a lazy query would re-run on every use
var duplicates = Elements.Where(e => !temp.Any(t => t.ID == e.ID))
                         .Select(e => e.ID)
                         .Distinct()
                         .ToList(); // IDs of the elements that were merged away
Elements = new HashSet<Element>(temp);
foreach (Element e in Elements)
{
    e.Children.RemoveAll(i => duplicates.Contains(i));
    e.Parents.RemoveAll(i => duplicates.Contains(i));
}

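As far as I understand, you only need to group the elements by Name, keep the lowest ID in each group, and join the Children and Parents lists. The first query does exactly that; the loop afterwards removes references to the discarded IDs from the remaining lists.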

Upvotes: 2

jcolebrand

Reputation: 16035

Try adding a single field to Element.

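struct Element
{
  public int ID;
  public string Name;
  public List<int> Children;
  public List<int> Parents;
  public bool duplicate; // the extra field
}

HashSet<Element> Elements = new HashSet<Element>();

// Example Elements
Elements.Add(new Element { ID = 1, Name = "Apple", Children = new List<int> { 10, 11, 12 }, Parents = new List<int> { 13, 14, 15 }, duplicate = false });
Elements.Add(new Element { ID = 2, Name = "Banana", Children = new List<int> { 20, 21, 22 }, Parents = new List<int> { 23, 24, 25 }, duplicate = false });
Elements.Add(new Element { ID = 3, Name = "Apple", Children = new List<int> { 30, 31, 32 }, Parents = new List<int> { 33, 34, 35 }, duplicate = false });
Elements.Add(new Element { ID = 4, Name = "Food", Children = new List<int> { 1, 2, 3 }, Parents = new List<int>(), duplicate = false });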

As you iterate over your copy, set "duplicate" to true. Alternatively, add a "deleted" field so you don't reprocess entries. The point is to add one more field. You can always copy the element and create a new one when adding.

To add to Sina's comments earlier, you could have a key like this:

class ElementKey {
  int ID;
  string Name;
}

class Element {
  ElementKey Key;
  List<int> Children;
  List<int> Parents;
  ProcessFlagSet flags;
}

class ProcessFlagSet {
  bool Processed;
  bool Duplicate;
}

Dictionary<ElementKey,Element> ...

You can then remove the members of ProcessFlagSet later when you refactor. If you no longer need them, any code that still uses them will break compilation until it is removed, which makes the leftovers easy to find.

Lastly, I recommend creating your own Add method here. Pass in the element to be added, then check whether its key already exists at that point. This saves you the separate de-duplication step.
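A minimal sketch of such an Add method, under some assumptions of mine: Element is a class with public fields (not the struct from the question), Name is the key that decides duplication, and the first element added under a Name is the one that survives. ElementSet, Resolve and Remap are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type; a class so the merged lists are shared by reference.
public class Element
{
    public int ID;
    public string Name;
    public List<int> Children = new List<int>();
    public List<int> Parents = new List<int>();
}

// Hypothetical wrapper that merges on Add instead of de-duplicating afterwards.
public class ElementSet
{
    private readonly Dictionary<string, Element> byName = new Dictionary<string, Element>();
    // Maps each absorbed ID to the ID of the element that absorbed it.
    private readonly Dictionary<int, int> replacedBy = new Dictionary<int, int>();

    public void Add(Element e)
    {
        Element existing;
        if (byName.TryGetValue(e.Name, out existing))
        {
            // Name already present: fold the new lists into the first element seen.
            existing.Children.AddRange(e.Children.Except(existing.Children).ToList());
            existing.Parents.AddRange(e.Parents.Except(existing.Parents).ToList());
            replacedBy[e.ID] = existing.ID;
        }
        else
        {
            byName.Add(e.Name, e);
        }
    }

    // Rewrites references to absorbed IDs and returns the survivors ordered by ID.
    public List<Element> Resolve()
    {
        foreach (Element e in byName.Values)
        {
            e.Children = Remap(e.Children);
            e.Parents = Remap(e.Parents);
        }
        return byName.Values.OrderBy(e => e.ID).ToList();
    }

    private List<int> Remap(List<int> ids)
    {
        return ids.Select(id => replacedBy.ContainsKey(id) ? replacedBy[id] : id)
                  .Distinct()
                  .ToList();
    }
}
```

Resolve is called once after all elements have been added, because an element such as "Food" may refer to an ID that is only absorbed later. If the lowest ID must win regardless of insertion order, the Add method would need to compare IDs before deciding which element to keep.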

Upvotes: 2
