Simmy

Reputation: 11

Removing duplicates in a nested, cross-referenced list

I am aware that the title might be misleading and that the common answer may be 'go search again'; however, I have searched Stack Overflow as much as I could and didn't find an answer that satisfied me.

Introduction

I need to create an algorithm that will remove all duplicate items from a nested, cross-referenced list.

General rules:

Memory requirements and CPU occupancy are not specified - high performance is preferred. My current approach (a brute-force method that compares every object with every other one) is too slow and can take up to 1 hour of runtime. The preferred execution time would be up to 1 minute. My current approach also iterates over the collection an unbounded number of times, for as long as matching objects keep being found, and it uses only a single thread. I would love a multi-threaded, more deterministic approach.

Description

Consider the following object definition:

public enum ClassType
{
  Type_Base,
  Type_A,
  Type_B,
  Type_C,
};

public class MyObject
{
  public string Name { get; }
  public uint ID { get; }
  public ClassType ClassType { get; }

  public uint ReferencedID { get; }
  public List<MyObject> Children { get; } = new List<MyObject>( );
}

And collections indexed by the ID property of MyObject, for fast access:

Dictionary<uint, MyObject> Preserved; // Top-most objects; these must be preserved
Dictionary<uint, MyObject> AllObjects; // Dictionary of all objects

Additionally, consider the following comparison method:

public static bool AreObjectsSame( MyObject l, MyObject r, Dictionary<uint, MyObject> allObjects )
{
  // Null-Check
  if ( l is null ) throw new ArgumentNullException( nameof( l ) );
  if ( r is null ) throw new ArgumentNullException( nameof( r ) );
  if ( allObjects is null ) throw new ArgumentNullException( nameof( allObjects ) );

  // Compare class type and name
  if ( l.ClassType != r.ClassType ) return false;
  if ( l.Name != null && r.Name != null && !l.Name.Equals( r.Name, StringComparison.InvariantCulture ) ) return false;
  
  // Compare referenced ID objects
  if ( l.ReferencedID != 0 && r.ReferencedID != 0 )
  {
    if ( !AreObjectsSame( allObjects[ l.ReferencedID ], allObjects[ r.ReferencedID ], allObjects ) ) return false;
  }

  // Compare children objects
  if ( l.Children.Count != r.Children.Count ) return false;
  for ( int i = 0; i < l.Children.Count; ++i )
  {
    if ( !AreObjectsSame( l.Children[ i ], r.Children[ i ], allObjects ) ) return false;
  }

  return true;
}
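
As a side note on the cost of this method: it is called for many pairs and re-walks the same reference chains every time. A minimal sketch of a caching wrapper, assuming IDs are unique; AreObjectsSameCached and its cache are hypothetical additions, not part of the code above, and only the pairs passed to the wrapper are cached (the recursion inside AreObjectsSame itself is unchanged):

// Hypothetical cache of pair results, keyed by both IDs packed into a ulong;
// assumes this lives in the same class as AreObjectsSame
private static readonly Dictionary<ulong, bool> comparisonCache = new Dictionary<ulong, bool>( );

public static bool AreObjectsSameCached( MyObject l, MyObject r, Dictionary<uint, MyObject> allObjects )
{
  if ( l is null ) throw new ArgumentNullException( nameof( l ) );
  if ( r is null ) throw new ArgumentNullException( nameof( r ) );

  // Order the IDs so (A, B) and (B, A) share one cache entry
  uint lo = Math.Min( l.ID, r.ID );
  uint hi = Math.Max( l.ID, r.ID );
  ulong key = ( (ulong) lo << 32 ) | hi;

  if ( comparisonCache.TryGetValue( key, out bool cached ) ) return cached;

  bool result = AreObjectsSame( l, r, allObjects );
  comparisonCache[ key ] = result;
  return result;
}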

Now let's examine a chain of objects. Objects with children work in a similar manner, so I will skip them for the sake of simplicity.

MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_G", Type_A) -> MyObject ("Name_D", Type_Base)

As you can see, we have 10 objects placed in the AllObjects dictionary. In each reference chain we have the objects ("Name_C", Type_A) and ("Name_D", Type_Base). Since the ("Name_D", Type_Base) objects are equal (they have the same type and name), the ("Name_C", Type_A) objects are also equal (they have the same type and name, plus their referenced objects are the same).

Now, we can optimize the above chains of objects into something like this:

MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) ->-|
MyObject ("Name_G", Type_A) -------------------------------->-/ 

This leads us to removing 3 duplicated nodes and rearranging the reference IDs for two of the remaining ones.

Possible Solution

One possible solution would be to precalculate a hash code for each element and use this hash to compare elements in a brute-force manner. If the hash codes match, then perform a field-by-field comparison (to eliminate possible hash-code collisions).
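
For illustration, a minimal sketch of that precalculation, mirroring the fields AreObjectsSame inspects (ComputeStructuralHash is a made-up name; note that AreObjectsSame treats a null Name or a single missing reference as matching anything, so this hash only agrees with it for fully populated objects, and a production version would memoize the result per ID and guard against reference cycles):

// Hypothetical structural hash combining the same fields AreObjectsSame
// compares: class type, name, the referenced object's hash and every child's hash
public static int ComputeStructuralHash( MyObject obj, Dictionary<uint, MyObject> allObjects )
{
  unchecked
  {
    int hash = 17;
    hash = hash * 31 + (int) obj.ClassType;
    hash = hash * 31 + ( obj.Name?.GetHashCode( ) ?? 0 );

    // Follow the reference chain, like the recursive branch of AreObjectsSame
    if ( obj.ReferencedID != 0 )
      hash = hash * 31 + ComputeStructuralHash( allObjects[ obj.ReferencedID ], allObjects );

    foreach ( MyObject child in obj.Children )
      hash = hash * 31 + ComputeStructuralHash( child, allObjects );

    return hash;
  }
}

With this in place, objects can first be grouped by hash, and the expensive full comparison is only needed within each group.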

Is there any better way of removing duplicates? How do I do it?

Upvotes: 1

Views: 278

Answers (1)

cly

Reputation: 708

Some fast suggestions:

  1. Yes, precalculate the hash code and do an in-depth comparison only on items that have the same hash code.
  2. Don't examine B against A after A has already been examined against B: take the items in some order and examine the item at index X only against indices greater than X.
  3. Do not remove items during the comparison; collect the items to be kept in a separate collection and drop the original one at the end.
  4. Use well-indexable collections, like arrays, in an algorithm that goes through all items (the code sketch after this list combines these points).
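
A minimal sketch of the points above, reusing the question's AreObjectsSame and assuming some precalculated-hash provider getHash (a stand-in for whatever hash you compute; GroupBy needs using System.Linq;):

// Hypothetical sketch: group objects by a precalculated hash, deep-compare
// only within a group and only item X against items after X, collecting
// survivors separately instead of mutating the source collection
public static List<MyObject> Deduplicate(
    Dictionary<uint, MyObject> allObjects,
    Func<MyObject, int> getHash )  // assumed precalculated-hash provider
{
  var survivors = new List<MyObject>( );

  // Point 1: only items sharing a hash code are candidates for deep comparison
  foreach ( var bucket in allObjects.Values.GroupBy( getHash ) )
  {
    MyObject[] items = bucket.ToArray( );  // point 4: a well-indexable array
    bool[] removed = new bool[ items.Length ];

    for ( int x = 0; x < items.Length; ++x )
    {
      if ( removed[ x ] ) continue;
      survivors.Add( items[ x ] );  // point 3: collect survivors, don't remove in place

      // Point 2: compare the item at index X only against indices greater than X
      for ( int y = x + 1; y < items.Length; ++y )
      {
        if ( !removed[ y ] && AreObjectsSame( items[ x ], items[ y ], allObjects ) )
          removed[ y ] = true;  // duplicate of items[ x ], drop it
      }
    }
  }

  return survivors;
}

Re-pointing ReferencedID values at the surviving copy, and making sure everything from Preserved stays among the survivors, is deliberately left out of the sketch.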

But first of all, use a profiler tool like Redgate's Performance Profiler. A lot of times I have been "optimizing" an algorithm using my intuition, without any measurable success, but making it much less understandable. After trying a profiler it turned out that some who-the-hell-would-think-of-that code, e.g. string operations or overused object instantiations, took much more time than my algorithm! The profiler can point you to the code which really needs optimization :)

Upvotes: 0
