Simmy

Reputation: 11

Removing duplicates in a nested, cross-referenced list

I am aware that the title might be misleading and that the common answer may be 'go search again'; however, I have searched Stack Overflow as much as I could and didn't find an answer that satisfied me.

Introduction

I need to create an algorithm that will remove all duplicate items from a nested, cross-referenced list.

General rules:

Memory requirements and CPU occupancy are not specified - high performance is preferred. My current approach (a brute-force method that compares every object with every other one) is too slow and can take up to 1 hour of runtime. The preferred execution time would be up to 1 minute. My current approach also iterates over the collection an unbounded number of times, for as long as matching objects keep being found, and it uses only a single thread. I would love a multi-threaded, more deterministic approach.

Description

Consider the following object definition:

public enum ClassType
{
  Type_Base,
  Type_A,
  Type_B,
  Type_C,
};

public class MyObject
{
  public string Name { get; }
  public uint ID { get; }
  public ClassType ClassType { get; }

  public uint ReferencedID { get; }
  public List<MyObject> Children { get; } = new List<MyObject>( );
}

And collections indexed by the ID property of MyObject, for fast access:

Dictionary<uint, MyObject> Preserved; // Top-most objects; these must be preserved
Dictionary<uint, MyObject> AllObjects; // Dictionary of all objects

Additionally, consider the following comparison method:

public static bool AreObjectsSame( MyObject l, MyObject r, Dictionary<uint, MyObject> allObjects )
{
  // Null-Check
  if ( l is null ) throw new ArgumentNullException( nameof( l ) );
  if ( r is null ) throw new ArgumentNullException( nameof( r ) );
  if ( allObjects is null ) throw new ArgumentNullException( nameof( allObjects ) );

  // Compare class type and name
  if ( l.ClassType != r.ClassType ) return false;
  if ( l.Name != null && r.Name != null && !l.Name.Equals( r.Name, StringComparison.InvariantCulture ) ) return false;
  
  // Compare referenced ID objects
  if ( l.ReferencedID != 0 && r.ReferencedID != 0 )
  {
    if ( !AreObjectsSame( allObjects[ l.ReferencedID ], allObjects[ r.ReferencedID ], allObjects ) ) return false;
  }

  // Compare children objects
  if ( l.Children.Count != r.Children.Count ) return false;
  for ( int i = 0; i < l.Children.Count; ++i )
  {
    if ( !AreObjectsSame( l.Children[ i ], r.Children[ i ], allObjects ) ) return false;
  }

  return true;
}
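
As a side note on the cost of this method: it is called for many pairs and re-walks the same reference chains every time. A minimal sketch of a caching wrapper, assuming IDs are unique; AreObjectsSameCached and its cache are hypothetical additions, not part of the code above, and only the pairs passed to the wrapper are cached (the recursion inside AreObjectsSame itself is unchanged):

// Hypothetical cache of pair results, keyed by both IDs packed into a ulong;
// assumes this lives in the same class as AreObjectsSame
private static readonly Dictionary<ulong, bool> comparisonCache = new Dictionary<ulong, bool>( );

public static bool AreObjectsSameCached( MyObject l, MyObject r, Dictionary<uint, MyObject> allObjects )
{
  if ( l is null ) throw new ArgumentNullException( nameof( l ) );
  if ( r is null ) throw new ArgumentNullException( nameof( r ) );

  // Order the IDs so (A, B) and (B, A) share one cache entry
  uint lo = Math.Min( l.ID, r.ID );
  uint hi = Math.Max( l.ID, r.ID );
  ulong key = ( (ulong) lo << 32 ) | hi;

  if ( comparisonCache.TryGetValue( key, out bool cached ) ) return cached;

  bool result = AreObjectsSame( l, r, allObjects );
  comparisonCache[ key ] = result;
  return result;
}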

Now let's examine a chain of objects. Objects with children work in a similar manner, so I will skip them for the sake of simplicity.

MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_G", Type_A) -> MyObject ("Name_D", Type_Base)

As you can see, we have 10 objects placed in the AllObjects dictionary. In each reference chain we have the objects ("Name_C", Type_A) and ("Name_D", Type_Base). Since the ("Name_D", Type_Base) objects are equal (they have the same type and name), the ("Name_C", Type_A) objects are also equal (they have the same type and name, plus their referenced objects are the same).

Now, we can optimize the above chains of objects into something like this:

MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) ->-|
MyObject ("Name_G", Type_A) -------------------------------->-/ 

This leads us to removing 3 duplicated nodes and rearranging the reference IDs for two of the remaining ones.

Possible Solution

One possible solution would be to precalculate a hash code for each element and use this hash to compare elements in a brute-force manner. If the hash codes match, then perform a field-by-field comparison (to eliminate possible hash-code collisions).
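
For illustration, a minimal sketch of that precalculation, mirroring the fields AreObjectsSame inspects (ComputeStructuralHash is a made-up name; note that AreObjectsSame treats a null Name or a single missing reference as matching anything, so this hash only agrees with it for fully populated objects, and a production version would memoize the result per ID and guard against reference cycles):

// Hypothetical structural hash combining the same fields AreObjectsSame
// compares: class type, name, the referenced object's hash and every child's hash
public static int ComputeStructuralHash( MyObject obj, Dictionary<uint, MyObject> allObjects )
{
  unchecked
  {
    int hash = 17;
    hash = hash * 31 + (int) obj.ClassType;
    hash = hash * 31 + ( obj.Name?.GetHashCode( ) ?? 0 );

    // Follow the reference chain, like the recursive branch of AreObjectsSame
    if ( obj.ReferencedID != 0 )
      hash = hash * 31 + ComputeStructuralHash( allObjects[ obj.ReferencedID ], allObjects );

    foreach ( MyObject child in obj.Children )
      hash = hash * 31 + ComputeStructuralHash( child, allObjects );

    return hash;
  }
}

With this in place, objects can first be grouped by hash, and the expensive full comparison is only needed within each group.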

Is there any better way of removing duplicates? How do I do it?

Upvotes: 1

Views: 278

Answers (1)

cly

Reputation: 708

Some fast suggestions:

  1. Yes, precalculate the hash code and do an in-depth comparison only on items that have the same hash code.
  2. Don't examine B against A after A has already been examined against B: take the items in some order and examine the item at index X only against indices greater than X.
  3. Do not remove items during the comparison; collect the items to be kept in a separate collection and drop the original one at the end.
  4. Use well-indexable collections, like arrays, in an algorithm that goes through all items (the code sketch after this list combines these points).
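
A minimal sketch of the points above, reusing the question's AreObjectsSame and assuming some precalculated-hash provider getHash (a stand-in for whatever hash you compute; GroupBy needs using System.Linq;):

// Hypothetical sketch: group objects by a precalculated hash, deep-compare
// only within a group and only item X against items after X, collecting
// survivors separately instead of mutating the source collection
public static List<MyObject> Deduplicate(
    Dictionary<uint, MyObject> allObjects,
    Func<MyObject, int> getHash )  // assumed precalculated-hash provider
{
  var survivors = new List<MyObject>( );

  // Point 1: only items sharing a hash code are candidates for deep comparison
  foreach ( var bucket in allObjects.Values.GroupBy( getHash ) )
  {
    MyObject[] items = bucket.ToArray( );  // point 4: a well-indexable array
    bool[] removed = new bool[ items.Length ];

    for ( int x = 0; x < items.Length; ++x )
    {
      if ( removed[ x ] ) continue;
      survivors.Add( items[ x ] );  // point 3: collect survivors, don't remove in place

      // Point 2: compare the item at index X only against indices greater than X
      for ( int y = x + 1; y < items.Length; ++y )
      {
        if ( !removed[ y ] && AreObjectsSame( items[ x ], items[ y ], allObjects ) )
          removed[ y ] = true;  // duplicate of items[ x ], drop it
      }
    }
  }

  return survivors;
}

Re-pointing ReferencedID values at the surviving copy, and making sure everything from Preserved stays among the survivors, is deliberately left out of the sketch.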

But first of all, use a profiler tool like Redgate's Performance Profiler. A lot of times I have been "optimizing" an algorithm using my intuition, without any measurable success, but making it much less understandable. After trying a profiler it turned out that some who-the-hell-would-think-of-that code, e.g. string operations or overused object instantiations, took much more time than my algorithm! The profiler can point you to the code which really needs optimization :)

Upvotes: 0
