Reputation: 11
I am aware that the title might be misleading and that the common answer could be "go search again"; however, I have searched Stack Overflow as much as I could and didn't find an answer that satisfies me.
Introduction
I need to create an algorithm that will remove all duplicate items in a nested, cross-referenced list.
General rules:
Memory requirements and CPU usage are not constrained, but high performance is preferred. My current approach (a brute-force method that compares every object with every other one) is too slow and can take up to 1 hour of runtime; the preferred execution time would be up to 1 minute. My approach also re-iterates the whole collection for as long as duplicate objects keep being found, and it uses only a single thread. I would love a multi-threaded, more deterministic approach.
Description
Consider the following object definition:
using System;
using System.Collections.Generic;

public enum ClassType
{
    Type_Base,
    Type_A,
    Type_B,
    Type_C,
};

public class MyObject
{
    public string Name { get; }
    public uint ID { get; }
    public ClassType ClassType { get; }
    public uint ReferencedID { get; }
    public List<MyObject> Children { get; } = new List<MyObject>( );

    // Constructor so the read-only properties can actually be populated
    public MyObject( string name, uint id, ClassType classType, uint referencedID = 0 )
    {
        Name = name;
        ID = id;
        ClassType = classType;
        ReferencedID = referencedID;
    }
}
And collections indexed by the ID property of MyObject, for fast access:
Dictionary<uint, MyObject> Preserved;  // Top-most objects; have to be preserved
Dictionary<uint, MyObject> AllObjects; // Dictionary of all objects
Additionally, consider the following comparison method:
public static bool AreObjectsSame( MyObject l, MyObject r, Dictionary<uint, MyObject> allObjects )
{
    // Null-Check
    if ( l is null ) throw new ArgumentNullException( nameof( l ) );
    if ( r is null ) throw new ArgumentNullException( nameof( r ) );
    if ( allObjects is null ) throw new ArgumentNullException( nameof( allObjects ) );

    // Compare class type and name
    if ( l.ClassType != r.ClassType ) return false;
    if ( l.Name != null && r.Name != null && !l.Name.Equals( r.Name, StringComparison.InvariantCulture ) ) return false;

    // Compare referenced ID objects (0 means "no reference")
    if ( l.ReferencedID != 0 && r.ReferencedID != 0 )
    {
        if ( !AreObjectsSame( allObjects[ l.ReferencedID ], allObjects[ r.ReferencedID ], allObjects ) ) return false;
    }

    // Compare children objects
    if ( l.Children.Count != r.Children.Count ) return false;
    for ( int i = 0; i < l.Children.Count; ++i )
    {
        if ( !AreObjectsSame( l.Children[ i ], r.Children[ i ], allObjects ) ) return false;
    }

    return true;
}
Now let's examine a chain of objects. Objects that have children work in a similar manner, so I will skip them for the sake of simplicity.
MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_G", Type_A) -> MyObject ("Name_D", Type_Base)
As you can see, we have 10 objects placed in the AllObjects dictionary. Each chain ends in a ("Name_D", Type_Base) object, and the first two chains also share ("Name_C", Type_A). Since the ("Name_D", Type_Base) objects are equal (they have the same type and name), the ("Name_C", Type_A) objects are also equal (they have the same type and name, plus their referenced objects are the same).
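For illustration, the tail of these chains can be reproduced with a few objects (a minimal sketch, assuming the constructor shown above; the IDs are arbitrary):

var allObjects = new Dictionary<uint, MyObject>( );
void Add( MyObject o ) => allObjects[ o.ID ] = o;

// Tail of the first chain: ("Name_C", Type_A) -> ("Name_D", Type_Base)
Add( new MyObject( "Name_D", 1, ClassType.Type_Base ) );
Add( new MyObject( "Name_C", 2, ClassType.Type_A, referencedID: 1 ) );

// The second chain ends in its own copies of the same two objects
Add( new MyObject( "Name_D", 3, ClassType.Type_Base ) );
Add( new MyObject( "Name_C", 4, ClassType.Type_A, referencedID: 3 ) );

// True: same name and type, and their referenced objects are equal as well
Console.WriteLine( AreObjectsSame( allObjects[ 2 ], allObjects[ 4 ], allObjects ) );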
Now, we can optimize the above chains of objects into something like this:
MyObject ("Name_A", Type_A) -> MyObject ("Name_B", Type_B) -> MyObject ("Name_C", Type_A) -> MyObject ("Name_D", Type_Base)
MyObject ("Name_E", Type_C) -> MyObject ("Name_F", Type_C) ->-|
MyObject ("Name_G", Type_A) -------------------------------->-/
This leads to removing 3 duplicated nodes and rearranging the reference IDs of two others (Name_F now points at the surviving Name_C, and Name_G at the surviving Name_D).
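The rewiring step itself is straightforward once the duplicates are known. A minimal sketch, assuming a hypothetical canonicalOf map (duplicate ID -> ID of the surviving object) produced by whatever duplicate detection is used (which should never pick an object from Preserved as the one to remove), and assuming ReferencedID is made settable for this pass:

static void RemoveDuplicates( Dictionary<uint, MyObject> allObjects,
                              Dictionary<uint, uint> canonicalOf )
{
    // Resolve an ID through the map (handles chains of replacements)
    uint Resolve( uint id )
    {
        while ( canonicalOf.TryGetValue( id, out uint next ) ) id = next;
        return id;
    }

    foreach ( var obj in allObjects.Values )
    {
        // Redirect the reference to the surviving object
        obj.ReferencedID = Resolve( obj.ReferencedID ); // requires a setter; see note above

        // Replace duplicated children with their surviving counterparts
        for ( int i = 0; i < obj.Children.Count; ++i )
            obj.Children[ i ] = allObjects[ Resolve( obj.Children[ i ].ID ) ];
    }

    // Finally drop the duplicates themselves
    foreach ( uint dupId in canonicalOf.Keys )
        allObjects.Remove( dupId );
}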
Possible Solution
One possible solution would be to precalculate a hash code for each element and compare hashes instead of full objects; only when two hashes match, perform a field-by-field comparison (to rule out hash collisions).
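As a sketch of that idea: because equality is defined bottom-up, the hash can be computed bottom-up as well and memoized per ID, so every object (including shared subtrees) is hashed exactly once. This assumes the reference graph is acyclic, as in the example chains; HashCode here is the System.HashCode struct from .NET Core 2.1+:

// Memoized structural hash mirroring the fields AreObjectsSame inspects.
// Note: AreObjectsSame skips the name check when either name is null,
// so hashing the name is slightly stricter than the comparison itself.
static int StructuralHash( MyObject o,
                           Dictionary<uint, MyObject> allObjects,
                           Dictionary<uint, int> hashCache )
{
    if ( hashCache.TryGetValue( o.ID, out int cached ) ) return cached;

    var h = new HashCode( );
    h.Add( o.ClassType );
    h.Add( o.Name, StringComparer.InvariantCulture );
    if ( o.ReferencedID != 0 )
        h.Add( StructuralHash( allObjects[ o.ReferencedID ], allObjects, hashCache ) );
    foreach ( var child in o.Children )
        h.Add( StructuralHash( child, allObjects, hashCache ) );

    int result = h.ToHashCode( );
    hashCache[ o.ID ] = result;
    return result;
}

Objects can then be grouped by hash, and the full field-by-field comparison only needs to run within groups containing more than one element, which also parallelizes naturally per group.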
Is there any better way of removing duplicates? How do I do it?
Upvotes: 1
Views: 278
Reputation: 708
Some quick suggestions:
First of all, use a profiler tool such as Redgate's Performance Profiler. Many times I have "optimized" an algorithm based on intuition, without any measurable success, while making it much less understandable. After trying a profiler, it turned out that who-would-have-thought-of-that code, e.g. string operations or excessive object instantiations, took much more time than my algorithm! A profiler can point you to the code that really needs optimization :)
Upvotes: 0