Dustin

Reputation: 101

Fastest way for LINQ to find duplicate Lists?

Given a data structure of:

class TheClass
{
    public int NodeID;
    public double Cost;
    public List<int> NodeIDs;
}

And a List with data:

27 -- 10.0 -- 1, 5, 27
27 -- 10.0 -- 1, 5, 27
27 -- 10.0 -- 1, 5, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
27 -- 10.0 -- 1, 4, 25, 26, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
35 -- 10.0 -- 1, 4, 13, 14, 35

I want to reduce it to the unique NodeIDs lists:

27 -- 10.0 -- 1, 5, 27
27 -- 15.5 -- 1, 4, 13, 14, 27
27 -- 10.0 -- 1, 4, 25, 26, 27
35 -- 10.0 -- 1, 4, 13, 14, 35

Then I'll be summing the Cost column per NodeID (Node 27 total cost: 10.0 + 15.5 + 10.0 = 35.5) -- that part is straightforward.
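For reference, that summing step might look something like this (a minimal sketch, assuming the de-duplicated rows end up in a list called unique, a placeholder name):

var costPerNode = unique
    .GroupBy(n => n.NodeID)
    .Select(g => new { NodeID = g.Key, TotalCost = g.Sum(n => n.Cost) });
// e.g. NodeID 27 -> 10.0 + 15.5 + 10.0 = 35.5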

What is the fastest way to remove the duplicate rows / find uniques?

The production data set will have NodeIDs lists of 100 to 200 IDs, with about 1,500 entries in the List and around 500 of them unique.

I'm 100% focused on speed -- if adding some other data would help, I'm happy to (I've tried hashing the lists into a SHA value, but that turned out slower than my current brute-force exhaustive search).

Upvotes: 0

Views: 97

Answers (2)

Tim Schmelter

Reputation: 460118

If you want to remove duplicate objects according to equal lists, you can create a custom IEqualityComparer<List<T>> for lists and use it with Enumerable.GroupBy. Then you just need to create a new instance of your class for each group and sum up Cost.

Here is a possible implementation:

public class ListEqualityComparer<T> : IEqualityComparer<List<T>>
{
    public bool Equals(List<T> lhs, List<T> rhs)
    {
        // Two lists are equal if they contain the same items in the same order.
        return lhs.SequenceEqual(rhs);
    }

    public int GetHashCode(List<T> list)
    {
        unchecked
        {
            // Combine the hash codes of all items so that equal lists produce equal hashes.
            int hash = 23;
            foreach (T item in list)
            {
                hash = (hash * 31) + (item == null ? 0 : item.GetHashCode());
            }
            return hash;
        }
    }
}

and here is a query that selects one (unique) instance per group:

var nodes = new List<TheClass>(); // fill ....
var uniqueAndSummedNodes = nodes
    .GroupBy(n => n.NodeIDs, new ListEqualityComparer<int>())
    .Select(grp => new TheClass
    {
        NodeID = grp.First().NodeID,  // just use the first, change accordingly
        Cost = grp.Sum(n => n.Cost),
        NodeIDs = grp.Key
    });
nodes = uniqueAndSummedNodes.ToList();

This implementation uses SequenceEqual, which takes the order and the number of occurrences of each number in the list into account.
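For example (not part of the original answer), SequenceEqual is order-sensitive, so { 1, 5, 27 } and { 27, 5, 1 } count as different lists; if order should not matter, the lists would have to be normalized (e.g. sorted) before grouping:

var a = new List<int> { 1, 5, 27 };
var b = new List<int> { 27, 5, 1 };
// a.SequenceEqual(b) is false
// a.OrderBy(x => x).SequenceEqual(b.OrderBy(x => x)) is true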

Edit: I've only just seen that you don't want to sum up each group's Cost but to sum up the Cost across all groups; that's simple:

double totalCost = nodes.Sum(n => n.Cost);

If you don't want to sum up within the group itself, replace

...
Cost = grp.Sum(n => n.Cost),

with

...
Cost = grp.First().Cost, // presumes all Costs in the group are the same

Upvotes: 2

jjaskulowski

Reputation: 2564

.GroupBy(x => string.Join(",", x.NodeIDs)).Select(x => x.First())

For large data sets, that should be faster than Distinct.
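In context, that might look like the following sketch (assuming the same nodes list as in the other answer; the joined string simply acts as the grouping key):

var unique = nodes
    .GroupBy(n => string.Join(",", n.NodeIDs)) // joined IDs act as the key
    .Select(g => g.First())                    // keep one row per distinct NodeIDs list
    .ToList();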

Upvotes: 3
