Doro
Doro

Reputation: 671

Remove duplicate comparing two large lists efficiently in C#?

I Have two big lists, list1 with 4 columns and list2 with with 3 columns. If list1 contains the same value in column 1 and column 3 as the list2, then I need to remove that item in list1. I am actually looking for some advantage and efficiently solution. Thank you for any help.

List1:
1, 5, 3, 9     // Remove this
11, 15, 18, 6  // Keep this  

List2:
1, 5, 3

List<Tuple<int, int, int, int>> list1 = new List<Tuple<int, int, int, int>>();
List<Tuple<int, int, int>> list2 = new List<Tuple<int, int, int>>();

Upvotes: 1

Views: 2159

Answers (2)

Mathew Thompson
Mathew Thompson

Reputation: 56429

Well ideally from a performance perspective you'd be able to leverage HashSet.SymmetricExceptWith, but you're using two types that are different (and Tuples at that).

Except is a possible solution:

list1 = list1.Except(list1
    .Where(l1 => list2
        .Any(l2 => l2.Item1 == l1.Item1
            && l2.Item2 == l1.Item2
            && l2.Item3 == l1.Item3)))
    .ToList();

Upvotes: 3

thewyzard
thewyzard

Reputation: 312

        var index2 = list2.ToLookup(t => Tuple.Create(t.Item1, t.Item3));
        //var index2 = list2.Select(l => Tuple.Create(l.Item1, l.Item3)).ToList();
        //index2.Sort();
        var results = from l in list1
                      where !index2.Contains(Tuple.Create(l.Item1, l.Item3))
                      select l;

This would probably be fairly efficient. The downside is the additional memory usage by index2. An alternate method of indexing that would be a bit easier on the memory is commented out. The ToList version wouldn't store a reference to your original records, so it would be more lightweight. But the ToLookup index could have more uses to you than this particular problem. ToDictionary would also be an option, instead of ToLookup, if each key is unique, but that's a step backwards into the heavyweights.

Additional gains could possibly be obtained with a couple of well placed AsParallel() calls depending on how big these lists truly are.

        var index2 = list2.AsParallel().ToLookup(t => Tuple.Create(t.Item1, t.Item3));
        var results = from l in list1.AsParallel()
                      where !index2.Contains(Tuple.Create(l.Item1, l.Item3))
                      select l;

Experiment with one or the other or both of those as only your environment could tell us whether that is best. Sometimes the expense of splitting work out over multiple threads can take longer than just serially getting the job done.

Upvotes: 1

Related Questions