user
user

Reputation: 101

Improve performance of getting distinct items

How can I convert below code to LINQ. Lists can sometimes contains 20k or 30k items. So I'm looking for something that improves performance and runs faster. Below is my code:

     if(list1 != null)
     {
       foreach (var item in list1)
       {
        if(!list2.Any( x => x.Name == item.Name && x.Number == item.Number))
        {
          list2.Add(item)
        }
       }
     }

I tried using Parallel.ForEach but it throws a "Collection was modified" error.

Upvotes: 0

Views: 2568

Answers (5)

SalgoMato
SalgoMato

Reputation: 319

I recently optimized Distinct call on list of immutable items which has the same types used as key - a string and an integer. I overloaded object.GetHashCode and precomputed hash code and stored into private field durcing construction. GetHashCode overload just returned value from this private member.

This modification made Distinct call maybe few hundread times faster because internally Distinct iterator uses its internal HashSet implementation that does lot of calls to GetHashCode.

Note: if you overload GetHashCode, you'll need to overload object.Equals too so that Distinct internal iterator uses it.

Remember that you cannot use this kind of hash code caching if your members can change during item object lifetime or you'll need to update cached hash code value each time any of fields used as key (i.e. used in Equals) changes.

Upvotes: 0

Andrew Morton
Andrew Morton

Reputation: 25013

You can use the LINQ Distinct method. It needs an IEqualityComparer set up, but luckily the MSDN example has just what you need already written:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static Random rand = new Random();

        // see https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.distinct for Distinct()
        public class Product
        {
            public string Name { get; set; }
            public int Number { get; set; }
        }

        // Custom comparer for the Product class
        class ProductComparer : IEqualityComparer<Product>
        {
            // Products are equal if their names and product numbers are equal.
            public bool Equals(Product x, Product y)
            {

                //Check whether the compared objects reference the same data.
                if (Object.ReferenceEquals(x, y)) return true;

                //Check whether any of the compared objects is null.
                if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
                    return false;

                //Check whether the products' properties are equal.
                return x.Number == y.Number && x.Name == y.Name;
            }

            // If Equals() returns true for a pair of objects 
            // then GetHashCode() must return the same value for these objects.

            public int GetHashCode(Product product)
            {
                //Check whether the object is null
                if (Object.ReferenceEquals(product, null)) return 0;

                //Get hash code for the Name field if it is not null.
                int hashProductName = product.Name == null ? 0 : product.Name.GetHashCode();

                //Get hash code for the Code field.
                int hashProductCode = product.Number.GetHashCode();

                //Calculate the hash code for the product.
                return hashProductName ^ hashProductCode;
            }

        }

        static string RandomLetter()
        {
            return (rand.Next((int)'A', (int)'Z' + 1)).ToString();
        }

        static List<Product> CreateTestData()
        {
            int nItems = 20000;
            List<Product> data = new List<Product>(nItems);
            for (int i = 1; i <= nItems; i++)
            {
                data.Add(new Product { Name = RandomLetter() + RandomLetter(), Number = i % 10 });
            }

            return data;
        }

        static void Main(string[] args)
        {
            var list1 = CreateTestData();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            List<Product> noduplicates = list1.Distinct(new ProductComparer()).ToList();
            sw.Stop();
            Console.WriteLine($"x items: {list1.Count()} no duplicates: {noduplicates.Count()} Time: {sw.ElapsedMilliseconds} ms");

            List<Product> list2 = new List<Product>();
            if (list1 != null)
            {
                sw.Restart();
                foreach (var item in list1)
                {
                    if (!list2.Any(x => x.Name == item.Name && x.Number == item.Number))
                    {
                        list2.Add(item);
                    }
                }
                sw.Stop();
                Console.WriteLine($"x items: {list1.Count()} list2: {noduplicates.Count()} Time: {sw.ElapsedMilliseconds} ms");
            }

            Console.ReadLine();

        }
    }
}

Sample output:

x items: 20000 no duplicates: 6393 Time: 12 ms
x items: 20000 list2: 6393 Time: 4225 ms

If you already had some data, you could use the Union method instead, again using the comparer.

N.B. My RandomLetter() function does not do what I intended. But it suffices.

Upvotes: 3

Matthew Whited
Matthew Whited

Reputation: 22433

Group by your distiction and select the first record from the group and create your list.

Fastest if you don't need to provide existing values

var list2 = list1.GroupBy(i => new { i.Name, i.Number })
                 .Select(g=>g.First())
                 .ToList();

Simple if you have existing values (4 times slower than the next version)

if you list2 has preexisting values you can do something like this...

var keys = list2.ToList();
var toadd = list1.GroupBy(i => new { i.Name, i.Number })
                 .Where(g => !keys.Any(i => i.Name == g.Key.Name
                                         && i.Number == g.Key.Number))
                 .Select(g=>g.First());
list2.AddRange(toadd);

Fastest if you need to update a set with existing values

public static HashSet<T> ToHashSet<T>(this IEnumerable<T> items)
{
    return new HashSet<T>(items);
}

var keys = list2.Select(i => new { i.Name, i.Number }).ToHashSet();
var toadd = list1.GroupBy(i => new { i.Name, i.Number })
                 .Where(g => !keys.Contains(g.Key))
                 .Select(g => g.First());
list2.AddRange(toadd);

Upvotes: 0

Ivan Stoev
Ivan Stoev

Reputation: 205559

20 - 30k items are not so much. All you need is to replace the potentially slow linear search

list2.Any(x => x.Name == item.Name && x.Number == item.Number)

with fast lookup data structure.

The easiest is to build a HashSet with anonymous type containing the Name and Number properties. In order to do that, you can use the following handy custom extension method:

public static class Extensions
{
    public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        return new HashSet<T>(source, comparer);
    }
}

and the code in question would be like this:

if (list1 != null)
{
    var keys = list2.Select(item => new { item.Name, item.Number }).ToHashSet();
    foreach (var item in list1)
    {
        var key = new { item.Name, item.Number };
        if (!keys.Contains(key))
        {
            list2.Add(item);
            keys.Add(key);
        }
    }
}

This is not LINQ, but it doesn't need to be, since LINQ is for querying, while your code is for modification.

Upvotes: 1

Paw Ormstrup Madsen
Paw Ormstrup Madsen

Reputation: 733

You can make list2 a ConcurrentBag type and do like this. im not 100% sure that it will work as intended though.

public class Item
{
    public string Name { get; set; }
    public int Number { get; set; }
}
public void test()
{
    var list1 = new List<Item>(); // add items to list1 or maybe load from a database?

    var list2 = new ConcurrentBag<Item>();

    Parallel.ForEach(list1.ToList(), (item, state, arg3) =>
    {
        if (!list2.Any(x => x.Name == item.Name && x.Number == item.Number))
        {
            list2.Add(item);
        }

    });

}

Upvotes: 0

Related Questions