Jesse Williams

Reputation: 662

Performance of Linq query on large data set

I'm running a method to transactionalize data stored in a ConcurrentQueue<T>. In CPU performance profiling, the major hit appears to be:

foreach (Item inSequence in items.Where(w => w.SequenceNumber == i.SequenceNumber && w.Device == i.Device)) {}

With 1,000 and 10,000 items it is actually quite quick. At 100,000 items the performance becomes critical - that specific Linq query goes from taking about 4.5% of the total runtime CPU to over 58%. I'm assuming the performance hit is due to the size of the ConcurrentQueue, but I'm not sure what to do about it. If avoiding a Linq query resolved the issue, that would be fine; I'm just stuck on what to do. Is there some other concurrent type that would be more performant?

It's a CQ because the data is built and read asynchronously. However, during this particular method, which happens after the data is built and prior to it being read back out, it's running on a single thread.

Very loose sample is here: https://dotnetfiddle.net/hjDOva

using System;
using System.Diagnostics;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    static int count = 100000;

    public static void Main()
    {
        var items = new ConcurrentQueue<Item>();
        var r = new Random();
        for (int i = 0; i < count; i++)
        {
            items.Enqueue(new Item());
        }

        var sw = Stopwatch.StartNew();
        foreach (Item i in items.DistinctBy(d => new { d.SequenceNumber, d.Device }))
            foreach (Item inSequence in items.Where(w => w.Device == i.Device && w.SequenceNumber == i.SequenceNumber))
            {

            }

        Console.WriteLine(sw.Elapsed);
    }
}

public static class Extensions
{
    public static IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach (TSource element in source)
        {
            if (seenKeys.Add(keySelector(element)))
            {
                yield return element;
            }
        }
    }
}

public class Item
{
    #region Fields
    protected bool fixDates;
    protected string randomSerial;
    protected decimal amount;
    protected string device;
    protected DateTime depositTime;
    public int SequenceNumber = -1;
    [NonSerialized()]
    protected System.Random rnd = new Random(Int32.Parse(Guid.NewGuid().ToString().Substring(0, 8), System.Globalization.NumberStyles.HexNumber));
    #endregion

    #region Properties
    public bool FixDates
    {
        get
        {
            return this.fixDates;
        }

        set
        {
            this.fixDates = value;
        }
    }

    public string Amount
    {
        get
        {
            return this.amount.ToString();
        }

        set
        {
            this.amount = Convert.ToDecimal(value);
        }
    }

    public string RandomSerial
    {
        get { return randomSerial; }
        set { randomSerial = value; }
    }

    public string Device
    {
        get { return this.device; }
        set { this.device = value; }
    }

    public DateTime DepositTime
    {
        get { return this.depositTime; }
        set { this.depositTime = value; }
    }
    #endregion

    #region Constructors
    public Item()
    {
        fixDates = false;
        RandomSerial = Guid.NewGuid().ToString().Substring(0, 8);
        this.amount = 5.00m;
        this.device = "IC" + rnd.Next(6).ToString();
        this.depositTime = DateTime.Now;
        this.SequenceNumber = rnd.Next(10);
    }
    #endregion
}

However, dotnetfiddle doesn't offer the memory required for 100,000 items.

As to questions about using a CQ: yes, I understand that queues are not ideal for this. The tool generates data to test imports for various product types, and only a single product type requires the method where this code lives, Transactionalize(). Most of the time this code is not used.

It's a CQ because the system creates the objects in parallel (this was a significant performance improvement when it happened) and in most cases they are dequeued in parallel as well.
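Since the queue is only touched by one thread during this method, one option (a sketch, not the asker's actual Transactionalize() code - the processing body is assumed) is to snapshot the queue once with ToArray() and bucket the items in a single pass with a plain Dictionary, instead of re-scanning the queue for every distinct key:

```csharp
// Sketch: the single-threaded section takes one snapshot of the queue
// (ConcurrentQueue<T>.ToArray() is thread-safe) and groups in one pass.
Item[] snapshot = items.ToArray();

var buckets = new Dictionary<(int SequenceNumber, string Device), List<Item>>();
foreach (Item item in snapshot)
{
    var key = (item.SequenceNumber, item.Device);
    if (!buckets.TryGetValue(key, out List<Item> bucket))
        buckets[key] = bucket = new List<Item>();
    bucket.Add(item);
}

foreach (var pair in buckets)
    foreach (Item inSequence in pair.Value)
    {
        // process each item of this (SequenceNumber, Device) group
    }
```

This replaces the nested scans (one full enumeration of the queue per distinct key) with a single enumeration plus O(1) dictionary lookups.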

Upvotes: 0

Views: 1841

Answers (1)

Theodor Zoulias

Reputation: 43996

Assuming that the intention of the code below is to process the items in groups, with each group having the same SequenceNumber and Device,

foreach (Item i in items.DistinctBy(d => new { d.SequenceNumber, d.Device }))
    foreach (Item inSequence in items
        .Where(w => w.Device == i.Device && w.SequenceNumber == i.SequenceNumber))
    {

    }

...you can achieve the same thing much more efficiently by using the Linq method GroupBy like this:

var groups = items.GroupBy(i => (i.SequenceNumber, i.Device));
foreach (IGrouping<(int, string), Item> group in groups)
    foreach (Item inSequence in group)
    {

    }

Notice that instead of anonymous types I used the more lightweight ValueTuples as keys; unlike anonymous types, value tuples are structs, so the keys don't require garbage collection.

If you also want to be able to search for a specific group later, again very efficiently, use the similar ToLookup instead of GroupBy.
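A minimal sketch of the ToLookup variant (the key type follows the Item class above; the sample device string is just an illustration):

```csharp
// ToLookup eagerly builds an indexed, immutable lookup over the snapshot.
ILookup<(int, string), Item> lookup =
    items.ToLookup(i => (i.SequenceNumber, i.Device));

// Enumerate every group, exactly as with GroupBy:
foreach (IGrouping<(int, string), Item> group in lookup)
{
    // process group
}

// ...or fetch a single group later without rescanning the source.
// An unknown key yields an empty sequence rather than throwing.
IEnumerable<Item> oneGroup = lookup[(5, "IC3")];
```

The difference from GroupBy is that GroupBy is lazy and meant to be enumerated once, while ToLookup materializes the index up front so repeated key lookups are cheap.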

Upvotes: 2
