Harshit
Harshit

Reputation: 159

Splitting collection into equal batches in Parallel using c#

I am trying to split collection into equal number of batches.below is the code.

   public static List<List<T>> SplitIntoBatches<T>(List<T> collection, int size)
    {
        var chunks = new List<List<T>>();
        var count = 0;
        var temp = new List<T>();

        foreach (var element in collection)
        {
            if (count++ == size)
            {
                chunks.Add(temp);
                temp = new List<T>();
                count = 1;
            }
            temp.Add(element);
        }

        chunks.Add(temp);

        return chunks;
    }

can we do it using Parallel.ForEach() for better performance as we have around 1 Million items in the list?

Thanks!

Upvotes: 2

Views: 1276

Answers (1)

Marc Gravell
Marc Gravell

Reputation: 1062520

If performance is the concern, my thoughts (in increasing order of impact):

  • right-sizing the lists when you create them would save a lot of work, i.e. figure out the output batch sizes before you start copying, i.e. temp = new List<T>(thisChunkSize)
  • working with arrays would be more effective than working with lists - new T[thisChunkSize]
  • especially if you use BlockCopy (or CopyTo, which uses that internally) rather than copying individual elements one by one
  • once you've calculated the offsets for each of the chunks, the individual block-copies could probably be executed in parallel, but I wouldn't assume it will be faster - memory bandwidth will be the limiting factor at that point
  • but the ultimate fix is: don't copy the data at all, but instead just create ranges over the existing data; for example, if using arrays, ArraySegment<T> would help; if you're open to using newer .NET features, this is a perfect fit for Memory<T>/Span<T> - creating memory/span ranges over an existing array is essentially free and instant - i.e. take a T[] and return List<Memory<T>> or similar.

Even if you can't switch to ArraySegment<T> / Memory<T> etc, returning something like that could still be used - i.e. List<ListSegment<T>> where ListSegment<T> is something like:

readonly struct ListSegment<T> { // like ArraySegment<T>, but for List<T>
    public List<T> List {get;}
    public int Offset {get;}
    public int Count {get;}
}

and have your code work with ListSegment<T> by processing the Offset and Count appropriately.

Upvotes: 4

Related Questions