JSK

Reputation: 583

Peculiar results while comparing Parallel.ForEach with foreach

I have been trying to optimize the performance of my WPF application, in which almost 40K contacts are downloaded from the server and a complete hierarchical tree is built in the application. The user can only search within the tree in the view. The existing code uses lots of foreach loops and takes a lot of time (5-6 seconds just to create the view), so I thought of modifying the code, and the first thing I did was replace the foreach loop with Parallel.ForEach. I created a very simple sample application to test the difference between the two.

Parallel.ForEach

      Parallel.ForEach(lstUsers, item =>
      {
          Console.WriteLine(item.ToString());
      });

Normal foreach

      foreach (var item in lstUsers)
      {
          Console.WriteLine(item.ToString());
      }

I ran the application against 1000 contacts and to my surprise the following were the results:

Parallel.ForEach -> 210 milliseconds

foreach -> 104 milliseconds

I thought Parallel.ForEach would take much less time than the foreach loop, but I am seeing just the opposite. It looks like I was totally wrong.
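
For anyone wanting to reproduce this, a minimal harness along the following lines (a sketch only; the placeholder strings stand in for the real contacts) shows the comparison using System.Diagnostics.Stopwatch:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class LoopBenchmark
{
    static void Main()
    {
        // Placeholder data standing in for the downloaded contacts.
        List<string> lstUsers = Enumerable.Range(0, 1000)
                                          .Select(i => "Contact " + i)
                                          .ToList();

        var sw = Stopwatch.StartNew();
        foreach (var item in lstUsers)
        {
            Console.WriteLine(item.ToString());
        }
        sw.Stop();
        Console.WriteLine("foreach: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        Parallel.ForEach(lstUsers, item =>
        {
            Console.WriteLine(item.ToString());
        });
        sw.Stop();
        Console.WriteLine("Parallel.ForEach: {0} ms", sw.ElapsedMilliseconds);
    }
}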

Upvotes: 0

Views: 603

Answers (2)

Scott Chamberlain

Reputation: 127543

I highly recommend you download and read the free Microsoft book "Patterns of Parallel Programming". What you are doing is an example they cover in the "Anti-patterns" section (patterns people tend to use but that are bad):

VERY SMALL LOOP BODIES

As previously mentioned, the Parallel class is implemented in a manner so as to provide for quality load balancing while incurring as little overhead as possible. There is still overhead, though. The overhead incurred by Parallel.For is largely centered around two costs:

  1. Delegate invocations. If you squint at previous examples of Parallel.For, a call to Parallel.For looks a lot like a C# for loop or a Visual Basic For loop. Don’t be fooled: it’s still a method call. One consequence of this is that the “body” of the Parallel.For “loop” is supplied to the method call as a delegate. Invoking a delegate incurs approximately the same amount of cost as a virtual method call.
  2. Synchronization between threads for load balancing. While these costs are minimized as much as possible, any amount of load balancing will incur some cost, and the more load balancing employed, the more synchronization is necessary.

For medium to large loop bodies, these costs are largely negligible. But as the size of the loop’s body decreases, the overheads become more noticeable. And for very small bodies, the loop can be completely dominated by this overhead’s cost. To support parallelization of very small loop bodies requires addressing both #1 and #2 above.

One pattern for this involves chunking the input into ranges, and then instead of replacing a sequential loop with a parallel loop, wrapping the sequential loop with a parallel loop.

The System.Collections.Concurrent.Partitioner class provides a Create method overload that accepts an integral range and returns an OrderablePartitioner<Tuple<Int32,Int32>> (a variant for Int64 instead of Int32 is also available):

public static OrderablePartitioner<Tuple<long, long>> Create(
    long fromInclusive, long toExclusive);

Overloads of Parallel.ForEach accept instances of Partitioner<T> and OrderablePartitioner<T> as sources, allowing you to pass the result of a call to Partitioner.Create into a call to Parallel.ForEach. For now, think of both Partitioner<T> and OrderablePartitioner<T> as an IEnumerable<T>.

The Tuple<Int32,Int32> represents a range from an inclusive value to an exclusive value. Consider the following sequential loop:

for (int i = from; i < to; i++)
{
    // ... Process i.
}

We could use a Parallel.For to parallelize it as follows:

Parallel.For(from, to, i =>
{
    // ... Process i.
});

Or, we could use Parallel.ForEach with a call to Partitioner.Create, wrapping a sequential loop over the range provided in the Tuple<Int32, Int32>, where the inclusiveLowerBound is represented by the tuple’s Item1 and where the exclusiveUpperBound is represented by the tuple’s Item2:

Parallel.ForEach(Partitioner.Create(from, to), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... process i
    }
});

While more complex, this affords us the ability to process very small loop bodies by eschewing some of the aforementioned costs. Rather than invoking a delegate for each body invocation, we’re now amortizing the cost of the delegate invocation across all elements in the chunked range. Additionally, as far as the parallel loop is concerned, there are only a few elements to be processed: each range, rather than each index. This implicitly decreases the cost of synchronization because there are fewer elements to load-balance.

While Parallel.For should be considered the best option for parallelizing for loops, if performance measurements show that speedups are not being achieved or that they’re smaller than expected, you can try an approach like the one shown using Parallel.ForEach in conjunction with Partitioner.Create.
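
Applied to your example, a sketch of that approach (assuming lstUsers is an indexable list such as List<string>) might look like this:

// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
// Chunk the index range so each delegate invocation handles many items,
// amortizing the per-delegate and load-balancing costs described above.
Parallel.ForEach(Partitioner.Create(0, lstUsers.Count), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        Console.WriteLine(lstUsers[i].ToString());
    }
});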

I also consider writing to the Console to be a form of I/O, so I would recommend reading the anti-patterns section "PARALLEL LOOPS FOR I/O-BOUND WORKLOADS IN SCALABLE APPLICATIONS" as well.

Upvotes: 1

pm100

Reputation: 50110

You are demonstrating that this workload (Console.WriteLine) is not a good candidate for parallelization. The overhead (which is not trivial) of running it on multiple threads is not won back by the time saved by running in parallel.
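
To see the flip side, here is a sketch (the ExpensiveWork body and its iteration count are made up purely for illustration) where each item does enough CPU work that the parallel version can pull ahead:

using System;
using System.Threading.Tasks;

class CpuBoundDemo
{
    // Hypothetical CPU-bound body: enough work per item that the
    // threading overhead can be won back.
    static double ExpensiveWork(int n)
    {
        double acc = 0;
        for (int i = 1; i < 100000; i++)
            acc += Math.Sqrt(n + i);
        return acc;
    }

    static void Main()
    {
        double[] results = new double[1000];

        // With a body this heavy, Parallel.For can spread the work
        // across cores instead of serializing on a shared resource
        // (unlike Console.WriteLine, which synchronizes internally).
        Parallel.For(0, results.Length, i =>
        {
            results[i] = ExpensiveWork(i);
        });
    }
}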

Upvotes: 1
