Reputation: 583
I have been trying to optimize the performance of my WPF application, where almost 40K contacts are downloaded from the server and the complete hierarchical tree is created in the application. The user can only search the tree in the view. The existing code uses lots of foreach loops, and that takes a lot of time (5-6 seconds just for creating the view), so I thought of modifying the code. The first thing I did was use Parallel.ForEach to replace the foreach loop. I created a very simple sample application to test the difference between the two.
Parallel.ForEach
Parallel.ForEach(lstUsers, item =>
{
    Console.WriteLine(item.ToString());
});
Normal foreach
foreach (var item in lstUsers)
{
    Console.WriteLine(item.ToString());
}
I ran the application against 1000 contacts, and to my surprise these were the results:
Parallel.ForEach -> 210 milliseconds
foreach -> 104 milliseconds
I thought Parallel.ForEach should take much less time than the foreach loop, but I am seeing just the opposite. It looks like I was totally wrong.
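A minimal harness along these lines reproduces the comparison (a sketch; lstUsers here is just a list of strings standing in for the contacts):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Stand-in data for the downloaded contacts.
        List<string> lstUsers = Enumerable.Range(0, 1000)
                                          .Select(i => "Contact " + i)
                                          .ToList();

        var sw = Stopwatch.StartNew();
        Parallel.ForEach(lstUsers, item =>
        {
            Console.WriteLine(item.ToString());
        });
        sw.Stop();
        Console.WriteLine("Parallel.ForEach: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        foreach (var item in lstUsers)
        {
            Console.WriteLine(item.ToString());
        }
        sw.Stop();
        Console.WriteLine("foreach: {0} ms", sw.ElapsedMilliseconds);
    }
}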
Upvotes: 0
Views: 603
Reputation: 127543
I highly recommend you download and read the free Microsoft book "Patterns of Parallel Programming". What you are doing is an example they cover in the "Anti-patterns" section (patterns people tend to use but that are bad):
VERY SMALL LOOP BODIES
As previously mentioned, the Parallel class is implemented in a manner so as to provide for quality load balancing while incurring as little overhead as possible. There is still overhead, though. The overhead incurred by Parallel.For is largely centered around two costs:
1. Delegate invocations. If you squint at previous examples of Parallel.For, a call to Parallel.For looks a lot like a C# for loop or a Visual Basic For loop. Don't be fooled: it's still a method call. One consequence of this is that the "body" of the Parallel.For "loop" is supplied to the method call as a delegate. Invoking a delegate incurs approximately the same amount of cost as a virtual method call.
2. Synchronization between threads for load balancing. While these costs are minimized as much as possible, any amount of load balancing will incur some cost, and the more load balancing employed, the more synchronization is necessary.
For medium to large loop bodies, these costs are largely negligible. But as the size of the loop’s body decreases, the overheads become more noticeable. And for very small bodies, the loop can be completely dominated by this overhead’s cost. To support parallelization of very small loop bodies requires addressing both #1 and #2 above.
One pattern for this involves chunking the input into ranges, and then instead of replacing a sequential loop with a parallel loop, wrapping the sequential loop with a parallel loop.
The System.Collections.Concurrent.Partitioner class provides a Create method overload that accepts an integral range and returns an OrderablePartitioner<Tuple<Int32,Int32>> (a variant for Int64 instead of Int32 is also available):

public static OrderablePartitioner<Tuple<long, long>> Create(
    long fromInclusive, long toExclusive);
Overloads of Parallel.ForEach accept instances of Partitioner<T> and OrderablePartitioner<T> as sources, allowing you to pass the result of a call to Partitioner.Create into a call to Parallel.ForEach. For now, think of both Partitioner<T> and OrderablePartitioner<T> as an IEnumerable<T>. The Tuple<Int32,Int32> represents a range from an inclusive value to an exclusive value. Consider the following sequential loop:

for (int i = from; i < to; i++)
{
    // ... Process i.
}
We could use a Parallel.For to parallelize it as follows:

Parallel.For(from, to, i =>
{
    // ... Process i.
});
Or, we could use Parallel.ForEach with a call to Partitioner.Create, wrapping a sequential loop over the range provided in the Tuple<Int32, Int32>, where the inclusiveLowerBound is represented by the tuple's Item1 and where the exclusiveUpperBound is represented by the tuple's Item2:

Parallel.ForEach(Partitioner.Create(from, to), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... Process i.
    }
});
While more complex, this affords us the ability to process very small loop bodies by eschewing some of the aforementioned costs. Rather than invoking a delegate for each body invocation, we’re now amortizing the cost of the delegate invocation across all elements in the chunked range. Additionally, as far as the parallel loop is concerned, there are only a few elements to be processed: each range, rather than each index. This implicitly decreases the cost of synchronization because there are fewer elements to load-balance.
While Parallel.For should be considered the best option for parallelizing for loops, if performance measurements show that speedups are not being achieved or that they're smaller than expected, you can try an approach like the one shown using Parallel.ForEach in conjunction with Partitioner.Create.
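Applied to the Console example in your question, the chunked pattern would look something like the following sketch. The body is still console I/O, so don't expect it to fix the numbers you measured; it just shows the shape of the technique:

// Requires using System.Collections.Concurrent; and using System.Threading.Tasks;
Parallel.ForEach(Partitioner.Create(0, lstUsers.Count), range =>
{
    // One delegate invocation per range instead of one per element;
    // the inner loop runs sequentially over its chunk.
    for (int i = range.Item1; i < range.Item2; i++)
    {
        Console.WriteLine(lstUsers[i].ToString());
    }
});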
Also, I consider writing to the Console to be a form of I/O, so I would also recommend reading the anti-patterns section on "PARALLEL LOOPS FOR I/O-BOUND WORKLOADS IN SCALABLE APPLICATIONS".
Upvotes: 1
Reputation: 50110
You are demonstrating that this algorithm (Console.WriteLine) is not a good candidate for parallelization. The overhead (which is not trivial) of running it on multiple threads is not won back by the time saved by running in parallel.
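For contrast, here is a sketch of a CPU-bound body, where the parallel version can win back its overhead (the busy-work loop is an arbitrary stand-in for real per-contact computation, not part of the original question):

Parallel.ForEach(lstUsers, item =>
{
    // Arbitrary CPU-bound stand-in; the result is discarded and
    // exists only to give each iteration real work to do.
    int hash = item.GetHashCode();
    for (int i = 0; i < 100000; i++)
    {
        hash = unchecked(hash * 31 + i);
    }
});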
Upvotes: 1