Brian Antao
Brian Antao

Reputation: 131

Parallel.Foreach with localFinally gets stalled despite completing all iterations

In My Parallel.ForEach Loop the localFinally delegate does get called on all the threads. I have found this to happen as my Parallel Loop stalls. In my Parallel Loop I have about three condition check stages that return before completion of the Loop. And it seems that it is when the Threads are returned from these stages and not the execution of the entire body that it does not execute the localFinally delegate.

The Loop structure is as follows:

 var startingThread = Thread.CurrentThread;
 Parallel.ForEach(fullList, opt,
         ()=> new MultipleValues(),
         (item, loopState, index, loop) =>
         {
            if (cond 1)
                return loop;
            if (cond 2)
                {
                process(item);
                return loop;
                }
            if (cond 3)
                return loop;

            Do Work(item);
            return loop;
          },
          partial =>
           {
              Log State of startingThread and threads
            } );

I have run the loop on a small data set and logged in detail and found that while the Parallel.ForEach completes all the iterations and the Log at the last thread of localFinally is -- Calling Thread State is WaitSleepJoin for Thread 6 Loop Indx 16
the Loop still does not complete gracefully and remains stalled... any clues why the stalls ?

Cheers!

Upvotes: 3

Views: 1726

Answers (2)

svick
svick

Reputation: 244777

I think you just misunderstood what localFinally means. It's not called for each item, it's called for each thread that is used by Parallel.ForEach(). And many items can share the same thread.

The reason why it exists is that you can perform some aggregation independently on each thread, and join them together only in the end. This way, you have to deal with synchronization (and have it impact your performance) only in a very small piece of code.

For example, if you want to compute the sum of score for a collection of items, you could do it like this:

int totalSum = 0;
Parallel.ForEach(
    collection, item => Interlocked.Add(ref totalSum, ComputeScore(item)));

But here, you call Interlocked.Add() for every item, which can be slow. Using localInit and localFinally, you can rewrite the code like this:

int totalSum = 0;
Parallel.ForEach(
    collection,
    () => 0,
    (item, state, localSum) => localSum + ComputeScore(item),
    localSum => Interlocked.Add(ref totalSum, localSum));

Notice that the code uses Interlocked.Add() only in the localFinally and does access the global state in body. This way, the cost of synchronization is paid only a few times, once for each thread used.

Note: I used Interlocked in this example, because it is very simple and quite obviously correct. If the code was more complicated, I would use lock first, and try to use Interlocked only when it was necessary for good performance.

Upvotes: 1

Me.Name
Me.Name

Reputation: 12544

Just did a quick test run after seeing the definition of localFinally (executed after each thread finished), which had me suspecting that that could mean there would be far less threads created by parallelism than loops executed. e.g.

        var test = new List<List<string>> ();
        for (int i = 0; i < 1000; i++)
        {
            test.Add(null);
        }

        int finalcount = 0;
        int itemcount = 0;
        int loopcount = 0;

        Parallel.ForEach(test, () => new List<string>(),
            (item, loopState, index, loop) =>
            {
                Interlocked.Increment(ref loopcount);
                loop.Add("a");
                //Thread.Sleep(100);
                return loop;
            },
            l =>
            {
                Interlocked.Add(ref itemcount, l.Count);                    
                Interlocked.Increment(ref finalcount);                    
            });

at the end of this loop, itemcount and loopcount were 1000 as expected, and (on my machine) finalcount 1 or 2 depending on the speed of execution. In the situation with the conditions: when returned directly the execution is probably much faster and no extra threads are needed. only when the dowork is executed more threads are needed. However the parameter (l in my case) contains the combined list of all executions. Could this be the cause of the logging difference?

Upvotes: 1

Related Questions