Cristian Diaconescu

Reputation: 35641

Why is Parallel.For slower for this seemingly easily parallelizable operation?

I was having some fun with http://AdventOfCode.com (day 4 challenge)

One of their problems involves taking each number 1, 2, 3, ..., treating it as a string, and computing the MD5 hash of that string. The task is to find the first (smallest) number whose MD5 hash (in hex) starts with 6 zeros.

I solved it with a simple for loop, but it took around 35 seconds (running in a Win10 VM on a 2012 i5 MacBook).

Seeing that the CPU utilization is fairly low, I tried the simplest optimization that came to my mind - TPL, more precisely Parallel.For.

To my surprise, the (first) result was retrieved after 42 seconds, so worse than single-threaded. The CPU utilization was way up, as expected.

Here is my C# code. Comment one or the other line for single-threaded vs TPL.

using System;
using System.Text;
using System.Threading;
using System.Security.Cryptography;
using System.Diagnostics;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        Day4.P2();
    }
}

class Day4
{
    //not thread-safe, make one instance per thread
    static ThreadLocal<MD5> md5Hash = new ThreadLocal<MD5>(MD5.Create);

    public static int P2()
    {
        string input = "yzbqklnj";
        var sw = Stopwatch.StartNew();

        Action<int> checkAction = i =>
        {
            var hashBytes = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
            if (hashBytes[0] == 0 && hashBytes[1] == 0 && hashBytes[2] == 0)
            {
                Console.WriteLine(i + ": " + sw.Elapsed);
            }
        };

        //for (var i = 0;; i++) { checkAction(i); }
        Parallel.For(1, int.MaxValue, checkAction);

        return 0;
    }
}

My question is: why is the parallel version not decidedly superior?

How does it partition its data among the threads?

PS. When running on an actual Windows machine the results are similar, however (expectedly) the first result is not the smallest (i.e. the correct result of the problem).

Upvotes: 1

Views: 332

Answers (5)

Antonín Lejsek

Reputation: 6103

The behaviour is to be expected: the threads divide the range 1..int.MaxValue between them somehow. This is a huge range, so almost all threads end up working on ridiculously big numbers. One thread may do useful work starting from the beginning, but even that is not guaranteed, so the results are unpredictable. I measured these times for your program (time to the correct result):

original serial: 00:00:28.27 
original parallel: 00:00:24.53

You can code the chunking by hand, but there is one simpler thing to try first: define the sequence as ordered.

int result = Enumerable.Range(1, int.MaxValue)
    //.AsParallel()
    //.AsOrdered()
    .Where(i =>
    {
        var hashBytes = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
        return (hashBytes[0] == 0 && hashBytes[1] == 0 && hashBytes[2] == 0);
    }).First();

Console.WriteLine(result + ": " + sw.Elapsed);

I first commented out the two lines to make it serial.

enumerable serial: 00:00:26.68
ordered parallel: 00:01:53.41

This is a real surprise. While the first number actually is found quickly (it can be printed to the console from inside the Where condition after about 9.2 seconds), it turns out that the engine does not merge results until every thread has returned at least one value (or run out of sequence, presumably). So most of the time we are waiting for the slowest thread to find its first value. Moving the Console.WriteLine back into the Where condition would bring back the ordering problem: while the results are guaranteed, the order of processing is not.
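If the per-partition buffering is the problem, PLINQ's merge options can ask for unbuffered output. A minimal sketch, reusing `input` and the `md5Hash` ThreadLocal from the question; note this is an assumption about the merge behaviour, `NotBuffered` is only a hint, and without `AsOrdered()` the query returns *a* match, not necessarily the smallest:

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

class MergeSketch
{
    // MD5 is not thread-safe, so keep one instance per thread
    static ThreadLocal<MD5> md5Hash = new ThreadLocal<MD5>(MD5.Create);

    static void Main()
    {
        string input = "yzbqklnj";
        // NotBuffered asks PLINQ to hand each matching element to the
        // consumer as soon as a worker produces it, instead of buffering
        // per-partition output until the merge step.
        int result = Enumerable.Range(1, int.MaxValue)
            .AsParallel()
            .WithMergeOptions(ParallelMergeOptions.NotBuffered)
            .Where(i =>
            {
                var h = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
                return h[0] == 0 && h[1] == 0 && h[2] == 0;
            })
            .First();
        Console.WriteLine(result);
    }
}
```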

In the end, chunking is not that hard:

const int chunkSize = 100000;
int result = int.MaxValue;
object foundLock = new object();
for (int chunk = 1; chunk < int.MaxValue; chunk += chunkSize)
{
    Parallel.For(chunk, chunk + chunkSize, (i) =>
      {
          var hashBytes = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
          if (hashBytes[0] == 0 && hashBytes[1] == 0 && hashBytes[2] == 0)
          {
              lock (foundLock)
              {
                  result = Math.Min(result, i);
              }
          }
      });

    if (result < int.MaxValue)
    {
        Console.WriteLine(result + ": " + sw.Elapsed);
        break;
    }
}

Result time

chunked parallel: 00:00:08.85

Upvotes: 1

Ronan Thibaudau

Reputation: 3603

Parallelization pays off as the work per operation grows. If the work per operation is really tiny, the parallel version may actually be slower, because the added cost of having more threads and switching between them can exceed the savings from using multiple CPUs.

To test this, see whether you get similar results after the following change: replace int.MaxValue with int.MaxValue / 1000 and wrap the inside of your lambda in a loop from j = 1 to 1000, so that one unit is 1000 times more work. That leaves less time spent scheduling tasks and more time "in" each task.

Depending on the results you get there we'll be able to draw some conclusions.
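The suggested change can be sketched as follows, reusing `input`, `sw`, and the `md5Hash` ThreadLocal from the question's code (the batch size of 1000 is the one proposed above; the index is computed as a long to avoid overflow near the top of the range):

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class BatchSketch
{
    // MD5 is not thread-safe, so keep one instance per thread
    static ThreadLocal<MD5> md5Hash = new ThreadLocal<MD5>(MD5.Create);

    static void Main()
    {
        string input = "yzbqklnj";
        var sw = Stopwatch.StartNew();

        // Each parallel iteration now hashes 1000 consecutive numbers,
        // so the scheduling overhead is amortized over a bigger unit of work.
        Parallel.For(0, int.MaxValue / 1000, block =>
        {
            for (int j = 0; j < 1000; j++)
            {
                long i = (long)block * 1000 + j;
                var h = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
                if (h[0] == 0 && h[1] == 0 && h[2] == 0)
                    Console.WriteLine(i + ": " + sw.Elapsed);
            }
        });
    }
}
```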

Upvotes: 0

Andrew Palmer

Reputation: 3013

Try calling Console.WriteLine(...) on every iteration. You will probably see something at least as interesting as I see on my machine...

2: 00:00:00.0030730
1: 00:00:00.0031281
1073741824: 00:00:00.0033078
1073741825: 00:00:00.0080216
1073741826: 00:00:00.0080340
1073741827: 00:00:00.0080457

...

1073745189: 00:00:00.0663925
1073745190: 00:00:00.0664038
1073745191: 00:00:00.0664155
85: 00:00:00.0489811
86: 00:00:00.0666171
87: 00:00:00.0666364

...

40451: 00:00:01.1846214
1073753653: 00:00:01.1846293
40452: 00:00:01.1846365
1073753654: 00:00:01.1846440
40453: 00:00:01.1846527
1073753655: 00:00:01.1846633
40454: 00:00:01.1846750

... etc.

The loop counter can increment very quickly, but because the much more expensive MD5 computations cannot keep up, you are likely to see all sorts of values tested in a seemingly arbitrary order.

Testing in chunks, as suggested by usr (the chunks need not be larger than the number of cores on your system), is one way to take advantage of multi-threaded processing, but keep in mind that doing so changes the logical behaviour of your algorithm: running your tests in a specified order is not something a multi-threaded solution can guarantee.

Upvotes: 0

displayName

Reputation: 14379

Why is parallel version not decidedly superior?

Because there is no reason for it to be: there is no guaranteed order of processing. It may well be that your threads are all busy processing numbers whose hashes do not start with 6 zeros, while the single thread of the sequential version happens to arrive at the first correct number sooner.


How does it [i.e. TPL] partition its data among its threads?

The exact method is not mentioned on MSDN, but the key principle is Load Balancing. Quoting MSDN page on Data Parallelism (Task Parallel Library):

Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
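If you want to take partitioning out of the scheduler's hands, TPL also exposes it directly via System.Collections.Concurrent.Partitioner. A sketch, reusing `input` and the `md5Hash` ThreadLocal from the question (the block size of 100000 is an arbitrary choice): range partitions are generated in order and handed out dynamically, so no worker starts at a ridiculously large number.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class PartitionerSketch
{
    // MD5 is not thread-safe, so keep one instance per thread
    static ThreadLocal<MD5> md5Hash = new ThreadLocal<MD5>(MD5.Create);

    static void Main()
    {
        string input = "yzbqklnj";
        // Partitioner.Create(from, to, rangeSize) yields Tuple<int, int>
        // ranges of at most rangeSize elements; workers grab them one by one.
        var ranges = Partitioner.Create(1, int.MaxValue, 100000);
        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                var h = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
                if (h[0] == 0 && h[1] == 0 && h[2] == 0)
                    Console.WriteLine(i);
            }
        });
    }
}
```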


Finally, as already expected, the answer from the parallel version is wrong; however, the numbers I got for parallel vs. sequential are vastly different from what you have stated. I got:

  • Sequential -
    • First Number: 9962624; Elapsed Time: 20.51 seconds
  • Parallel -
    • First Number: 1343955022; Elapsed Time: 10.06 seconds

Also, later on, the parallel version gave the next numbers at 21.7 seconds (9962624), 22.06 seconds (541160794), 23.59 seconds (541640646) respectively.


I have nothing revolutionary to conclude here, only to reiterate that clearly

  1. the outcome depends on how the data gets partitioned "behind the scenes" by TPL, and
  2. how that partitioning takes place is not documented.

Upvotes: 1

usr

Reputation: 171178

How does it partition its data among the threads?

That is the key question, and it is not specified. It might well check bad numbers first by coincidence. You are probably getting a wrong answer as well, because you are not guaranteed the smallest possible i, just any match.

I would do this: process chunks of, say, 10000 numbers each, roughly in order. As soon as a match is found, abort processing of all chunks beyond the one containing the match. You might still find a few more, smaller matches and have to pick the smallest one.

I'm not quite sure how best to implement this with the TPL. Parallel.ForEach loops can be aborted, but I don't know whether they can be reliably ordered. Probably.
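One built-in mechanism that comes close to this "abort everything above the match" idea is ParallelLoopState.Break(). A sketch, reusing `input` and the `md5Hash` ThreadLocal from the question (whether it performs well on this workload is untested): Break() guarantees that no iterations above the breaking index are started, while lower ones still complete, and the loop result records the lowest index that called Break, which with this predicate is exactly the smallest match.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class BreakSketch
{
    // MD5 is not thread-safe, so keep one instance per thread
    static ThreadLocal<MD5> md5Hash = new ThreadLocal<MD5>(MD5.Create);

    static void Main()
    {
        string input = "yzbqklnj";
        var result = Parallel.For(1, int.MaxValue, (i, state) =>
        {
            var h = md5Hash.Value.ComputeHash(Encoding.UTF8.GetBytes(input + i));
            if (h[0] == 0 && h[1] == 0 && h[2] == 0)
            {
                // Iterations below i still complete; none above i are started.
                state.Break();
            }
        });
        // LowestBreakIteration is the lowest index from which Break was called.
        Console.WriteLine(result.LowestBreakIteration);
    }
}
```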

Upvotes: 0
