user1146081
user1146081

Reputation: 195

Parallel C#. Why I can't see speed up when creating numeric array

Recently I've played with parallel loops. I started with simple tasks as it is filling in a huge array.

However, the creation time was half of second when the code was NOT parallel and 6.03s (sic!) when the code WAS parallel.

How come?

I thought there is no simpler task to show benefits from parallelism as dividing huge task to smaller one, as I did.

Could some one explain?

12GB RAM, i7 extreme 980 (6 cores + 6 virtual) 3.06G

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelLoop
{
    class Program
    {

        static void Main(string[] args)
        {

            int Min = 0;
            int Max = 10;
            int ArrSize = 150000000;

            Stopwatch sw2 = new Stopwatch();
            Stopwatch sw3 = new Stopwatch();


            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();
            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
                //test2[i] = randNum.Next(Min, Max);
            }
            sw2.Stop();

            Console.ReadKey();
            Console.WriteLine("Elapsed={0}", sw2.Elapsed);

            sw3.Start();

            Parallel.For(0, test3.Length, (j) =>
                {
                    test3[j] = j;
                    //test3[j] = randNum.Next(Min, Max);
                }
                );

            sw3.Stop();

            Console.WriteLine("Elapsed={0}", sw3.Elapsed);
            Console.ReadKey();

        }
    }
}

Upvotes: 0

Views: 98

Answers (4)

Andrew Morton
Andrew Morton

Reputation: 25013

I do not get the same drastic difference as you do (on an i7 920, nominally 2.66 GHz, 6 GB RAM - hence the smaller array size in the following code) between a simple loop and using simple parallelism.

As Euphoric pointed out, you need to partition the work - there is an overload of Parallel.ForEach which takes a RangePartitioner to do that for you, and in my testing it improved the speed somewhat:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int ArrSize = 100000000;

            Stopwatch sw2 = new Stopwatch();

            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];
            int[] test4 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();

            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
            }

            sw2.Stop();

            Console.WriteLine("Linear elapsed:          {0}", sw2.Elapsed);

            sw2.Restart();

            Parallel.For(0, test3.Length, (j) =>
            {
                test3[j] = j;
            }
                );

            sw2.Stop();

            Console.WriteLine("Simple parallel elapsed: {0}", sw2.Elapsed);

            sw2.Restart();

            var rangePartitioner = Partitioner.Create(0, test4.Length);
            Parallel.ForEach(rangePartitioner, (range, loopState) =>
            {
                for (int j = range.Item1; j < range.Item2; j++)
                {
                    test4[j] = j;
                }
            });

            sw2.Stop();

            Console.WriteLine("Partitioned elapsed:     {0}", sw2.Elapsed);

            Console.ReadLine();

        }
    }
}

Sample results:

Linear elapsed: 00:00:00.2312487
Simple parallel elapsed: 00:00:00.3735587
Partitioned elapsed: 00:00:00.1239631

I compiled for x64 and ran in release mode, not debug, as that is what counts in the end.

You also need to consider the processor's cache. There is an interesting article on that at Cache-Friendly Code: Solving Manycore's Need for Faster Data Access.

Upvotes: 1

Euphoric
Euphoric

Reputation: 12849

While the other answers do have valid point, they don't provide correct solution. You can use threads to improve the performance, but you must do it in the right way. In your case, you simply divide the whole array into N blocks (where N is amount of cores you have) and have each thread work in it's own block, without touching any other's block. This way, they don't have to worry about blocking each other.

Also note of warning. Random is not thread-save, so you should make sure that each thread has it's own instance. This will reduce randomness, but it is only way to use it with parallelism.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelLoop
{
    class Program
    {

        static void Main(string[] args)
        {

            int Min = 0;
            int Max = 10;
            int ArrSize = 150000000;

            Stopwatch sw2 = new Stopwatch();
            Stopwatch sw3 = new Stopwatch();
            Stopwatch sw4 = new Stopwatch();

            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];
            int[] test4 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();
            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
                //test2[i] = randNum.Next(Min, Max);
            }
            sw2.Stop();

            //Console.ReadKey();
            Console.WriteLine("Elapsed={0}", sw2.Elapsed);

            sw3.Start();

            Parallel.For(0, test3.Length, (j) =>
            {
                test3[j] = j;
                //test3[j] = randNum.Next(Min, Max);
            }
                );

            sw3.Stop();

            Console.WriteLine("Elapsed={0}", sw3.Elapsed);

            sw4.Start();

            int numberOfCores = 4;

            int itemsPerCore = ArrSize / numberOfCores;

            for (int i = 0; i < numberOfCores; i++)
            {
                int x = i; // for lambda closure
                var thread = new Thread(new ThreadStart(() =>
                {
                    int from = itemsPerCore * x;
                    int to = itemsPerCore * (x + 1);
                    for (int j = from; j < to; j++)
                    {
                        test4[j] = j;
                        //test4[j] = randNum.Next(Min, Max);                        
                    }
                }));

                thread.Start();
            }

            sw4.Stop();

            Console.WriteLine("Elapsed={0}", sw4.Elapsed);

            Console.ReadKey();
        }
    }
}

Upvotes: 2

John Koerner
John Koerner

Reputation: 38077

As xxbbcc has pointed out, it is probably the context switching that is taking longer than the setting of a simple array value. You could simulate a long running operation by sleeping the thread to get a better idea of the performance gain:

[TestMethod]
public void One()
{
    int Min = 0;
    int Max = 10;
    int ArrSize = 1500;

    Stopwatch sw2 = new Stopwatch();
    Stopwatch sw3 = new Stopwatch();


    int[] test2 = new int[ArrSize];
    int[] test3 = new int[ArrSize];

    Random randNum = new Random();

    sw2.Start();
    for (int i = 0; i < test2.Length; i++)
    {
        test2[i] = i;
        Thread.Sleep(10);
        //test2[i] = randNum.Next(Min, Max);
    }
    sw2.Stop();

    Console.WriteLine("Elapsed={0}", sw2.Elapsed);

    sw3.Start();

    Parallel.For(0, test3.Length, (j) =>
    {
        test3[j] = j;
        Thread.Sleep(10);
        //test3[j] = randNum.Next(Min, Max);
    }
        );

    sw3.Stop();

    Console.WriteLine("Elapsed={0}", sw3.Elapsed);
}

Which yields the output:

Elapsed=00:00:16.4813668
Elapsed=00:00:00.7327932

Upvotes: 1

xxbbcc
xxbbcc

Reputation: 17327

I have decided to make my comment an answer. I haven't actually run the sample code included in the question but your task inside the Parallel loop is very simple: setting an array slot to an integer value is about the simplest thing the CPU can do and it does it very, very fast.

Compared to this, the cost of creating and switching threads to split up your initialization loop is huge: a thread switch can be tens of thousands of CPU cycles and the more threads you have, the more switching must happen to keep them running.

So in the case of your example, the thread switching code likely eats up any possible gains you get from splitting up your otherwise very long loop. If you tried doing something more complex inside your loop, using a Parallel loop would gain you more because the cost of the thread switch (which is still huge) would be dwarfed by the cose of a single loop iteration.

Joe Duffy has several articles that mention the cost of a context switch - here's one worth reading - he mentions that the cost is somewhere between 4,000+ to 10,000+ CPU cycles to perform a context switch.

Upvotes: 1

Related Questions