Farshid

Reputation: 25

Parallelizing pi calculation

The program below calculates pi = 3.1415. I originally wrote it in C++ using OpenMP (#pragma omp for with reduction(+:sum)). I am trying to do the same calculation in C# (I'm a newbie) using Parallel.For, but for some reason I can't make it work. How do I make sum private to each thread in C#?

using System;
using System.Diagnostics;
using System.Threading.Tasks;


namespace pi
{
    class Program
    {
        static void Main(string[] args)
        {
            long num_steps = 100000000;
            double step;
            double x, pi, sum = 0.0;
            step = 1.0 / num_steps;
            Stopwatch timer = Stopwatch.StartNew();
            Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
                {
                    // NOTE: x and sum are shared by every thread here, so these updates race
                    x = (i - 0.5) * step;
                    sum = sum + 4.0 / (1.0 + x * x);
                });            
            pi = step * sum;
            timer.Stop();
            Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));
            Console.ReadKey();
        }
    }
}

Upvotes: 1

Views: 1257

Answers (4)

Aleksey L.

Reputation: 37996

I don't think that by parallelizing this task with Parallel.For you'll get a better (faster) result. But if you still want to do this, have a look at the ThreadLocal<T> class. It provides "private" storage for each thread.

long num_steps = 100000000;
var sum = new ThreadLocal<double>(true); // trackAllValues: true, so every thread's value can be read later via sum.Values
var step = 1.0 / num_steps;
Stopwatch timer = Stopwatch.StartNew();
Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
{
    var x = (i - 0.5) * step;
    sum.Value = sum.Value + 4.0 / (1.0 + x * x);
});
var pi = step * sum.Values.Sum();
timer.Stop();
sum.Dispose();
Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));
Console.ReadKey();

Upvotes: 2

Martin Liversage

Reputation: 106916

You have provided a solution to the problem as an answer to your own question, and your solution is the correct way to use Parallel.For for this task. However, you seem to get inconsistent (and slow) results, but this is most likely because you compile in debug mode. Switching to release mode will give much better performance, with consistent results where each new thread increases the throughput.
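If you build from the command line, that might look something like this (the solution name is just a placeholder):

msbuild pi.sln /p:Configuration=Release

or, for an SDK-style / .NET Core project:

dotnet build -c Release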

Parallel.For provides a general way to partition your computation into N different parts that are executed in parallel. However, when using Parallel.For the body of the loop is a delegate and that (and probably other stuff) adds some overhead. You can avoid that by doing the partitioning yourself.

Let's assume that you want to partition the problem into two parts (e.g. using two threads). You can then calculate the partial sum for indices 0-499,999,999 on one thread and the partial sum for indices 500,000,000-999,999,999 on another thread. The final result is computed by summing the partial sums. This idea extends to more threads.

You need a function to compute the partial sum:

Double ComputePartialSum(Int32 startIndex, Int32 count, Double step) {
  var endIndex = startIndex + count;
  var partialSum = 0D;
  for (var i = startIndex; i < endIndex; i += 1) {
    var x = (i - 0.5D)*step;
    partialSum += 4D/(1D + x*x);
  }
  return partialSum;
}

You then need to start a number of tasks (one for each degree of parallelism) to compute all the partial sums:

var degreesOfParallelism = Environment.ProcessorCount; // Or another value
var stepCount = 1000000000;
var step = 1D/stepCount;
var stopwatch = Stopwatch.StartNew();
var partitionSize = stepCount/degreesOfParallelism;
var tasks = Enumerable
  .Range(0, degreesOfParallelism)
  .Select(
    partition => {
      var count = partition < degreesOfParallelism - 1
        ? partitionSize
        : stepCount - (degreesOfParallelism - 1)*partitionSize;
      return Task.Run(() => ComputePartialSum(partition*partitionSize, count, step));
    }
  )
  .ToArray();
Task.WaitAll(tasks);
var sum = tasks.Sum(task => task.Result);
stopwatch.Stop();
var pi = step*sum;

If you want to step it up a notch you can employ SIMD instructions in the for loop of each partition using System.Numerics. You will have to do this in a 64-bit process (I believe using RyuJIT, which is a relatively recent change to the .NET JIT). On my computer Vector<Double> can contain up to 4 elements, so if a partition should process, say, 100,000 elements, the for loop can be reduced to 25,000 iterations by computing 4 doubles in parallel inside the loop.
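In a classic .NET Framework project that typically means targeting x64, for example with something like this in the .csproj (a standard MSBuild property, shown only as an illustration; your project setup may differ):

<PropertyGroup>
  <!-- force a 64-bit process so RyuJIT is used and Vector<T> is hardware accelerated -->
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>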

The code is not very pretty but it gets the job done. You need to perform heap allocations to get values into and out of a vector, but I am careful not to perform any allocations inside the body of the loop:

Double ComputePartialSum(Int32 startIndex, Int32 count, Double step) {
  var vectorSize = Vector<Double>.Count;
  var remainder = count%vectorSize;
  var endIndex = startIndex + count - remainder;
  var partialSumVector = Vector<Double>.Zero;
  var iVector = new Vector<Double>(Enumerable.Range(startIndex, vectorSize).Select(i => (Double) i).ToArray());
  var loopIncrementVector = new Vector<Double>(Enumerable.Repeat((Double) vectorSize, vectorSize).ToArray());
  var point5Vector = new Vector<Double>(Enumerable.Repeat(0.5D, vectorSize).ToArray());
  var stepVector = new Vector<Double>(Enumerable.Repeat(step, vectorSize).ToArray());
  var fourVector = new Vector<Double>(Enumerable.Repeat(4D, vectorSize).ToArray());
  for (var i = startIndex; i < endIndex; i += vectorSize) {
    var xVector = (iVector - point5Vector)*stepVector;
    partialSumVector += fourVector/(Vector<Double>.One + xVector*xVector);
    iVector += loopIncrementVector;
  }
  var partialSumElements = new Double[vectorSize];
  partialSumVector.CopyTo(partialSumElements);
  var partialSum = partialSumElements.Sum();
  for (var i = endIndex; i < startIndex + count; i += 1) {
    var x = (i - 0.5D)*step;
    partialSum += 4D/(1D + x*x);
  }
  return partialSum;
}

On my computer with 4 cores with hyper-threading enabled I get the following results (durations are in seconds):

Parallelism | Parallel.For | Partitioning | SIMD
------------+--------------+--------------+------
1           | 6.541        | 3.951        | 1.985
2           | 3.278        | 1.998        | 1.045
3           | 2.218        | 1.422        | 0.739
4           | 1.909        | 1.245        | 0.637
5           | 1.748        | 1.140        | 0.586
6           | 1.579        | 1.039        | 0.523
7           | 1.435        | 0.991        | 0.492
8           | 1.392        | 0.968        | 0.491

Obviously, these numbers will vary slightly on each test run.

As you can see, there are diminishing returns as the number of threads increases, but that is not a big surprise because my computer only has 4 physical cores.

Upvotes: 1

Farshid

Reputation: 25

Thanks guys, I have made some improvements; however, the performance is still not consistent. When I run the original code in C++, it consistently gives me a speedup with the addition of more threads. The results look like:

C++ results

#include "stdafx.h"
#include <stdio.h>
#include <iostream>
#include <omp.h>
using namespace std;

static long num_steps = 1e9;
long double step;

int main()
{
    int i;
    double x, pi, sum = 0.0;
    double start_time, run_time;

    step = 1.0 / (double)num_steps;
    for (i = 1;i <= omp_get_num_procs();i++) {

        sum = 0.0;
        omp_set_num_threads(i);
        start_time = omp_get_wtime();
#pragma omp parallel 
        {

#pragma omp for reduction(+:sum) private(x) schedule(static) 
            for (i = 1;i <= num_steps; i++) {
                x = (i - 0.5)*step;
                sum = sum + 4.0 / (1.0 + x*x);
            }
        }
        pi = step * sum;
        run_time = omp_get_wtime() - start_time;
        printf("%d thread(s) ---> pi  %f calculated in %f seconds\n", i, pi, run_time);
    }
    system("pause");
}

However, when I run the C# code the results are not consistent.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;
namespace pi
{

    class Program
    {        
        static void Main(string[] args)
        {
            object lockObject = new object();
            long num_steps = (long)1E9;
            double step = 1.0 / num_steps;
            for (int j = 1; j <= Environment.ProcessorCount; j++)
            {
                double sum = 0.0;
                Stopwatch timer = Stopwatch.StartNew();
                Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = j }, () => 0.0, (i, loopState, partialResult) =>
                {
                    var x = (i - 0.5) * step;
                    return partialResult + 4.0 / (1.0 + x * x);
                },
                localPartialSum =>
                {
                    lock (lockObject)
                    {
                        sum += localPartialSum;
                    }
                });
                var pi = step * sum;
                timer.Stop();
                Console.WriteLine($"{j} thread(s) ----> pi = {pi} calculated in {(timer.ElapsedMilliseconds)/1000.0} seconds");
            }
            Console.ReadKey();
        }
    }
}

The results look like:

C# results

Apart from C# being much slower than C++ in this case, do you have any suggestions why the C# results are not consistent? Every time you run it you get a different result.

Upvotes: 0

Alexander Petrov

Reputation: 14251

You can simply add a lock to the existing algorithm:

// Don't do this!
lock (lockObject)
{
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
}

But it is very slow.

Try this approach:

object lockObject = new object();

long num_steps = 100000000;
Stopwatch timer = Stopwatch.StartNew();
double step = 1.0 / num_steps;
double sum = 0;

Parallel.For(1, num_steps + 1, () => 0.0, (i, loopState, partialResult) =>
{
    var x = (i - 0.5) * step;
    return partialResult + 4.0 / (1.0 + x * x);
},
localPartialSum =>
{
    lock (lockObject)
    {
        sum += localPartialSum;
    }
});

var pi = step * sum;
timer.Stop();
Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));

It uses an overloaded version of the Parallel.For method that implements the parallel aggregation pattern.
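For reference, the overload used above has roughly this shape (TLocal is double in the code above):

// Parallel.For overload with thread-local state
ParallelLoopResult For<TLocal>(
    long fromInclusive, long toExclusive,
    Func<TLocal> localInit,                             // creates each thread's partial sum (here () => 0.0)
    Func<long, ParallelLoopState, TLocal, TLocal> body, // adds one term to the thread's partial sum
    Action<TLocal> localFinally);                       // merges the partial sum into the total under the lock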

On my system this is 7 times faster than the original algorithm with the addition of locks, and 2 times faster than the version of Aleksey L.

Upvotes: 1
