Reputation: 21
I want to find the best way to multiply two arrays element-wise. This is one part of a wider project where performance is a consideration, but not the only one.
I started writing some functions today in C# (LINQPad), so the code hasn't been optimised in any way. The output from the code below is as follows:
Environment.ProcessorCount: 4
Vector<double>.Count: 4
For sequential: 129ms, sum: 2.30619276241231E+25
Plinq: 344ms, sum: 2.30619276241231E+25
Parallel.For: 137ms, 2.30619276241231E+25
Simd sequential: 100ms, sum: 2.30619276241231E+25
Simd parallel: 761ms
This consists of the execution time for the multiplication plus a sum over the results as a check. There are a few odd results here (and I'm a little rusty in C#, so it could well be my code): Plinq is much slower than the plain sequential loop, Parallel.For is no faster than it, and the parallel SIMD version is by far the slowest of all.
My code is below; there is a reference to the NuGet System.Numerics.Vectors package. I'd appreciate any comments, suggestions, corrections or alternatives...
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using System.Numerics;
using System.Collections.Concurrent;
void Main()
{
    var random = new Random();
    var arraySize = 20_000_001;
    var x = new double[arraySize];
    var y = new double[arraySize];
    for (var i = 0; i < x.Length; ++i)
    {
        x[i] = random.Next();
        y[i] = random.Next();
    }
    Console.WriteLine($"Environment.ProcessorCount: {Environment.ProcessorCount}");
    Console.WriteLine($"Vector<double>.Count: {Vector<double>.Count}\n");
    MultiplyFor(x, y);
    MultiplyPlinq(x, y);
    MultiplyParallelFor(x, y);
    MultiplySIMD(x, y);
    MultiplyParallelSIMD(x, y);
}

void MultiplyPlinq(double[] x, double[] y)
{
    var result = new double[x.Length];
    var sw = new Stopwatch();
    sw.Start();
    result = ParallelEnumerable.Range(0, x.Length).Select(i => x[i] * y[i]).ToArray();
    sw.Stop();
    Console.WriteLine($"Plinq: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");
}

double SumCheck(double[] x)
{
    return Math.Round(x.Sum(), 4);
}

void MultiplyFor(double[] x, double[] y)
{
    var result = new double[x.Length];
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0; i < x.Length; ++i)
    {
        result[i] = x[i] * y[i];
    }
    sw.Stop();
    Console.WriteLine($"For sequential: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");
}

void MultiplyParallelFor(double[] x, double[] y)
{
    var result = new double[x.Length];
    var sw = new Stopwatch();
    sw.Start();
    Parallel.For(0, x.Length, i =>
    {
        result[i] = x[i] * y[i];
    });
    sw.Stop();
    Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds}ms, {SumCheck(result)}");
}

void MultiplySIMD(double[] x, double[] y)
{
    var sw = new Stopwatch();
    sw.Start();
    var result = MultiplyByVectors(x, y);
    sw.Stop();
    // 2 cores, 4 logical, 256b register
    Console.WriteLine($"Simd sequential: {sw.ElapsedMilliseconds}ms, sum: {SumCheck(result)}");
}

double[] MultiplyByVectors(double[] x, double[] y)
{
    var result = new double[x.Length];
    var vectorSize = Vector<double>.Count;
    int i;
    for (i = 0; i < x.Length - vectorSize; i += vectorSize)
    {
        var vx = new Vector<double>(x, i);
        var vy = new Vector<double>(y, i);
        (vx * vy).CopyTo(result, i);
    }
    for (; i < x.Length; i++)
    {
        result[i] = x[i] * y[i];
    }
    return result;
}

void MultiplyParallelSIMD(double[] x, double[] y)
{
    var sw = new Stopwatch();
    sw.Start();
    var chunkSize = (int)(x.Length / Environment.ProcessorCount);
    Parallel.For(0, Environment.ProcessorCount, i => {
        var complete = i * chunkSize;
        var take = Math.Min(chunkSize, x.Length - i * chunkSize);
        var xSegment = x.Skip((int)complete).Take((int)take);
        var ySegment = y.Skip((int)complete).Take((int)take);
        var result = MultiplyByVectors(xSegment.ToArray(), ySegment.ToArray());
    });
    sw.Stop();
    Console.WriteLine($"Simd parallel: {sw.ElapsedMilliseconds}ms");
}
Upvotes: 2
Views: 1638
Reputation: 43495
Parallel.For in its simplest form is not suitable for very granular workloads, because the overhead of invoking an anonymous function on each iteration negates the benefits of parallelism (anonymous functions can't be inlined). The solution is to partition the data, so that multiple partitions are processed in parallel, while each partition is processed with a fast direct loop:
Parallel.ForEach(Partitioner.Create(0, x.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        result[i] = x[i] * y[i];
    }
});
The built-in Partitioner, in its current implementation, creates as many partitions as the number of CPU cores x 3.
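If you want to experiment with the granularity yourself, there is also a Partitioner.Create overload that takes an explicit range size. A minimal sketch, assuming the same x, y and result arrays as in the snippet above (the rangeSize value is just an illustrative starting point to tune):

var rangeSize = 100_000; // hypothetical chunk size; tune for your workload
Parallel.ForEach(Partitioner.Create(0, x.Length, rangeSize), range =>
{
    // same direct inner loop as before, one delegate invocation per range
    for (int i = range.Item1; i < range.Item2; i++)
    {
        result[i] = x[i] * y[i];
    }
});

Larger ranges mean fewer delegate invocations but coarser load balancing, so the best value depends on the machine and the workload.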
Regarding parallelizing SIMD operations, in my own experiments I haven't observed impressive performance improvements on my PC. My theory (and this is just wild speculation, not an educated guess) is that the SIMD calculations happen so fast that the RAM can't keep up with the rate at which the CPU crunches the data.
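One way to test that theory is to combine the range partitioning above with the vectorised loop, reading the original arrays in place instead of copying segments with Skip/Take/ToArray as MultiplyParallelSIMD does. A minimal sketch, again assuming the same x, y and result arrays:

Parallel.ForEach(Partitioner.Create(0, x.Length), range =>
{
    var vectorSize = Vector<double>.Count;
    var i = range.Item1;
    // vectorised loop over the full vectors that fit inside this range
    for (; i <= range.Item2 - vectorSize; i += vectorSize)
    {
        var vx = new Vector<double>(x, i);
        var vy = new Vector<double>(y, i);
        (vx * vy).CopyTo(result, i);
    }
    // scalar loop for the few elements left at the end of the range
    for (; i < range.Item2; i++)
    {
        result[i] = x[i] * y[i];
    }
});

If this still doesn't beat the sequential SIMD version by much, that would be consistent with the loop being limited by memory bandwidth rather than by arithmetic.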
Upvotes: 3