Reputation: 1492
I am trying to boost performance in a .NET Core library by using System.Numerics to perform SIMD operations on float[] arrays. System.Numerics is a bit funky right now, and I'm having a hard time seeing how it can be beneficial. I understand that for SIMD to pay off, the overhead must be amortized over a large amount of computation, but given how it is currently implemented, I can't figure out how to accomplish this.
Vector<float> requires exactly Vector<float>.Count values (8 on my machine) - no more, no less. If I want to perform SIMD operations on a group of values smaller than 8, I am forced to copy the values to a new array and pad the remainder with zeroes. If the group is larger than 8, I need to copy the values, pad with zeroes so the length is a multiple of 8, and then loop over them. The length requirement makes sense, but accommodating it seems like a good way to nullify any performance gain.
I have written a test wrapper class that takes care of the padding and alignment:
using System;
using System.Numerics;

public readonly struct VectorWrapper<T>
    where T : unmanaged
{
    #region Data Members

    public readonly int Length;
    private readonly T[] data_;

    #endregion

    #region Constructor

    public VectorWrapper( T[] data )
    {
        Length = data.Length;

        // Pad the backing array up to the next multiple of Vector<T>.Count
        // (always at least one extra slot) so every slice handed to
        // Vector<T> is long enough.
        var stepSize = Vector<T>.Count;
        var bufferedLength = data.Length - ( data.Length % stepSize ) + stepSize;
        data_ = new T[ bufferedLength ];
        data.CopyTo( data_, 0 );
    }

    #endregion

    #region Public Methods

    public T[] ToArray()
    {
        // Trim the zero padding back off before handing the data out.
        var returnData = new T[ Length ];
        data_.AsSpan( 0, Length ).CopyTo( returnData );
        return returnData;
    }

    #endregion

    #region Operators

    public static VectorWrapper<T> operator +( VectorWrapper<T> l, VectorWrapper<T> r )
    {
        var resultLength = l.Length;

        // Allocates a zeroed temp array, which the constructor then copies
        // into a freshly padded buffer.
        var result = new VectorWrapper<T>( new T[ l.Length ] );

        var lSpan = l.data_.AsSpan();
        var rSpan = r.data_.AsSpan();
        var stepSize = Vector<T>.Count;

        for( var i = 0; i < resultLength; i += stepSize )
        {
            var lVec = new Vector<T>( lSpan.Slice( i ) );
            var rVec = new Vector<T>( rSpan.Slice( i ) );
            Vector.Add( lVec, rVec ).CopyTo( result.data_, i );
        }

        return result;
    }

    #endregion
}
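For illustration, here is a minimal usage example (the values are arbitrary; the wrapper pads the five floats up to Vector<float>.Count internally and trims again on the way out):

var a = new VectorWrapper<float>( new float[] { 1, 2, 3, 4, 5 } );
var b = new VectorWrapper<float>( new float[] { 5, 4, 3, 2, 1 } );
float[] sum = ( a + b ).ToArray();   // { 6, 6, 6, 6, 6 }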
This wrapper does the trick. The calculations appear to be correct, and Vector<T> doesn't complain about the element count. However, it is almost twice as slow as a plain for loop.
Here's the benchmark:
using BenchmarkDotNet.Attributes;

public class VectorWrapperBenchmarks
{
    #region Data Members

    private static float[] arrayA;
    private static float[] arrayB;
    private static VectorWrapper<float> vecA;
    private static VectorWrapper<float> vecB;

    #endregion

    #region Constructor

    public VectorWrapperBenchmarks()
    {
        arrayA = new float[ 1024 ];
        arrayB = new float[ 1024 ];

        for( var i = 0; i < 1024; i++ )
            arrayA[ i ] = arrayB[ i ] = i;

        vecA = new VectorWrapper<float>( arrayA );
        vecB = new VectorWrapper<float>( arrayB );
    }

    #endregion

    [Benchmark]
    public void ForLoopSum()
    {
        var aA = arrayA;
        var aB = arrayB;
        var result = new float[ 1024 ];

        for( var i = 0; i < 1024; i++ )
            result[ i ] = aA[ i ] + aB[ i ];
    }

    [Benchmark]
    public void VectorSum()
    {
        var vA = vecA;
        var vB = vecB;
        var result = vA + vB;
    }
}
And the results:
| Method | Mean | Error | StdDev |
|----------- |-----------:|---------:|---------:|
| ForLoopSum | 757.6 ns | 15.67 ns | 17.41 ns |
| VectorSum | 1,335.7 ns | 17.25 ns | 16.13 ns |
My processor (i7-6700k) does support SIMD hardware acceleration, and this is running in Release mode, 64-bit with optimizations enabled on .NET Core 2.2 (Windows 10).
I realize that Array.CopyTo() is likely a large part of what is killing performance, but there seems to be no easy way to have both padding/alignment and data sets that don't already conform to Vector<T>'s size requirements.
I'm rather new to SIMD, and I understand that the C# implementation is still in its early days. However, I don't see a clear way to actually benefit from it, especially considering that it is most beneficial when amortized over larger data sets.
Is there a better way to go about this?
Upvotes: 5
Views: 1607
Reputation: 724
I'm not sure what you mean by "funky", but it's perfectly usable right now (although it probably could be more performant). Using your case (summing floats), I get the following results over 10003 items with an elderly Haswell CPU:
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.706 (1803/April2018Update/Redstone4)
Intel Core i7-4500U CPU 1.80GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=1753753 Hz, Resolution=570.2057 ns, Timer=TSC
.NET Core SDK=2.1.602
[Host] : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
DefaultJob : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
| Method | Mean | Error | StdDev |
|--------- |----------:|----------:|----------:|
| ScalarOp | 12.974 us | 0.2579 us | 0.2533 us |
| VectorOp | 3.956 us | 0.0570 us | 0.0505 us |
| CopyData | 1.455 us | 0.0273 us | 0.0228 us |
The copying of data from the vector back to the array is (relatively) slow, taking up over a third of the vectorized time. Still, the overall time of the vectorized operation is less than a third of the scalar one...
Looking at the disassembly (BenchmarkDotNet can generate it), it seems that the memory copy uses the (slower) unaligned instruction. Possibly a future version of .NET Core will address that.
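To reproduce that, one way (assuming a reasonably recent BenchmarkDotNet; the class name below is just a placeholder) is to put the disassembly diagnoser on the benchmark class:

using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser]   // dumps the JIT-generated assembly for each benchmark
public class SumBenchmarks
{
    // ... benchmarks ...
}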
You can avoid the copy operation completely by using Span<T> and MemoryMarshal.Cast to write the resulting vectors straight into a Span. That cuts the summing time by roughly a third compared to copying (the timing run itself is not shown below, but a sketch of the idea follows).
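A minimal sketch of the technique, assuming the destination span does not overlap the sources (this is illustrative code, not the run that produced the number above):

using System;
using System.Numerics;
using System.Runtime.InteropServices;

static void VectorOpSpans(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> sums)
{
    // Reinterpret the float spans as spans of Vector<float>; the cast covers
    // only the prefix whose length is a multiple of Vector<float>.Count.
    var va = MemoryMarshal.Cast<float, Vector<float>>(a);
    var vb = MemoryMarshal.Cast<float, Vector<float>>(b);
    var vs = MemoryMarshal.Cast<float, Vector<float>>(sums);

    for (int i = 0; i < va.Length; i++)
    {
        vs[i] = va[i] + vb[i];   // written straight into sums, no CopyTo
    }
    // Scalar tail for the leftover elements.
    for (int i = va.Length * Vector<float>.Count; i < a.Length; i++)
    {
        sums[i] = a[i] + b[i];
    }
}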
For reference, the benchmark code is below (floatSlots = Vector<float>.Count; the arrays are created and filled with data before the benchmark run) and is not necessarily an optimal solution:
[Benchmark]
public void ScalarOp()
{
    for (int i = 0; i < data1.Length; i++)
    {
        sums[i] = data1[i] + data2[i];
    }
}

[Benchmark]
public void VectorOp()
{
    // Vectorized prefix: process floatSlots elements per iteration.
    int ceiling = data1.Length / floatSlots * floatSlots;
    for (int i = 0; i < ceiling; i += floatSlots)
    {
        Vector<float> v1 = new Vector<float>(data1, i);
        Vector<float> v2 = new Vector<float>(data2, i);
        (v1 + v2).CopyTo(sums, i);
    }
    // Scalar tail: the leftover elements past the last full vector.
    for (int i = ceiling; i < data1.Length; i++)
    {
        sums[i] = data1[i] + data2[i];
    }
}

[Benchmark]
public void CopyData()
{
    // Measures only the cost of writing vectors back to the array.
    Vector<float> v1 = new Vector<float>(8);
    int ceiling = data1.Length / floatSlots * floatSlots;
    for (int i = 0; i < ceiling; i += floatSlots)
    {
        v1.CopyTo(sums, i);
    }
    for (int i = ceiling; i < data1.Length; i++)
    {
        sums[i] = 8;
    }
}
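The supporting fields are only described in prose; a plausible reconstruction of the setup (mine, not the answerer's actual code) would look like this:

using System.Numerics;
using BenchmarkDotNet.Attributes;

public class SumBenchmarks   // hypothetical host class for the benchmarks above
{
    private float[] data1, data2, sums;
    private static readonly int floatSlots = Vector<float>.Count;

    [GlobalSetup]
    public void Setup()
    {
        data1 = new float[10003];   // 10003 is not a multiple of floatSlots,
        data2 = new float[10003];   // so the scalar tail loops actually run
        sums  = new float[10003];
        for (int i = 0; i < data1.Length; i++)
            data1[i] = data2[i] = i;
    }

    // ... ScalarOp, VectorOp, CopyData as above ...
}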
Edit: corrected the scalar benchmark so it matches the vector one, and added the mention of Span<T> and MemoryMarshal.Cast.
Upvotes: 3