Reputation: 32192
When I ask my machine for System.Numerics.Vector&lt;double&gt;.Count the answer is 4, so at least on my machine the SIMD registers are wide enough to hold 4 double-precision numbers.
I have tried to create a Vector3 of doubles based on System.Numerics.Vector&lt;double&gt;, but I don't think it's possible to create one with the same shape as System.Numerics.Vector3 that performs better than plain C# code without SIMD support. My attempt is below. I know it's terrible code; I just wanted to explore what I could do with Vector&lt;double&gt;.
There is no constructor for System.Numerics.Vector&lt;double&gt; that takes N arguments. I understand why: at compile time you don't know how many doubles fit into a Vector&lt;double&gt;, so the library writers protect me from shooting myself in the foot. However, if I'm willing to risk a bit of foot shooting, can I improve the code below?
using System.Numerics;
using System.Runtime.CompilerServices; // needed for [MethodImpl]

public struct Vector3Double
{
    public readonly double X;
    public readonly double Y;
    public readonly double Z;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Vector3Double(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    // Factory for SIMD Vector<double>, but it is slow because
    // it allocates an array on the heap for initialization.
    static Vector<double> vd(double x, double y, double z)
        => new Vector<double>(new[] { x, y, z, 0 });

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Dot(Vector3Double a, Vector3Double b)
    {
        var s = vd(a.X, a.Y, a.Z) * vd(b.X, b.Y, b.Z);
        return s[0] + s[1] + s[2];
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector3Double Add(Vector3Double a, Vector3Double b)
    {
        var s = vd(a.X, a.Y, a.Z) + vd(b.X, b.Y, b.Z);
        return new Vector3Double(s[0], s[1], s[2]);
    }
}
Upvotes: 1
Views: 2098
Reputation: 62
I wanted to provide a bit more information in response to this question because it will undoubtedly be found by anyone searching for more information about double-precision vectors and SIMD performance with C#'s System.Numerics namespace.
I made a working implementation of double-precision 4-dimensional vectors using Vector256 as the underlying data container inside my "Vector4D" struct, then implemented the constructors, properties, operators and methods I wanted to expose for 3D programming. Note that it required a bit of "unsafe" code, but that can be wrapped up and hidden from client code. I then used the BenchmarkDotNet NuGet package to test it against a "standard" struct that defines its own four double fields X, Y, Z, W and does "regular" math one member at a time. The benchmark shows roughly a 4x speedup for the SIMD-enabled "Vector4D" struct over "SlowVector". Better results than I expected on x64 (10th-gen i7 10700K; I haven't tried my i9 12900K yet).
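The approach described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual Vector4D: it assumes .NET 7+, where Vector256&lt;T&gt; exposes arithmetic operators directly, which also avoids the need for unsafe code in this simplified form.

```csharp
using System.Runtime.Intrinsics;

public readonly struct Vector4D
{
    private readonly Vector256<double> _v;

    private Vector4D(Vector256<double> v) => _v = v;

    public Vector4D(double x, double y, double z, double w)
        : this(Vector256.Create(x, y, z, w)) { }

    public double X => _v.GetElement(0);
    public double Y => _v.GetElement(1);
    public double Z => _v.GetElement(2);
    public double W => _v.GetElement(3);

    // Each operator maps to a single SIMD instruction when Vector256 is
    // hardware accelerated (e.g. AVX on x64); otherwise the runtime falls
    // back to a software implementation that is still correct.
    public static Vector4D operator +(Vector4D a, Vector4D b) => new(a._v + b._v);
    public static Vector4D operator -(Vector4D a, Vector4D b) => new(a._v - b._v);
    public static Vector4D operator *(Vector4D a, double s) => new(a._v * s);

    public static double Dot(Vector4D a, Vector4D b)
        => Vector256.Sum(a._v * b._v);
}
```

On older runtimes (or for bulk loads from memory) you would reach for the explicit intrinsics like Avx.Add and pointer-based loads instead, which is where the unsafe code mentioned above comes in.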
"Vector4D" vs "SlowVector" benchmark
(Note: The test name says "Addition" but it was actually expanded to use a mixture of all arithmetic operations in various combinations just to create a mixed-up workload)
For the sake of "fairness", both tests utilized the same data. I used an unmanaged block of memory on the heap and a Parallel.For loop to fill it with a bunch of double values. For each of the tests, the test simply interprets that memory as a Span with T as the type of structure being tested ("SlowVector" vs "Vector4D"). They each then perform an identical loop which does some mixed and matched arithmetic, including some vector by scalar computations, just to give it some "work" to run through and race to the finish line. I found that whether it was 100, 1,000, 100,000 or 1,000,000 vectors, the results come out pretty much the same on my machine with Vector4D (utilizing Vector256 internally) being the clear and obvious winner.
Note that this test could be far more rigorous and "scientific", but it was merely to prove a point and see whether a significant advantage could be had by using the generic vector types provided by System.Numerics for double-precision vectors. The results spoke loudly enough for me to pursue full, comprehensive implementations of these (i.e., Vector2D, Vector3D and Vector4D) and then move on to double-precision quaternions, matrices and eventually some sort of "TransformD" class for game engine objects.

If you want more accurate and detailed statistics, you could test different scenarios, such as what happens when you force the CPU to do just one or two SIMD vector ops at a time between other types of work, so that it can't run a smooth, contiguous stream of SIMD operations. That may introduce some overhead from loading the registers and going back to scalar values when something needs to access them. Basically, rigorous testing that collects accurate and realistic data means thinking up additional scenarios in which the performance of SIMD ops can be compared to "regular" arithmetic ops. Some sources suggest that SIMD carries overhead from loading/unloading the registers that can slow it down and make it less efficient, but if your code can be optimized to keep the registers loaded as it works on the data, it will indeed win by a wide margin (my results suggest up to 4x on my laptop's x64 CPU).
Implementing Vector2D will probably work out well using Vector128. The oddball is going to be Vector3D, because it has a funny alignment (3 components). I think that's the one that will require me to get more serious about testing. I would prefer not to pad it or waste any extra space by using Vector256 under the hood, for various reasons, but I'm afraid things may get inefficient if I have to keep loading those 3 double values into a Vector256 to do the math. Perhaps I can find a way to skirt around it, but I think it'll be a little tricky, even if I take the address of the structure and use some unsafe code or interpret it as a Span or something. I'm not sure how I'll do it, but I'll probably look at how Microsoft did their regular Vector3 with floats and follow their example.
I'm not an expert on CPUs, just a guy who's been programming for a long, long time, working in and around the game industry and picking up tricks and knowledge from really smart people. So don't take me as some kind of authority on the matter, but from the results here I think you can safely say that yes, you really can implement fast double-precision vector structures of your own that are much faster than "regular" math/computations, provided you implement and use them correctly, in ways that let the code benefit from SIMD.
Upvotes: 0
Reputation: 21
There is a way to do it if you treat your Vector3d as having the size of four doubles (32 bytes), i.e. as if it had a 4th coordinate. Then you can use Unsafe.Read&lt;Vector&lt;double&gt;&gt;(&amp;v3d), which performs an unsafe cast of Vector3d to Vector&lt;double&gt;. Please note that this will work only if Vector&lt;double&gt;.Count is 4! Once you have done the SIMD operations on Vector&lt;double&gt;, you can cast the result back to Vector3d using Unsafe.Read&lt;Vector3d&gt;(&amp;result).
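A minimal sketch of this cast trick, with illustrative names (requires AllowUnsafeBlocks). The struct is padded to four doubles so its 32-byte layout matches Vector&lt;double&gt; on hardware where Vector&lt;double&gt;.Count is 4:

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public struct Vector3d
{
    public double X, Y, Z;
    private double _pad; // the "4th coordinate": keeps the size at 32 bytes

    public Vector3d(double x, double y, double z) { X = x; Y = y; Z = z; _pad = 0; }

    public static unsafe Vector3d Add(Vector3d a, Vector3d b)
    {
        // The reinterpretation is only valid when Vector<double> holds
        // exactly 4 elements, so guard and fall back to scalar math.
        if (Vector<double>.Count != 4)
            return new Vector3d(a.X + b.X, a.Y + b.Y, a.Z + b.Z);

        var va = Unsafe.Read<Vector<double>>(&a);
        var vb = Unsafe.Read<Vector<double>>(&b);
        var sum = va + vb; // one SIMD add for all components
        return Unsafe.Read<Vector3d>(&sum);
    }
}
```

Unsafe.As&lt;TFrom, TTo&gt;(ref ...) can express the same reinterpretation without taking pointers, if you prefer to avoid the unsafe keyword at the call site.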
Upvotes: 2