Kelly

Reputation: 295

Why might a 2D vector struct be slower to add than a 3D vector struct in C#?

I recently made a benchmarking app to explore a few approaches to writing addition operators for math structs in C#: https://github.com/nickgravelyn/math-struct-benchmark. Among the results, I found that Vector2 was consistently slower than Vector3, despite it holding less data and needing fewer instructions. More intriguing, this appears to be the case on every runtime/JIT I tested.

For example, when running on .NET Core 2.2, the benchmark for the + operator for one of the tested Vector2 implementations took 921.82 ms, whereas the comparable Vector3 implementation took 422.76 ms.

Is there some reason either from C#, IL, or native assembly that would explain why I might see these results? Or did I mess something up in my benchmark somewhere that I can't seem to spot?
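To make the measurement concrete, here is a minimal sketch of this kind of timing loop. It is not the code from the linked repository: the `Vec2`/`Vec3` types, the `AddTiming` class, and the iteration count are all hypothetical stand-ins, and a plain `Stopwatch` loop is far less rigorous than a dedicated benchmarking harness.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-ins for the benchmarked structs (names are assumptions).
public struct Vec2
{
    public float X, Y;
    public Vec2(float x, float y) { X = x; Y = y; }
    public static Vec2 operator +(Vec2 a, Vec2 b) => new Vec2(a.X + b.X, a.Y + b.Y);
}

public struct Vec3
{
    public float X, Y, Z;
    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }
    public static Vec3 operator +(Vec3 a, Vec3 b) => new Vec3(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
}

public static class AddTiming
{
    // Repeatedly adds a fixed step to an accumulator using the + operator.
    public static Vec2 SumVec2(int iterations)
    {
        var sum = new Vec2(0f, 0f);
        var step = new Vec2(1f, 2f);
        for (int i = 0; i < iterations; i++) sum += step;
        return sum;
    }

    public static Vec3 SumVec3(int iterations)
    {
        var sum = new Vec3(0f, 0f, 0f);
        var step = new Vec3(1f, 2f, 3f);
        for (int i = 0; i < iterations; i++) sum += step;
        return sum;
    }

    public static void Main()
    {
        const int iterations = 100_000_000; // arbitrary count chosen for illustration

        // Call each method once before timing so JIT compilation is not measured.
        SumVec2(1000);
        SumVec3(1000);

        var sw = Stopwatch.StartNew();
        Vec2 r2 = SumVec2(iterations);
        sw.Stop();
        Console.WriteLine($"Vec2: {sw.Elapsed.TotalMilliseconds} ms (X={r2.X})");

        sw.Restart();
        Vec3 r3 = SumVec3(iterations);
        sw.Stop();
        Console.WriteLine($"Vec3: {sw.Elapsed.TotalMilliseconds} ms (X={r3.X})");

        // Report process bitness, since 32-bit and 64-bit JITs may behave differently.
        Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");
    }
}
```

The results are printed so the loops cannot be discarded as dead code. Run it from a Release build without a debugger attached; otherwise the JIT optimizations being compared are not in effect.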

Upvotes: 3

Views: 136

Answers (2)

Falco Alexander

Reputation: 3332

I'll add some findings, although this is not an answer so far. BenchmarkDotNet showed me the same results as yours.

First I did some profiling in VS with instrumentation, and it leaves no doubt that the addition itself is what consumes the time and causes the big difference.

Results of the 64-bit code executed: [image: screenshot from VS profiler report]

vs. 32-bit: [image]

The IL code for these two lines is:

    // value += value2;
    IL_0059: ldloc.0
    IL_005a: ldloc.1
    IL_005b: call valuetype UserQuery/Vector2_A UserQuery/Vector2_A::op_Addition(valuetype UserQuery/Vector2_A, valuetype UserQuery/Vector2_A)
    IL_0060: stloc.0
    // value3 += value4;
    IL_0061: ldloc.2
    IL_0062: ldloc.3
    IL_0063: call valuetype UserQuery/Vector3_A UserQuery/Vector3_A::op_Addition(valuetype UserQuery/Vector3_A, valuetype UserQuery/Vector3_A)
    IL_0068: stloc.2

This is followed by the two addition operator methods. The 2D one:

.method public hidebysig specialname static 
valuetype UserQuery/Vector2_A op_Addition (
    valuetype UserQuery/Vector2_A value1,
    valuetype UserQuery/Vector2_A value2
) cil managed 
{
// Method begins at RVA 0x2100
// Code size 37 (0x25)
.maxstack 3
.locals init (
    [0] valuetype UserQuery/Vector2_A
)

// (no C# code)
IL_0000: nop
// return new Vector2_A(value1.X + value2.X, value1.Y + value2.Y);
IL_0001: ldarg.0
IL_0002: ldfld float32 UserQuery/Vector2_A::X
IL_0007: ldarg.1
IL_0008: ldfld float32 UserQuery/Vector2_A::X
IL_000d: add
IL_000e: ldarg.0
IL_000f: ldfld float32 UserQuery/Vector2_A::Y
IL_0014: ldarg.1
IL_0015: ldfld float32 UserQuery/Vector2_A::Y
IL_001a: add
IL_001b: newobj instance void UserQuery/Vector2_A::.ctor(float32, float32)
IL_0020: stloc.0
// (no C# code)
IL_0021: br.s IL_0023

IL_0023: ldloc.0
IL_0024: ret
} // end of method Vector2_A::op_Addition

And the 3D one:

.method public hidebysig specialname static 
valuetype UserQuery/Vector3_A op_Addition (
    valuetype UserQuery/Vector3_A value1,
    valuetype UserQuery/Vector3_A value2
) cil managed 
{
// Method begins at RVA 0x214c
// Code size 50 (0x32)
.maxstack 4
.locals init (
    [0] valuetype UserQuery/Vector3_A
)

// (no C# code)
IL_0000: nop
// return new Vector3_A(value1.X + value2.X, value1.Y + value2.Y, value1.Z + value2.Z);
IL_0001: ldarg.0
IL_0002: ldfld float32 UserQuery/Vector3_A::X
IL_0007: ldarg.1
IL_0008: ldfld float32 UserQuery/Vector3_A::X
IL_000d: add
IL_000e: ldarg.0
IL_000f: ldfld float32 UserQuery/Vector3_A::Y
IL_0014: ldarg.1
IL_0015: ldfld float32 UserQuery/Vector3_A::Y
IL_001a: add
IL_001b: ldarg.0
IL_001c: ldfld float32 UserQuery/Vector3_A::Z
IL_0021: ldarg.1
IL_0022: ldfld float32 UserQuery/Vector3_A::Z
IL_0027: add
IL_0028: newobj instance void UserQuery/Vector3_A::.ctor(float32, float32, float32)
IL_002d: stloc.0
// (no C# code)
IL_002e: br.s IL_0030

IL_0030: ldloc.0
IL_0031: ret
} // end of method Vector3_A::op_Addition

To be honest, the rest is pure guessing: perhaps the 3D add method has some advantage with memory/stack alignment. At the IL level, the only differences are the code size (0x32 vs. 0x25) and maxstack (4 vs. 3).

Checking the x64 assembly produced by RyuJIT is beyond my abilities so far. It may be worth pinging one of the JIT experts at MS about this.
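The C# comments embedded in the IL above show the bodies of both operators, so the two structs can be reconstructed roughly as below. The type and field names come from the IL; the field visibility, the constructor parameter names, and the `LayoutCheck` helper are my assumptions. As a first step in probing the alignment guess, the sketch also reports each struct's size via `Marshal.SizeOf<T>()`.

```csharp
using System;
using System.Runtime.InteropServices;

public struct Vector2_A
{
    public float X;
    public float Y;

    public Vector2_A(float x, float y) { X = x; Y = y; }

    // Body taken from the C# comment in the IL listing.
    public static Vector2_A operator +(Vector2_A value1, Vector2_A value2)
    {
        return new Vector2_A(value1.X + value2.X, value1.Y + value2.Y);
    }
}

public struct Vector3_A
{
    public float X;
    public float Y;
    public float Z;

    public Vector3_A(float x, float y, float z) { X = x; Y = y; Z = z; }

    // Body taken from the C# comment in the IL listing.
    public static Vector3_A operator +(Vector3_A value1, Vector3_A value2)
    {
        return new Vector3_A(value1.X + value2.X, value1.Y + value2.Y, value1.Z + value2.Z);
    }
}

// Hypothetical helper, not part of the original benchmark.
public static class LayoutCheck
{
    public static int SizeOfVector2() => Marshal.SizeOf<Vector2_A>();
    public static int SizeOfVector3() => Marshal.SizeOf<Vector3_A>();

    public static void Main()
    {
        Console.WriteLine($"Vector2_A size: {SizeOfVector2()} bytes");
        Console.WriteLine($"Vector3_A size: {SizeOfVector3()} bytes");
    }
}
```

With three `float` fields and the default sequential layout, the sizes come out as 8 and 12 bytes. An 8-byte struct fits in a single 64-bit register while a 12-byte one does not, so the two may well be passed and returned differently on x64. Whether that actually explains the timing difference would need a look at the generated assembly.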

Upvotes: 0

Kelly

Reputation: 295

After more digging, it turns out to be an issue with 64-bit RyuJIT code generation. I've filed an issue with CoreCLR, and it seems to be related or identical to some other performance issues.

Upvotes: 1
