Reputation: 233
I'm working on optimizing a 4D (128 bit) matrix-vector multiplication using ARM NEON assembly.
If I load the matrix and the vector into the NEON registers and transform it, I don't get a great performance boost, because the switch to the NEON registers costs around 20 cycles. Furthermore, I reload the matrix for each multiplication even though it has not changed.
There is enough register space to perform the transformation on several vectors at a time. This DOES increase performance.
But...
I'm wondering how fast this operation would be if I did the loop over all vertices (incrementing the pointers) inside the assembly. But I am at the very beginning with NEON assembly and therefore don't know how to do this. Can someone give me a hand with that?
What I want to achieve:
Existing C version of the loop:
void TransformVertices(ESMatrix* m, GLfloat* vertices, GLfloat* normals, int count)
{
    GLfloat* pVertex = vertices;
    int i;

    // iterate through the vertices, one at a time
    for (i = 0; i < count; i++)
    {
        Matrix4Vector4Mul((float*)m, (float*)pVertex, (float*)pVertex);
        pVertex += 4;
    }

    //LoadMatrix((const float*)m);
    //// two at a time
    //for (i = 0; i < count; i += 2)
    //{
    //    Matrix4Vector4Mul2((float*)m, (float*)pVertex, (float*)(pVertex + 4));
    //    pVertex += 8;
    //}
}
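LoadMatrix and Matrix4Vector4Mul2 are not shown here; the following is only a sketch of the shape Matrix4Vector4Mul2 might take. This sketch still reloads the matrix per pair of vectors, whereas the measured variant below keeps it loaded via the separate LoadMatrix:

void Matrix4Vector4Mul2(const float* m, float* v1, float* v2)
{
    asm volatile
    (
        "vldmia   %2, { q8-q11 }    \n\t"   // matrix columns c0..c3
        "vldmia   %0, { q0 }        \n\t"   // first vector  (lanes d0, d1)
        "vldmia   %1, { q1 }        \n\t"   // second vector (lanes d2, d3)
        "vmul.f32 q2, q8,  d0[0]    \n\t"   // out1  = c0 * v1.x
        "vmul.f32 q3, q8,  d2[0]    \n\t"   // out2  = c0 * v2.x
        "vmla.f32 q2, q9,  d0[1]    \n\t"   // out1 += c1 * v1.y
        "vmla.f32 q3, q9,  d2[1]    \n\t"   // out2 += c1 * v2.y
        "vmla.f32 q2, q10, d1[0]    \n\t"   // out1 += c2 * v1.z
        "vmla.f32 q3, q10, d3[0]    \n\t"   // out2 += c2 * v2.z
        "vmla.f32 q2, q11, d1[1]    \n\t"   // out1 += c3 * v1.w
        "vmla.f32 q3, q11, d3[1]    \n\t"   // out2 += c3 * v2.w
        "vstmia   %0, { q2 }        \n\t"   // both vectors transformed in place
        "vstmia   %1, { q3 }"
        :
        : "r" (v1), "r" (v2), "r" (m)
        : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11"
    );
}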
NEON version doing only one transformation at a time:
void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    asm volatile
    (
        "vldmia   %1, { q1-q4 }     \n\t"   // matrix
        "vldmia   %2, { q5 }        \n\t"   // input vector
        "vmul.f32 q0, q1, d10[0]    \n\t"
        "vmla.f32 q0, q2, d10[1]    \n\t"
        "vmla.f32 q0, q3, d11[0]    \n\t"
        "vmla.f32 q0, q4, d11[1]    \n\t"
        "vstmia   %0, { q0 }"
        :   // no output
        :   "r" (vOut), "r" (m), "r" (vIn)
        :   "memory", "q0", "q1", "q2", "q3", "q4", "q5"
    );
}
C version of the transformation:
void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    Vertex4D* v1 = (Vertex4D*)vIn;
    Vertex4D vOut1;
    Vertex4D* l0;
    Vertex4D* l1;
    Vertex4D* l2;
    Vertex4D* l3;

    // 4x4 matrix with members m00 - m33
    ESMatrix* m1 = (ESMatrix*)m;

    l0 = (Vertex4D*)&m1->m00;
    vOut1.x = l0->x * v1->x;
    vOut1.y = l0->y * v1->x;
    vOut1.z = l0->z * v1->x;
    vOut1.w = l0->w * v1->x;

    l1 = (Vertex4D*)&m1->m10;
    vOut1.x += l1->x * v1->y;
    vOut1.y += l1->y * v1->y;
    vOut1.z += l1->z * v1->y;
    vOut1.w += l1->w * v1->y;

    l2 = (Vertex4D*)&m1->m20;
    vOut1.x += l2->x * v1->z;
    vOut1.y += l2->y * v1->z;
    vOut1.z += l2->z * v1->z;
    vOut1.w += l2->w * v1->z;

    l3 = (Vertex4D*)&m1->m30;
    vOut1.x += l3->x * v1->w;
    vOut1.y += l3->y * v1->w;
    vOut1.z += l3->z * v1->w;
    vOut1.w += l3->w * v1->w;

    *(vOut + 0) = vOut1.x;
    *(vOut + 1) = vOut1.y;
    *(vOut + 2) = vOut1.z;
    *(vOut + 3) = vOut1.w;
}
Performance: (transform > 90,000 vertices | Android 4.0.4 | SGS II)
C version:    190 FPS
NEON version: 162 FPS ( .. slower -.- )
--- LOAD the matrix only ONCE (separate asm) and then transform two V's at a time ---
NEON version: 217 FPS ( +33 % vs. naive NEON | +14 % vs. C code )
Upvotes: 7
Views: 1932
Reputation: 6354
It's an almost one-year-old topic by now, but I think it's important to give you the "correct" answer, since something is very fishy here and no one has pointed it out so far:
You should avoid using q4-q7 if possible, since they are callee-saved and have to be preserved prior to use.
Correct me if I'm wrong on this, but if my memory isn't failing me, only d0~d3 (or d0~d7) can hold scalars. I'm really wondering why gcc tolerates d10 and d11 as scalar operands. Since it's physically impossible that way, I guess gcc is again doing something crazy with your inline assembly. Check the disassembly of your inline assembly code.
True, your inline assembly code suffers from two interlocks (2 cycles after the load and 9 cycles before the store), but it's unimaginable to me that the NEON code runs slower than the C code.
It's a really strong guess from my side that gcc is doing some heavy register transferring back and forth instead of spitting out an error message. It isn't exactly doing you a favor in this case.
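To illustrate the register point, here is a minimal sketch of the question's one-vector routine rearranged to stay off the callee-saved q4-q7 and to keep the scalar operands within d0/d1 (just one workable register assignment, not the only one):

void Matrix4Vector4Mul(const float* m, const float* vIn, float* vOut)
{
    asm volatile
    (
        "vldmia   %1, { q8-q11 }    \n\t"   // matrix in scratch registers only
        "vldmia   %2, { q0 }        \n\t"   // vector lanes land in d0/d1
        "vmul.f32 q1, q8,  d0[0]    \n\t"
        "vmla.f32 q1, q9,  d0[1]    \n\t"
        "vmla.f32 q1, q10, d1[0]    \n\t"
        "vmla.f32 q1, q11, d1[1]    \n\t"
        "vstmia   %0, { q1 }"
        :
        : "r" (vOut), "r" (m), "r" (vIn)
        : "memory", "q0", "q1", "q8", "q9", "q10", "q11"
    );
}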
Upvotes: 0
Reputation: 20027
The hand-tuned NEON version suffers from a dependency between all of the operations, while gcc is able to do out-of-order scheduling for the C version. You should be able to improve the NEON version by calculating two or more independent dependency chains in parallel.
Pointer increment (post-increment) in NEON is done with an exclamation mark. Those registers must then be listed as read-write output operands, e.g. "+r" (vOut).
vld1.32 {d0,d1}, [%2]! ; // next round %2 = %2 + 16
vst1.32 {d0},    [%3]! ; // next round %3 = %3 + 8
Another addressing mode allows post-increment by a "stride" held in another ARM register. The option is available only on some load commands (there is also a variety of interleaving options, as well as loads into chosen elements, e.g. d1[1], the upper half).
vld1.16 d0, [%2], %3 ; // increment by register %3
The counter decrement happens with the sequence
1: subs %3, %3, #1 ; // with "+r" (count) as the fourth operand
bne 1b             ; // branch back to the numbered local label
A numbered local label is used, as two "bne loop" statements targeting the same named label in one file cause an error.
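Putting those pieces together, a self-contained toy loop (a sketch - ScaleFloats is a made-up name, and count is assumed to be a positive multiple of four):

void ScaleFloats(float* data, int count, float factor)
{
    asm volatile
    (
        "vdup.32  q1, %2            \n\t"   // broadcast factor into all four lanes
        "1:                         \n\t"
        "vld1.32  { q0 }, [%0]      \n\t"   // load four floats
        "vmul.f32 q0, q0, q1        \n\t"
        "vst1.32  { q0 }, [%0]!     \n\t"   // store back in place, post-increment
        "subs     %1, %1, #4        \n\t"   // four elements per iteration
        "bne      1b"
        : "+r" (data), "+r" (count)
        : "r" (factor)
        : "memory", "cc", "q0", "q1"
    );
}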
One should be able to increase parallelism by a factor of four by calculating multiply-accumulates on whole vectors instead of single elements, transforming four vertices per loop iteration.
For that, the input vectors have to sit in q0-q3, because only d0-d15 can supply the scalar operand of vmul/vmla; the matrix can then stay resident in q8-q11 for the whole loop and be used exactly as stored, so no transpose is needed.
asm volatile (
"vldmia   %0, { q8-q11 }       \n\t" // load the whole matrix once, as stored
"vld1.32  {q0}, [%2:128]!      \n\t" // preload the first four vectors
"vld1.32  {q1}, [%2:128]!      \n\t"
"vld1.32  {q2}, [%2:128]!      \n\t"
"vld1.32  {q3}, [%2:128]!      \n\t"
"subs     %0, %0, %0           \n\t" // %0 is dead after the loads; zero it to set the Z flag
"1:                            \n\t"
"vst1.32  {q4}, [%1:128]!      \n\t" // store the previous iteration's results
"vmul.f32 q4, q8, d0[0]        \n\t" // out0  = c0 * x0
"vst1.32  {q5}, [%1:128]!      \n\t"
"vmul.f32 q5, q8, d2[0]        \n\t" // out1  = c0 * x1
"vst1.32  {q6}, [%1:128]!      \n\t"
"vmul.f32 q6, q8, d4[0]        \n\t" // out2  = c0 * x2
"vst1.32  {q7}, [%1:128]!      \n\t"
"vmul.f32 q7, q8, d6[0]        \n\t" // out3  = c0 * x3
"subeq    %1, %1, #64          \n\t" // revert the writing pointer in the 1st iteration only
"vmla.f32 q4, q9, d0[1]        \n\t" // out0 += c1 * y0, etc.
"vmla.f32 q5, q9, d2[1]        \n\t"
"vmla.f32 q6, q9, d4[1]        \n\t"
"vmla.f32 q7, q9, d6[1]        \n\t"
"subs     %3, %3, #1           \n\t" // %3 counts blocks of four vertices
"vmla.f32 q4, q10, d1[0]       \n\t" // out0 += c2 * z0, etc.
"vmla.f32 q5, q10, d3[0]       \n\t"
"vmla.f32 q6, q10, d5[0]       \n\t"
"vmla.f32 q7, q10, d7[0]       \n\t"
"vmla.f32 q4, q11, d1[1]       \n\t" // out0 += c3 * w0, etc.
"vld1.32  {q0}, [%2:128]!      \n\t" // start loading the next vectors
"vmla.f32 q5, q11, d3[1]       \n\t" // while the arithmetic finishes
"vld1.32  {q1}, [%2:128]!      \n\t" // (reads one block past the end
"vmla.f32 q6, q11, d5[1]       \n\t" //  on the last iteration)
"vld1.32  {q2}, [%2:128]!      \n\t"
"vmla.f32 q7, q11, d7[1]       \n\t"
"vld1.32  {q3}, [%2:128]!      \n\t"
"bne      1b                   \n\t"
"vst1.32  {q4,q5}, [%1:128]!   \n\t" // write the last results after the loop
"vst1.32  {q6,q7}, [%1:128]!   \n\t"
: "+r" (m), "+r" (vOut), "+r" (vIn), "+r" (count)
:
: "memory", "cc", // marking just q0 etc. isn't enough for some gcc versions
  "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
  "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15",
  "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"
);
Read and write in 128-bit aligned blocks (and make sure the data pointer itself is 16-byte aligned).
There is a malloc with alignment (memalign), or just adjust manually: ptr = ((uintptr_t)ptr + 15) & ~15.
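For example (a sketch; memalign is available in Android's bionic libc, and the manual variant must keep the original pointer around for free()):

#include <malloc.h>   /* memalign */
#include <stdint.h>   /* uintptr_t */

/* 16-byte-aligned storage for count 4-float vertices */
float* AllocVertices(int count)
{
    return (float*) memalign(16, count * 4 * sizeof(float));
}

/* manual variant: round an arbitrary pointer up to the next 16-byte boundary */
float* AlignUp16(void* raw)
{
    return (float*)(((uintptr_t) raw + 15) & ~(uintptr_t) 15);
}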
Just as there is a post-loop block writing the results, one can write a similar pre-loop block that skips the first write of nonsense to vOut (that could also be overcome by a conditional write). Unfortunately, one can only write 64-bit registers conditionally.
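For reference, such a conditional write has to go through VFP stores, which accept a condition code but move at most 64 bits at a time (d8/d9 being the two halves of q4):

vstreq d8, [%1]     ; // store the low half of q4 only when Z is set
vstreq d9, [%1, #8] ; // VFP stores are conditional, but at most 64 bits wide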
Upvotes: 0
Reputation: 28087
Did you try playing with compiler flags?
-mcpu=cortex-a9 -mtune=cortex-a9 -mfloat-abi=softfp -mfpu=neon -O3
does a pretty good job for me in this case (gcc 4.4.3, distributed with Android NDK 8b). Try to keep the source code tight by declaring internal functions static and inline, and by moving the matrix (the m[X][0] stuff) into static global variables, or just merge Matrix4Vector4Mul into the loop and make the matrix local variables instead of passing it to a function over and over - gcc doesn't get smart there otherwise.
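The restructuring described above might look roughly like this (a sketch; the unused normals parameter is dropped for brevity, and the member names follow the question's ESMatrix):

static void TransformVertices(ESMatrix* m, GLfloat* v, int count)
{
    // copy the matrix into locals once, so gcc can keep it in registers
    // across the whole loop instead of reloading it for every vertex
    const GLfloat c0x = m->m00, c0y = m->m01, c0z = m->m02, c0w = m->m03;
    const GLfloat c1x = m->m10, c1y = m->m11, c1z = m->m12, c1w = m->m13;
    const GLfloat c2x = m->m20, c2y = m->m21, c2z = m->m22, c2w = m->m23;
    const GLfloat c3x = m->m30, c3y = m->m31, c3z = m->m32, c3w = m->m33;
    int i;

    for (i = 0; i < count; i++, v += 4)
    {
        const GLfloat x = v[0], y = v[1], z = v[2], w = v[3];
        v[0] = c0x * x + c1x * y + c2x * z + c3x * w;
        v[1] = c0y * x + c1y * y + c2y * z + c3y * w;
        v[2] = c0z * x + c1z * y + c2z * z + c3z * w;
        v[3] = c0w * x + c1w * y + c2w * z + c3w * w;
    }
}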
When I do this, I get the following for the main loop:
a4: ed567a03 vldr s15, [r6, #-12]
a8: ee276aa0 vmul.f32 s12, s15, s1
ac: ee676aa8 vmul.f32 s13, s15, s17
b0: ed564a04 vldr s9, [r6, #-16]
b4: ee277a88 vmul.f32 s14, s15, s16
b8: ed165a02 vldr s10, [r6, #-8]
bc: ee677a80 vmul.f32 s15, s15, s0
c0: ed565a01 vldr s11, [r6, #-4]
c4: e2833001 add r3, r3, #1
c8: ee046a89 vmla.f32 s12, s9, s18
cc: e1530004 cmp r3, r4
d0: ee446aaa vmla.f32 s13, s9, s21
d4: ee047a8a vmla.f32 s14, s9, s20
d8: ee447aa9 vmla.f32 s15, s9, s19
dc: ee056a22 vmla.f32 s12, s10, s5
e0: ee456a01 vmla.f32 s13, s10, s2
e4: ee057a21 vmla.f32 s14, s10, s3
e8: ee457a02 vmla.f32 s15, s10, s4
ec: ee056a8b vmla.f32 s12, s11, s22
f0: ee456a83 vmla.f32 s13, s11, s6
f4: ee057aa3 vmla.f32 s14, s11, s7
f8: ee457a84 vmla.f32 s15, s11, s8
fc: ed066a01 vstr s12, [r6, #-4]
100: ed466a04 vstr s13, [r6, #-16]
104: ed067a03 vstr s14, [r6, #-12]
108: ed467a02 vstr s15, [r6, #-8]
10c: e2866010 add r6, r6, #16
110: 1affffe3 bne a4 <TransformVertices+0xa4>
That's 4 loads, 4 multiplies, 12 multiply-accumulates and 4 stores, which matches what you are doing in Matrix4Vector4Mul.
If you are still not satisfied with the compiler-generated code, pass the compiler '-S' to get the assembly output and use that as a starting point for further improvement, instead of starting from scratch.
You should also check that vertices is cache-line aligned (32 bytes for Cortex-A9) to get a nice data flow.
For vectorization there are gcc options like -ftree-vectorizer-verbose=9 that print information about what was vectorized. Also search the gcc documentation for this option to see how you can direct gcc, or what you need to modify to get your multiplications vectorized. This might sound like a lot to dig into, but in the long run it will be more fruitful for you than 'hand vectorizing'.
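As a starting point for the vectorizer, restrict-qualified pointers are usually the first thing it needs (a sketch; whether gcc 4.4 actually vectorizes this depends on flags and version, so check the -ftree-vectorizer-verbose output):

// build with e.g. -O3 -mfpu=neon -mfloat-abi=softfp -ftree-vectorizer-verbose=9
// __restrict promises gcc that the matrix and vertex arrays do not alias
void TransformVerticesC(const float* __restrict m, float* __restrict v, int count)
{
    int i;
    for (i = 0; i < count * 4; i += 4)
    {
        const float x = v[i], y = v[i + 1], z = v[i + 2], w = v[i + 3];
        v[i]     = m[0] * x + m[4] * y + m[8]  * z + m[12] * w;
        v[i + 1] = m[1] * x + m[5] * y + m[9]  * z + m[13] * w;
        v[i + 2] = m[2] * x + m[6] * y + m[10] * z + m[14] * w;
        v[i + 3] = m[3] * x + m[7] * y + m[11] * z + m[15] * w;
    }
}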
Upvotes: 1