Reputation: 117
The instruction vld1q_u16 seems to make my ARM NEON code about 3x slower when the address it post-increments is branched on. At least, that is my current suspicion.
You can check out some runnable example C code here. The switch between the fast and slow code is the define UNREASONABLY_SLOW_CODE. I mainly wrote this test to eliminate sources of user error. The relevant code is generated by clang.
This is the slow loop. Notice how x17 is incremented by 16 by the vector load (ldr) instruction. I've marked the control-flow instructions with -->s:
.LBB0_4: // Parent Loop BB0_3 Depth=1
--> ldr q17, [x17], #16
--> cmp x17, x16
mov v18.16b, v17.16b
sli v17.8h, v17.8h, #5
sri v18.8h, v18.8h, #5
shrn v0.8b, v17.8h, #2
sri v17.8h, v17.8h, #6
shrn v2.8b, v18.8h, #8
shrn v1.8b, v17.8h, #8
st4 { v0.8b, v1.8b, v2.8b, v3.8b }, [x18], #32
--> b.lo .LBB0_4
In the fast loop, control flow is determined by a separate register, x14, dedicated solely to handling it. Think of the i counter from a canonical C for loop:
.LBB0_5: // Parent Loop BB0_3 Depth=1
ldr q17, [x15], #16
--> add x14, x14, #8
--> cmp x14, x11
mov v18.16b, v17.16b
sli v17.8h, v17.8h, #5
sri v18.8h, v18.8h, #5
shrn v0.8b, v17.8h, #2
sri v17.8h, v17.8h, #6
shrn v2.8b, v18.8h, #8
shrn v1.8b, v17.8h, #8
st4 { v0.8b, v1.8b, v2.8b, v3.8b }, [x16], #32
--> b.lt .LBB0_5
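The two loop shapes roughly correspond to these C patterns (a minimal sketch; `convert` is a hypothetical stand-in for the real NEON 565-to-8888 unpack shown in the assembly):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real per-element conversion. */
static uint32_t convert(uint16_t px) { return (uint32_t)px * 0x10001u; }

/* "Slow" shape: the end-of-buffer comparison is done on the same
 * pointer that the load post-increments, so the branch condition
 * depends directly on the load's address register (x17 above). */
void loop_pointer(const uint16_t *src, uint32_t *dst, size_t n) {
    const uint16_t *end = src + n;
    while (src < end)
        *dst++ = convert(*src++);
}

/* "Fast" shape: a dedicated index counter drives control flow,
 * like the x14 register in the second listing. */
void loop_index(const uint16_t *src, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = convert(src[i]);
}
```

Both produce identical results; only the control-flow dependency chain differs.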
I'm mostly just wondering why it's slower, but if anyone could further optimize the function I'm working on, I'd be curious to see what you would come up with.
Edit: I tested on an AWS Graviton2 and an Apple M2, and on both the slow code is about 3x slower.
Upvotes: 0
Views: 85
Reputation: 117
Just kidding, it was user error: I was transforming all the data with the SIMD loop first, and then again with the scalar loop afterward. The loop with one fewer instruction is still slightly slower, about 1-2%, but that is reasonable.
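The mistake was of this shape (a minimal sketch with illustrative names; the real code uses the NEON loop above rather than a plain scalar loop as the "SIMD" part):

```c
#include <stddef.h>
#include <stdint.h>

/* BUG sketch: the cleanup loop restarts at 0 instead of where the
 * SIMD loop stopped, so most elements are transformed twice. */
void transform_buggy(uint16_t *data, size_t n) {
    size_t simd_n = n & ~(size_t)7;      /* 8 elements per SIMD step */
    for (size_t i = 0; i < simd_n; i++)  /* stands in for the NEON loop */
        data[i] += 1;
    for (size_t i = 0; i < n; i++)       /* BUG: should start at simd_n */
        data[i] += 1;
}

/* Fixed: the scalar loop only handles the remaining tail. */
void transform_fixed(uint16_t *data, size_t n) {
    size_t simd_n = n & ~(size_t)7;
    for (size_t i = 0; i < simd_n; i++)
        data[i] += 1;
    for (size_t i = simd_n; i < n; i++)  /* scalar tail only */
        data[i] += 1;
}
```

The buggy version does nearly double the work, which easily accounts for a large measured slowdown being blamed on the wrong instruction.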
Upvotes: 2