J. Rehbein
J. Rehbein

Reputation: 117

Is it slow to branch on a memory offset incremented by vld1q in ARM NEON?

The instruction vld1q_u16 seems to make my ARM NEON code about 3x slower if the address it increments is branched on. At least that is my current suspicion.

You can check out some example runnable C code here. The switch between the fast and slow code is on the define UNREASONABLY_SLOW_CODE. I mainly just wrote this test to eliminate sources of user error. The relevant code is generated by clang

The NEON Kernel

The Slow Loop

This is the slow loop notice how x17 is incremented by 16 by the vector load ldr instruction. I've marked the control flow instructions with -->s

.LBB0_4:                                //   Parent Loop BB0_3 Depth=1
-->     ldr     q17, [x17], #16
-->     cmp     x17, x16
        mov     v18.16b, v17.16b
        sli     v17.8h, v17.8h, #5
        sri     v18.8h, v18.8h, #5
        shrn    v0.8b, v17.8h, #2
        sri     v17.8h, v17.8h, #6
        shrn    v2.8b, v18.8h, #8
        shrn    v1.8b, v17.8h, #8
        st4     { v0.8b, v1.8b, v2.8b, v3.8b }, [x18], #32
-->     b.lo    .LBB0_4

The Fast Loop

The control flow in this loop is determined by a separate register x14 that is dedicated to solely to handling control flow. Think an i counter from a canonical C for loop.

.LBB0_5:                                //   Parent Loop BB0_3 Depth=1
        ldr     q17, [x15], #16
-->     add     x14, x14, #8
-->     cmp     x14, x11
        mov     v18.16b, v17.16b
        sli     v17.8h, v17.8h, #5
        sri     v18.8h, v18.8h, #5
        shrn    v0.8b, v17.8h, #2
        sri     v17.8h, v17.8h, #6
        shrn    v2.8b, v18.8h, #8
        shrn    v1.8b, v17.8h, #8
        st4     { v0.8b, v1.8b, v2.8b, v3.8b }, [x16], #32
-->     b.lt    .LBB0_5

I'm mostly just wondering why it's slower, but if anyone could optimize the function I'm working on further I'd be curious to see what you would come up with.

Edit: I tested on an AWS Graviton2 and Apple M2 and on both the slow code seems 3x slower

Upvotes: 0

Views: 85

Answers (1)

J. Rehbein
J. Rehbein

Reputation: 117

Just kidding it was user error. I was transforming all the data first with the SIMD loop then the scalar loop after. The loop with 1 less instruction is still slightly slower. It seems to be 1-2% slower, but that is reasonable.

Upvotes: 2

Related Questions