jcdmelo

Reputation: 1

ARM NEON: why is vector code slower than scalar?

I am working with assembly for ARM NEON, and I came across a sequence of unrolled instructions that actually takes double the execution time of an equivalent scalar loop. I am actually working with intrinsics; the code shown below is what the compiler produces.

In this sequence, eight results are obtained.

    ...
    vmov      r10, r11, d18      @ move two addresses from NEON to ARM registers
    vld1.32   {d21}, [r10]       @ load two 32-bit elements from the first address
    vadd.i32  d21, d21, d20      @ add the increment vector in d20
    vst1.32   {d21}, [r10]       @ store the result back
    vld1.32   {d21}, [r11]       @ repeat for the second address
    vadd.i32  d21, d21, d20
    vst1.32   {d21}, [r11]
    vmov      r10, r11, d19      @ next pair of addresses
    ...
    vmov      r10, r11, d16
    ...
    vmov      r10, r11, d17
    ...

The scalar loop is composed of six instructions with one result per iteration:

    loop:
        ldr.w   r1, [r2], #4           @ load the next index, post-increment the pointer
        ldr.w   r3, [r4, r1, lsl #2]   @ load the counter at base + index*4
        adds    r3, #1                 @ increment it
        str.w   r3, [r4, r1, lsl #2]   @ store it back
        cmp     r0, r2                 @ reached the end of the index array?
        bhi.n   118 <loop>
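
In C terms, this is the classic histogram inner loop. A minimal sketch of what the scalar code computes, with hypothetical names (`indices`, `indices_end` and `hist` do not appear in the original code; r2/r0 delimit the index array and r4 is the histogram base):

    #include <stdint.h>

    /* Sketch only: one index load, one counter load, add, store per element. */
    void histogram(const uint32_t *indices, const uint32_t *indices_end,
                   uint32_t *hist)
    {
        for (const uint32_t *p = indices; p < indices_end; ++p)
            hist[*p] += 1;
    }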

According to my naïve assumptions, the vector sequence spends roughly three instructions per result, whereas the scalar sequence spends six. That should lead to a 2x decrease in processing time, but instead I got a 2x increase. Even if I unroll the vector sequence four or eight times more, it is still twice as slow as the scalar loop.

I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:

  1. data is moved frequently between ARM and NEON registers
  2. the memory access pattern may cause cache problems that hurt NEON more than ARM
  3. the load/store sequence may stall both the ARM and NEON pipelines

Since I have just started working with the NEON architecture, these guesses may be way off the real problem. So, I would appreciate any pointers that help produce effective intrinsic code for NEON.

Thanks, Julio de Melo

Tools: arm-linux-gnueabihf-gcc-8.2.1

Upvotes: 0

Views: 579

Answers (2)

Aki Suihkonen

Reputation: 20037

Speeding up histogram computation with SIMD is very difficult (and application dependent).

With a very limited number of histogram bins, one can speed up the naive version by keeping multiple accumulators that are conditionally updated based on the input data.

This loop body updates an 8-bit histogram with just 32 bins:

    auto a = vdupq_n_u8(*input++);          // broadcast next input byte to all 16 lanes
    auto a0 = vceqq_u8(a, c0to15);          // lanes matching bins 0..15 become 0xFF
    auto a1 = vceqq_u8(a, c16to31);         // lanes matching bins 16..31 become 0xFF
    hist0to15 = vsubq_u8(hist0to15, a0);    // subtracting 0xFF (i.e. -1) increments the lane
    hist16to31 = vsubq_u8(hist16to31, a1);

After at most 255 rounds of this loop (before the 8-bit lane counters can overflow), one then does a widening accumulation of the uint8x16_t registers hist0to15 and hist16to31 into a uint16x8x4_t hist0to31, with c0to15 having the content {0,1,2,3,4,...,15} and c16to31 the content {16,17,...,31}.
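
A minimal sketch of that widening flush, assuming hist0to31 is a zero-initialised uint16x8x4_t (resetting the narrow counters afterwards is my addition):

    // Flush the 8-bit lane counters into 16-bit totals, then clear them.
    hist0to31.val[0] = vaddw_u8(hist0to31.val[0], vget_low_u8(hist0to15));
    hist0to31.val[1] = vaddw_u8(hist0to31.val[1], vget_high_u8(hist0to15));
    hist0to31.val[2] = vaddw_u8(hist0to31.val[2], vget_low_u8(hist16to31));
    hist0to31.val[3] = vaddw_u8(hist0to31.val[3], vget_high_u8(hist16to31));
    hist0to15  = vdupq_n_u8(0);
    hist16to31 = vdupq_n_u8(0);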

One can (and should) interleave a few of these blocks to hide the roughly 7-cycle load latency and the 2-cycle latency between the basic arithmetic operations.
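
For example, a sketch that processes two input bytes per iteration with independent accumulator pairs (histA*/histB* are hypothetical names), so the two dependency chains can overlap in the pipeline:

    auto a = vdupq_n_u8(input[0]);   // two independent chains per iteration
    auto b = vdupq_n_u8(input[1]);
    input += 2;
    histA0 = vsubq_u8(histA0, vceqq_u8(a, c0to15));
    histB0 = vsubq_u8(histB0, vceqq_u8(b, c0to15));
    histA1 = vsubq_u8(histA1, vceqq_u8(a, c16to31));
    histB1 = vsubq_u8(histB1, vceqq_u8(b, c16to31));
    // merge histA*/histB* during the widening accumulation step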

Depending on the application (e.g. finding just the Nth percentile of a histogram), one can use this method in multiple passes, as in radix sort.

Scatter store/load instructions won't help in parallelising histogram accumulation either, unless the SIMD instruction set also has a quick method to detect and count duplicate entries, and either compact the resulting stream of (offset, increment) tuples or mask off the work done on duplicate lanes. I have seen some papers/proposals for histogram-accelerating SIMD instructions, but I am not aware of any implementation.

    input = [123 | 54 | 0 | 32 | 78 | 32 | 0 | 0]

    --> offset 123, increment = 1
    --> offset  54, increment = 1
    --> offset   0, increment = 3
    --> offset  32, increment = 2
    --> offset  78, increment = 1
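
As a scalar illustration of the tuple stream such a hypothetical conflict-detection instruction would have to produce (function and names invented for this sketch):

    #include <stdint.h>
    #include <stdio.h>

    // Emit one (offset, increment) pair per distinct value in a block,
    // in first-seen order, matching the example above.
    static void emit_tuples(const uint8_t *in, size_t n)
    {
        int count[256] = {0};
        for (size_t i = 0; i < n; ++i)
            count[in[i]]++;
        for (size_t i = 0; i < n; ++i) {
            if (count[in[i]]) {
                printf("offset %3u, increment = %d\n", in[i], count[in[i]]);
                count[in[i]] = 0;    // consume so duplicates emit only once
            }
        }
    }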

Upvotes: 2

jcdmelo

Reputation: 11

Thanks, Nate and artless, for the comments. Yes, Nate, you're right, it's just a basic histogram loop. As you both point out, vectorizing it won't help, so case closed. Thanks again.

Upvotes: 0
