Reputation: 1
I am working with assembly for ARM NEON, and I came across a sequence of unrolled instructions that actually takes double the execution time of an equivalent scalar loop. I am actually working with intrinsics, but the generated code is what is shown below.
In this sequence, eight results are obtained.
...
vmov r10, r11, d18
vld1.32 {d21}, [r10]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r10]
vld1.32 {d21}, [r11]
vadd.i32 d21, d21, d20
vst1.32 {d21}, [r11]
vmov r10, r11, d19
...
vmov r10, r11, d16
...
vmov r10, r11, d17
...
The scalar loop is composed of six instructions with one result per iteration:
loop:
ldr.w r1, [r2], #4
ldr.w r3, [r4, r1, lsl #2]
adds r3, #1
str.w r3, [r4, r1, lsl #2]
cmp r0, r2
bhi.n 118 <loop>
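For context, the scalar loop above is a plain histogram update: one register walks an array of bin indices, and each iteration loads, increments, and stores the addressed bin. A rough C++ equivalent (variable names are my own, not from the original code) would be:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical C++ equivalent of the scalar loop: `idx` plays the role of the
// pointer walking the index array (r2), `bins` is the table based at r4.
void histogram_scalar(uint32_t* bins, const uint32_t* idx, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        bins[idx[i]] += 1;   // ldr index; ldr bin; adds #1; str bin
    }
}
```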
By my naïve accounting, the vector sequence spends roughly three instructions per result, whereas the scalar sequence spends six. That should cut the processing time in half, but instead I got a 2X increase. Even if I unroll the vector sequence four or eight times further, it is still twice as slow as the scalar sequence.
I read through "DEN0018A_neon_programmers_guide" and a couple of TRMs for Cortex-A processors, and my guess is that three factors might be impacting the performance of the vector sequence:
Since I have just started working with the NEON architecture, these guesses might be way off the real problem. So, I would appreciate pointers that help produce effective intrinsic code for NEON.
Thanks, Julio de Melo
Tools: arm-linux-gnueabihf-gcc-8.2.1
Upvotes: 0
Views: 579
Reputation: 20037
Speeding up histograms with SIMD is very difficult (and application-dependent).
With a very limited number of histogram bins, one can speed up the naive version by keeping multiple accumulators that are conditionally incremented based on the input data.
This loop updates an 8-bit histogram with just 32 bins:
auto a = vdupq_n_u8(*input++);          // broadcast the next input byte to all 16 lanes
auto a0 = vceqq_u8(a, c0to15);          // 0xFF in the lane whose bin index matches, 0 elsewhere
auto a1 = vceqq_u8(a, c16to31);
hist0to15 = vsubq_u8(hist0to15, a0);    // subtracting 0xFF (i.e. -1) increments the matching bin
hist16to31 = vsubq_u8(hist16to31, a1);
After 255 rounds of calling this method, one can then do a widening accumulation of the uint8x16_t histograms hist0to15 and hist16to31 into a uint16x8x4_t hist0to31, with c0to15 having the content {0,1,2,3,4,...,15}.
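The compare-and-subtract trick above can be sketched in portable scalar C++ (no NEON intrinsics, one byte per simulated "lane", so it runs anywhere; all names are mine, not an established API):

```cpp
#include <cstdint>
#include <cstddef>
#include <array>
#include <algorithm>

// Scalar emulation of the NEON idea: a compare produces all-ones (0xFF) in a
// matching lane, and subtracting 0xFF from an 8-bit counter is the same as +1.
void histogram_32bins(const uint8_t* input, size_t n, std::array<uint32_t, 32>& hist) {
    std::array<uint8_t, 16> hist0to15{}, hist16to31{};  // 8-bit per-lane counters
    size_t processed = 0;
    while (processed < n) {
        // At most 255 rounds before an 8-bit lane could overflow.
        size_t chunk = std::min<size_t>(255, n - processed);
        for (size_t i = 0; i < chunk; ++i) {
            uint8_t a = input[processed + i];                  // vdupq_n_u8
            for (int lane = 0; lane < 16; ++lane) {
                uint8_t eq0 = (a == lane)      ? 0xFF : 0x00;  // vceqq_u8(a, c0to15)
                uint8_t eq1 = (a == lane + 16) ? 0xFF : 0x00;  // vceqq_u8(a, c16to31)
                hist0to15[lane]  -= eq0;                       // vsubq_u8: -0xFF == +1 (mod 256)
                hist16to31[lane] -= eq1;
            }
        }
        // Widening accumulation into the 32-bit histogram, then reset the lanes.
        for (int lane = 0; lane < 16; ++lane) {
            hist[lane]      += hist0to15[lane];
            hist[lane + 16] += hist16to31[lane];
            hist0to15[lane] = hist16to31[lane] = 0;
        }
        processed += chunk;
    }
}
```

The real NEON version does all 16 lane comparisons of each vcompare in one instruction; the inner `lane` loop here only exists to emulate that on scalar hardware.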
One can/should interleave a few of these blocks to mitigate the ~7 cycle latency in load instructions and the 2 cycle latency between the basic arithmetic operations.
Depending on the application (like trying to find just Nth percentile of a histogram) one can use this method with multiple passes as in radix sort.
Scatter store/load instructions won't help parallelise histogram accumulation either, unless the SIMD instruction set also has a quick method to remove/count duplicate entries (and either compact this kind of stream into (offset, increment) tuples or mask off the work done on duplicate lanes). I've seen some papers/proposals for histogram-accelerating SIMD instructions, but I'm not aware of any implementation.
input = [123 | 54 | 0 | 32 | 78 | 32 | 0 | 0 ]
--> offset 123, increment = 1
--> offset 54, increment = 1
--> offset 0, increment = 3
--> offset 32, increment = 2
--> offset 78, increment = 1
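The compaction in the example above can be sketched in scalar code (a hypothetical helper of my own; a real SIMD version would need in-register conflict detection to be worthwhile):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <utility>

// Collapse a block of bin offsets into (offset, increment) tuples so that each
// distinct bin is touched once; tuples appear in first-occurrence order.
std::vector<std::pair<uint8_t, uint32_t>> compact(const uint8_t* input, size_t n) {
    std::vector<std::pair<uint8_t, uint32_t>> out;
    for (size_t i = 0; i < n; ++i) {
        bool found = false;
        for (auto& t : out) {
            if (t.first == input[i]) { t.second += 1; found = true; break; }
        }
        if (!found) out.push_back({input[i], 1});
    }
    return out;
}
```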
Upvotes: 2
Reputation: 11
Thanks, both Nate and artless, for the comments. Yes, Nate, you're right, it's just a basic histogram loop. As you both mention, NEON won't help vectorize it, so case closed. Thanks again.
Upvotes: 0