Reputation: 5232
I am currently playing with ARM NEON and have written the following functions, one in plain C and one with NEON intrinsics, to compare their speeds. The functions compare two arrays. The parameter cb
is the number of bytes divided by 8:
inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
uint32_t sum_neon(const uint8_t *s1, const uint8_t *s2, uint32_t cb)
{
    const uint32_t *s1_cmp = (const uint32_t *)s1;
    const uint32_t *s2_cmp = (const uint32_t *)s2;
    cb *= 2;
    while (cb--)
    {
        uint32x4x2_t cmp1 = vld2q_u32(s1_cmp);
        uint32x4x2_t cmp2 = vld2q_u32(s2_cmp);
        uint32x4_t res1 = vceqq_u32(cmp1.val[0], cmp2.val[0]);
        uint32x4_t res2 = vceqq_u32(cmp1.val[1], cmp2.val[1]);
        if (!is_not_zero(res1)) return 1;
        if (!is_not_zero(res2)) return 1;
        s1_cmp += 8;
        s2_cmp += 8;
    }
    return 0;
}
uint32_t sum_c(const uint8_t *s1, const uint8_t *s2, uint32_t cb)
{
    const uint64_t *p1 = (const uint64_t *)s1;
    const uint64_t *p2 = (const uint64_t *)s2;
    uint32_t n = 0;
    while (cb--) {
        if ((p1[n  ] != p2[n  ]) ||
            (p1[n+1] != p2[n+1]) ||
            (p1[n+2] != p2[n+2]) ||
            (p1[n+3] != p2[n+3])) return 1;
        ++n;
    }
    return 0;
}
I don't understand why the C implementation is WAY faster than the NEON variant. The code is compiled on a Raspberry Pi with
-O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
as CFLAGS.
Upvotes: 0
Views: 400
Reputation: 21
Bugs notwithstanding (see below), the answer to this sort of question is ultimately hardware dependent, and depends on whether you are using arm32 NEON or arm64 NEON, since the ISAs are different. I'm going to assume we are looking at arm32 here, because you did not use the 4-lane arm64 horizontal max instruction (vmaxvq_u32), which would have saved some operations converting a uint32x4_t to a bool/uint32_t.

With arm32, we are at risk of running on in-order processors, so pipeline latency might be a big factor. It could be up to 8 cycles per instruction, and you've only unrolled by 2. Also, some designs have the vector unit running behind the scalar units by 10 cycles or so, so every time you move data from the vector unit to the scalar unit, as you do in "if (!is_not_zero(res1))...", you'll take a 10-cycle stall, or whatever the delay is. That can be a killer.

Finding out what is going on is best done with a sampler: look at where the samples land in the assembly and interpret the tea leaves. Finding a sampler that will show you assembly might be its own challenge.
Ultimately, whether you are on 32-bit or 64-bit ARM, reducing a SIMD register down to something that will fit in a condition register is expensive. ARM doesn't have instructions like Intel's PTEST or AltiVec's dot-form compares that move a result directly into the condition register. Even if there isn't a delay between the scalar unit and the vector unit, the N-instruction reduction sequence is going to kill you; you just can't do this reduction that often. You could instead OR a bunch of vectors together and only infrequently check whether any lane in the "sum" is non-zero. So, for example, the very first thing to try would be to replace:
    if (!is_not_zero(res1)) return 1;
    if (!is_not_zero(res2)) return 1;
with
    if (!is_not_zero(res1 | res2)) return 1;
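The effect of this transformation is easier to see in portable scalar C. The sketch below (illustrative names, not the OP's code) applies the same idea: OR the per-element differences into an accumulator and branch on the reduction only once per block. Note that it returns 1 when the buffers differ, which appears to be the intended semantics; a NEON version would fold the comparison masks with vorrq_u32 the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deferred-check sketch: accumulate per-word differences with OR and
 * test the accumulator once per 4-word block instead of branching on
 * every comparison. Returns 1 if the buffers differ, 0 otherwise. */
static int differs(const uint64_t *p1, const uint64_t *p2, size_t n_words)
{
    size_t i = 0;
    while (i + 4 <= n_words) {
        uint64_t acc = 0;
        /* XOR is non-zero exactly where the words differ; OR it all in. */
        acc |= p1[i]     ^ p2[i];
        acc |= p1[i + 1] ^ p2[i + 1];
        acc |= p1[i + 2] ^ p2[i + 2];
        acc |= p1[i + 3] ^ p2[i + 3];
        /* One branch per block, not one per word. */
        if (acc) return 1;
        i += 4;
    }
    /* Tail loop for lengths that are not a multiple of 4 words. */
    for (; i < n_words; ++i)
        if (p1[i] != p2[i]) return 1;
    return 0;
}
```

The same trade-off applies on the vector side: the fewer times you cross from the vector unit to a scalar branch, the less the reduction cost matters.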
Bug?: Also, I think your vector line cb *= 2 is wrong and should probably be cb /= 4 to correct for the difference in size between uint64_t and uint32x4x2_t -- each NEON iteration consumes 32 bytes, not 8. Assuming you don't crash, this error alone would inflate your NEON times by roughly 8x. On the other hand, I feel that the ++n in the scalar code is similarly in error -- should it be n += 4? -- so perhaps I don't fully understand what you are trying to accomplish. There seems to be some redundant work going on here.
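For what it's worth, under those assumptions (cb counts uint64_t words as stated in the question, n advances by 4, and the function should return 1 on any mismatch), a corrected scalar loop might look like the sketch below; sum_c_fixed is a hypothetical name, not the OP's code.

```c
#include <assert.h>
#include <stdint.h>

/* Corrected variant of sum_c under the assumption that cb is the number
 * of uint64_t words in each buffer. Each iteration consumes four words,
 * so n advances by 4 (the suspected ++n bug) and the loop bound keeps
 * all reads in range. Any trailing cb % 4 words are not compared here. */
uint32_t sum_c_fixed(const uint8_t *s1, const uint8_t *s2, uint32_t cb)
{
    const uint64_t *p1 = (const uint64_t *)s1;
    const uint64_t *p2 = (const uint64_t *)s2;
    for (uint32_t n = 0; n + 4 <= cb; n += 4) {
        if ((p1[n]     != p2[n])     ||
            (p1[n + 1] != p2[n + 1]) ||
            (p1[n + 2] != p2[n + 2]) ||
            (p1[n + 3] != p2[n + 3]))
            return 1;
    }
    return 0;
}
```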
Upvotes: 0