pilogo

Reputation: 85

Neon code not optimized

I wrote some simple Neon intrinsics code for the Android NDK.
Here is the code:

float32x4_t vec1;
float32x4_t vec2;
float32x4_t mulneon;
vec1 = vld1q_f32(&a1[0]);
vec2 = vld1q_f32(&a2[0]);
mulneon = vmulq_f32(vec1, vec2);

I expect to see some instructions like

vld1.32 {v0} ...
vld1.32 {v1} ...
vmul.f32 v0, v1, v0

But what I see is a lot of ldr and str instructions followed by a single fmul (see below). Is vld1 not supported for Android builds, or do I need to enable some other optimization?

0x7f6ae33a20 <+792>:  ldr    x8, [sp, #0x198]
0x7f6ae33a24 <+796>:  ldr    q0, [x8]
0x7f6ae33a28 <+800>:  str    q0, [sp, #0x120]
0x7f6ae33a2c <+804>:  ldr    q0, [sp, #0x120]
0x7f6ae33a30 <+808>:  str    q0, [sp, #0x110]
0x7f6ae33a34 <+812>:  ldr    q0, [sp, #0x110]
0x7f6ae33a38 <+816>:  str    q0, [sp, #0x180]
0x7f6ae33a3c <+820>:  ldr    x8, [sp, #0x1a0]
0x7f6ae33a40 <+824>:  ldr    q0, [x8]
0x7f6ae33a44 <+828>:  str    q0, [sp, #0x100]
0x7f6ae33a48 <+832>:  ldr    q0, [sp, #0x100]
0x7f6ae33a4c <+836>:  str    q0, [sp, #0xf0]
0x7f6ae33a50 <+840>:  ldr    q0, [sp, #0xf0]
0x7f6ae33a54 <+844>:  str    q0, [sp, #0x170]
0x7f6ae33a58 <+848>:  ldr    x8, [sp, #0x228]
0x7f6ae33a5c <+852>:  ldr    x10, [sp, #0x198]
0x7f6ae33a60 <+856>:  add    x8, x10, x8, lsl #2
0x7f6ae33a64 <+860>:  str    x8, [sp, #0x198]
0x7f6ae33a68 <+864>:  ldr    x8, [sp, #0x250]
0x7f6ae33a6c <+868>:  ldr    x10, [sp, #0x1a0]
0x7f6ae33a70 <+872>:  add    x8, x10, x8, lsl #2
0x7f6ae33a74 <+876>:  str    x8, [sp, #0x1a0]
0x7f6ae33a78 <+880>:  ldr    q0, [sp, #0x170]
0x7f6ae33a7c <+884>:  str    q0, [sp, #0xe0]
0x7f6ae33a80 <+888>:  ldr    x8, [sp, #0x1a0]
0x7f6ae33a84 <+892>:  ldr    q0, [sp, #0xe0]
0x7f6ae33a88 <+896>:  ldr    s1, [x8]
0x7f6ae33a8c <+900>:  mov    v2.16b, v1.16b
0x7f6ae33a90 <+904>:  ins    v0.s[3], v2.s[0]
0x7f6ae33a94 <+908>:  str    q0, [sp, #0xd0]
0x7f6ae33a98 <+912>:  ldr    q0, [sp, #0xd0]
0x7f6ae33a9c <+916>:  str    q0, [sp, #0xc0]
0x7f6ae33aa0 <+920>:  ldr    q0, [sp, #0xc0]
0x7f6ae33aa4 <+924>:  str    q0, [sp, #0x170]
0x7f6ae33aa8 <+928>:  ldr    q0, [sp, #0x180]
0x7f6ae33aac <+932>:  ldr    q2, [sp, #0x170]
0x7f6ae33ab0 <+936>:  stur   q0, [x29, #-0xa0]
0x7f6ae33ab4 <+940>:  stur   q2, [x29, #-0xb0]
0x7f6ae33ab8 <+944>:  ldur   q0, [x29, #-0xa0]
0x7f6ae33abc <+948>:  ldur   q2, [x29, #-0xb0]
0x7f6ae33ac0 <+952>:  fmul   v0.4s, v0.4s, v2.4s

Upvotes: 1

Views: 156

Answers (1)

Problems:

  • It seems you compiled in debug mode.
  • It seems that the arrays are global variables or non-static local constants.
  • The Android Studio built-in Clang (v4.9) is extremely bad at generating efficient machine code from intrinsics in the first place.

Solution:

  • Change the build type to Release.
  • Use only local variables, especially inside loops, and if the constant arrays are local, declare them static.
  • Don't use Clang for intrinsics, or better, don't use intrinsics at all.
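To illustrate the first two points, here is a minimal sketch of the multiply from the question wrapped in a function that keeps every vector in a local variable. The function name mul_f32 and its parameters are hypothetical, not taken from the question. The scalar fallback is only there so the file also builds on non-ARM hosts. In an unoptimized build each intrinsic typically compiles as a small function whose arguments and result are spilled to the stack, which would explain the ldr/str pairs in the listing. With optimization enabled (for example -O2), the loop body should reduce to two vector loads, one fmul and one store, though the exact output depends on the compiler version.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for n floats. */
void mul_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Process four floats per iteration; all vectors are locals. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmulq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

Compare the disassembly of this function in a Debug and a Release build; if the Release output still shows stack traffic around the multiply, the build type is probably not the cause.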

Upvotes: 2
