Reputation: 11
Hello, what is the difference between vfmaq_f32 and vmlaq_f32 in the NEON intrinsics, and how do they differ in running speed and accuracy?
On macOS ARM64, the following code produces consistent results:
#include <arm_neon.h>
#include <cstdio>  // printf

int main() {
    float a = 12.3839467819;
    float b = 21.437678904;
    float c = 4171.42144;
    printf("%.17f\n", a);
    printf("%.17f\n", b);
    printf("%.17f\n", c);
    // Scalar reference: a + b * c
    printf("%.17f\n", a + b * c);

    // Broadcast each scalar to all four lanes
    float32x4_t a_reg = vdupq_n_f32(a);
    float32x4_t b_reg = vdupq_n_f32(b);
    float32x4_t c_reg = vdupq_n_f32(c);

    // Fused multiply-add: a + b * c
    float32x4_t res_reg = vfmaq_f32(a_reg, b_reg, c_reg);
    float res[4] = {0.f};
    vst1q_f32(res, res_reg);
    printf("%.17f\n", res[0]);

    // Multiply-accumulate: a + b * c
    res_reg = vmlaq_f32(a_reg, b_reg, c_reg);
    vst1q_f32(res, res_reg);
    printf("%.17f\n", res[0]);

    // Explicit multiply followed by a separate add
    res_reg = vmulq_f32(b_reg, c_reg);
    res_reg = vaddq_f32(res_reg, a_reg);
    vst1q_f32(res, res_reg);
    printf("%.17f\n", res[0]);
    return 0;
}
Upvotes: -1
Views: 635
Reputation: 17502
vfmaq_f32 is defined as a single fused operation, whereas vmlaq_f32 can be implemented as a multiply followed by an accumulate.
Two explanations for why both intrinsics exist come to mind. First, the fused version (the FMLA instruction) may at some point have been an optional instruction (I don't know when, and I'm a bit too lazy to dig through really old documentation). The second possibility, which seems more likely, is that the fused version may at some point have been a bit slower.
These days, it looks like compilers pretty much compile both to the same instruction, so vmlaq_f32 is effectively an alias. Even so, you should probably still use vfmaq_f32 if you want accuracy and vmlaq_f32 if you are more interested in speed.
Upvotes: 0