Reputation: 180
I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization with gcc's O3 flag, but the function performed worse than when compiled with O2, which does not enable auto-vectorization. For some reason the assembly produced with O3 is about 1.5 times longer than the one produced with O2.
void mask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;

    ixy = xsize * ysize;
    while (ixy--)
        *(s++) &= *(m++);
}
I probably have to use the following intrinsics:
vld1q_u32 // to load 4 integers from s and m
vandq_u32 // to execute logical and between the 4 integers from s and m
vst1q_u32 // to store them back into s
However, I don't know how to do this in the most optimal way. For instance, should I increment s and m by 4 after loading, ANDing and storing? I am quite new to NEON, so I would really appreciate some help.
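Something like this is what I have in mind so far (untested, with my own helper name mask_neon, and assuming the element count is a multiple of 4), but I am not sure the pointer handling is right:
#include <arm_neon.h>
#include <stdint.h>

void mask_neon(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
    unsigned int ixy = xsize * ysize;  // assumed to be a multiple of 4
    while (ixy) {
        uint32x4_t vs = vld1q_u32(s);  // load 4 integers from s
        uint32x4_t vm = vld1q_u32(m);  // load 4 integers from m
        vs = vandq_u32(vs, vm);        // AND the two vectors
        vst1q_u32(s, vs);              // store the result back into s
        s += 4;                        // advance both pointers by 4 elements
        m += 4;
        ixy -= 4;
    }
}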
I am using gcc 4.8.1 and compiling with the following command:
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name
Thanks in advance
Upvotes: 1
Views: 641
Reputation: 8715
I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (the time is mostly spent loading and storing), it's best to load lots of registers and then process them, since that gives the data time to actually arrive. It assumes the data count is an even multiple of 16 elements.
#include <arm_neon.h>

void fmask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;
    uint32x4_t srcA, srcB, srcC, srcD;
    uint32x4_t maskA, maskB, maskC, maskD;

    ixy = xsize * ysize;
    ixy /= 16; // process 16 elements at a time
    while (ixy--)
    {
        __builtin_prefetch(&s[64]); // preload the cache
        __builtin_prefetch(&m[64]);
        srcA  = vld1q_u32(&s[0]);
        maskA = vld1q_u32(&m[0]);
        srcB  = vld1q_u32(&s[4]);
        maskB = vld1q_u32(&m[4]);
        srcC  = vld1q_u32(&s[8]);
        maskC = vld1q_u32(&m[8]);
        srcD  = vld1q_u32(&s[12]);
        maskD = vld1q_u32(&m[12]);
        srcA = vandq_u32(srcA, maskA);
        srcB = vandq_u32(srcB, maskB);
        srcC = vandq_u32(srcC, maskC);
        srcD = vandq_u32(srcD, maskD);
        vst1q_u32(&s[0],  srcA);
        vst1q_u32(&s[4],  srcB);
        vst1q_u32(&s[8],  srcC);
        vst1q_u32(&s[12], srcD);
        s += 16;
        m += 16;
    }
}
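If the count is not guaranteed to be a multiple of 16, the leftover elements (at most 15) could be handled with a plain scalar tail at the end of the function, for example:
    unsigned int rem = (xsize * ysize) & 15; // 0..15 elements left over
    while (rem--)                            // s and m already point past the vector part
        *(s++) &= *(m++);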
Upvotes: 2
Reputation: 339
I would start with the simplest routine and take it as a reference to compare future routines against.
A good rule of thumb is to compute things as soon as possible, not exactly when they are needed: an instruction may take X cycles to execute, but its result is not always ready immediately, so scheduling is important.
As an example, a simple scheduling scheme for your case would be (pseudocode):
nn = n/4                 // assuming n is a multiple of 4
LOADI_S(0)               // load and immediately afterwards increment the pointer
LOADI_M(0)               // load and immediately afterwards increment the pointer
for (k = 1; k < nn; k++) {
    AND_SM(k-1)          // inner op
    LOADI_S(k)           // load and increment after
    LOADI_M(k)           // load and increment after
    STORE_S(k-1)         // store and increment after
}
AND_SM(nn-1)
STORE_S(nn-1)            // store; no need to increment
By keeping these instructions out of the inner loop, the operations inside it no longer depend on the result of the immediately preceding operation. This scheme can be extended further to make use of the time that would otherwise be lost waiting for the previous result; a rough intrinsics version is sketched below.
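As a rough sketch (untested, with a made-up name mask_pipelined, and assuming n is a non-zero multiple of 4), the schedule above could look like this with NEON intrinsics; the AND and store of block k-1 are interleaved with the loads of block k so the loads have time to complete:
#include <arm_neon.h>

void mask_pipelined(uint32_t *s, uint32_t *m, unsigned int n)
{
    unsigned int nn = n / 4;               // number of 4-element blocks
    uint32_t *sw = s;                      // write pointer, trails the load pointer
    uint32x4_t vs = vld1q_u32(s); s += 4;  // LOADI_S(0)
    uint32x4_t vm = vld1q_u32(m); m += 4;  // LOADI_M(0)
    for (unsigned int k = 1; k < nn; k++) {
        uint32x4_t prev = vandq_u32(vs, vm); // AND_SM(k-1)
        vs = vld1q_u32(s); s += 4;           // LOADI_S(k)
        vm = vld1q_u32(m); m += 4;           // LOADI_M(k)
        vst1q_u32(sw, prev); sw += 4;        // STORE_S(k-1)
    }
    vst1q_u32(sw, vandq_u32(vs, vm));        // AND_SM(nn-1), STORE_S(nn-1)
}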
Also, since intrinsics still depend on the optimizer, check what the compiler does under different optimization options. I prefer to use inline assembly, which is not difficult for small routines and gives you more control.
Upvotes: 0