Reputation: 15406
Consider the two following pieces of code, the first is the C version :
void __attribute__((no_inline)) proj(uint8_t * line, uint16_t length)
{
uint16_t i;
int16_t tmp;
for(i=HPHD_MARGIN; i<length-HPHD_MARGIN; i++) {
tmp = line[i-3] - 4*line[i-2] + 5*line[i-1] - 5*line[i+1] + 4*line[i+2] - line[i+3];
hphd_temp[i]=ABS(tmp);
}
}
The second is the same function (except for the border) using neon intrinsics
void __attribute__((no_inline)) proj_neon(uint8_t * line, uint16_t length)
{
int i;
uint8x8_t b0b7, b8b15, p1p8,p2p9,p4p11,p5p12,p6p13, m4, m5;
uint16x8_t result;
m4 = vdup_n_u8(4);
m5 = vdup_n_u8(5);
b0b7 = vld1_u8(line);
for(i = 0; i < length - 16; i+=8) {
b8b15 = vld1_u8(line + i + 8);
p1p8 = vext_u8(b0b7,b8b15, 1);
p2p9 = vext_u8(b0b7,b8b15, 2);
p4p11 = vext_u8(b0b7,b8b15, 4);
p5p12 = vext_u8(b0b7,b8b15, 5);
p6p13 = vext_u8(b0b7,b8b15, 6);
result = vsubl_u8(b0b7, p6p13); //p[-3]
result = vmlal_u8(result, p2p9, m5); // +5 * p[-1];
result = vmlal_u8(result, p5p12, m4);// +4 * p[1];
result = vmlsl_u8(result, p1p8, m4); //-4 * p[-2];
result = vmlsl_u8(result, p4p11, m5);// -5 * p[1];
vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
b0b7 = b8b15;
}
/* todo : remaining pixel */
}
I am disappointed by the performance gain : it is around 10 - 15 %. If I look at the generated assembly :
But one loop in the neon code computes 8 times as much data as an iteration through the C loop, so a dramatic improvement should be seen.
Do you have any explanation regarding the small difference between the two version ?
Additional details : Test data is a 10 Mpix image, computation time is around 2 seconds for the C version.
CPU : ARM cortex a8
Upvotes: 2
Views: 1074
Reputation: 12515
I'm going to take a wild guess and say that caching (data) is the reason you don't see the big performance gain you are expecting. While I don't know if your chipset supports caching or at what level, if the data spans cache lines, has poor alignment, or is running in an environment where the CPU is doing other things at the same time (interrupts, threads, etc.), then that also could muddy your results.
Upvotes: 4