dirhem

Reputation: 621

How to optimize simple gaussian filter for performance?

I am trying to write an Android app which needs to calculate Gaussian and Laplacian pyramids for multiple full-resolution images. I wrote it in C++ with the NDK. The most critical part of the code is applying the Gaussian filter to the images, and I apply this filter horizontally and vertically.

The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625). Since I am working on integers, I calculate it as (1, 4, 6, 4, 1)/16:

dst[index] = (src[index-2] + src[index-1]*4 + src[index]*6 + src[index+1]*4 + src[index+2]) / 16;

I have made a few simple optimizations, however it is still working slower than expected, and I was wondering if there are any other optimization options that I am missing.

PS: I should mention that I tried to write this filter part in inline ARM assembly, however it gave 2x slower results.

//horizontal filter
for(unsigned y = 0; y < height; y++) {
    for(unsigned x = 2; x < width-2; x++) {
        int index = y*width+x;
        dst[index].r = (src[index-2].r + src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6)>>4;
        dst[index].g = (src[index-2].g + src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6)>>4;
        dst[index].b = (src[index-2].b + src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6)>>4;
    }
}
//vertical filter
for(unsigned y = 2; y < height-2; y++) {
    for(unsigned x = 0; x < width; x++) {
        int index = y*width+x;
        dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6)>>4;
        dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6)>>4;
        dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6)>>4;
    }
}

Upvotes: 1

Views: 3172

Answers (3)

Thomas Matthews

Reputation: 57739

The index multiplication can be factored out of the inner loop, since the multiplication only changes when y changes:

for (unsigned y = 0; y < height; y++)
{
    int index = y * width; // computed once per row
    for (unsigned int x = 2; x < width - 2; x++)
    {
        // use index + x instead of recomputing y*width+x
    }
}

You may gain some speed by loading variables before you use them, which helps the processor fetch them into the cache:

for (unsigned x = ...  
{  
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index - 2].r;
    b = src[index - 1].r;
    c = src[index + 0].r; // The " + 0" is to show a pattern.
    d = src[index + 1].r;
    e = src[index + 2].r;
    dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
    // ...  

Another trick is to "cache" the values of src so that only one new value is loaded each iteration, because each value in src may be used up to 5 times.

So here is an example of these concepts:

//horizontal filter
for(unsigned y = 0; y < height; y++)
{
    int index = y*width;
    register YOUR_DATA_TYPE a, b, c, d, e;
    a = src[index + 0].r; // The " + 0" is to show a pattern.
    b = src[index + 1].r;
    c = src[index + 2].r;
    d = src[index + 3].r;
    for(unsigned x = 2; x < width-2; x++)
    {
        e = src[index + x + 2].r; // the only new load per pixel
        dest[index + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
        a = b;
        b = c;
        c = d;
        d = e;
    }
}

Upvotes: 4

Aki Suihkonen

Reputation: 20037

Some of the more obvious optimizations come from exploiting the symmetry of the kernel:

a=*src++;    b=*src++;    c=*src++;    d=*src++;    e=*src++; // init

LOOP (n/5) times:
z=(a+e)+((b+d)<<2)+c*6;  *dst++=z>>4;  // then reuse the local variables
a=*src++;
z=(b+a)+((c+e)<<2)+d*6;  *dst++=z>>4;  // registers have been read only once...
b=*src++;
z=(c+b)+((d+a)<<2)+e*6;  *dst++=z>>4;
c=*src++;
z=(d+c)+((e+b)<<2)+a*6;  *dst++=z>>4;
d=*src++;
z=(e+d)+((a+c)<<2)+b*6;  *dst++=z>>4;
e=*src++;
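Spelled out as compilable code, the rotation might look like the sketch below. It assumes a single, deinterleaved channel; the function name and signature are my own, n is taken to be a multiple of 5, and the caller is assumed to handle the two-pixel borders.

void blur_row_14641(const unsigned char *src, unsigned char *dst, int n)
{
    // src must have n+5 readable elements; writes n output pixels
    // (centers 2 .. n+1), leaving the borders to the caller.
    int a = *src++, b = *src++, c = *src++, d = *src++, e = *src++;
    for (int i = 0; i < n / 5; i++) {
        *dst++ = ((a + e) + ((b + d) << 2) + c * 6) >> 4;  a = *src++;
        *dst++ = ((b + a) + ((c + e) << 2) + d * 6) >> 4;  b = *src++;
        *dst++ = ((c + b) + ((d + a) << 2) + e * 6) >> 4;  c = *src++;
        *dst++ = ((d + c) + ((e + b) << 2) + a * 6) >> 4;  d = *src++;
        *dst++ = ((e + d) + ((a + c) << 2) + b * 6) >> 4;  e = *src++;
    }
}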

The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels into a single 32-bit integer (or four channels into a 64-bit integer); it's the poor man's SIMD.

a=  0x[0011][0034]  <-- split into two 16-bit lanes
b=  0x[0031][008a]
-------------------
sum    0042  00be
>>4    0004  200b   <-- upper lane leaks into lower; mask it off
mask   00ff  00ff
-------------------
       0004  000b   <-- result

(The simulated SIMD above shows one addition followed by a shift by 4.)
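As a quick sanity check, the diagram can be reproduced with a few lines of C, using the hypothetical values from above:

#include <stdio.h>

int main(void)
{
    unsigned int a = (0x0011u << 16) | 0x0034u;  // two channels in one word
    unsigned int b = (0x0031u << 16) | 0x008au;

    unsigned int sum = a + b;                    // one add handles both lanes
    unsigned int r = (sum >> 4) & 0x00ff00ffu;   // shift, then mask the leakage

    printf("%04x %04x\n", r >> 16, r & 0xffffu); // prints "0004 000b"
    return 0;
}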

Here's a kernel that calculates 3 RGB operations in parallel (easy to modify for 6 RGB operations on 64-bit architectures...):

#define MASK (255+(255<<10)+(255<<20))
#define KERNEL(a,b,c,d,e) { \
 a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }

void calc_5_rgbs(unsigned int *DATA)
{
   register unsigned int a = DATA[0], b=DATA[1], c=DATA[2], d=DATA[3], e=DATA[4];
   KERNEL(a,b,c,d,e);
   KERNEL(b,c,d,e,a);
   KERNEL(c,d,e,a,b);
   KERNEL(d,e,a,b,c);
   KERNEL(e,a,b,c,d);
}
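For reference, each KERNEL step computes ((a + e + 2c) >> 2), masks, adds b + c + d and shifts by 2 again; since nested floor divisions compose, that equals (a + 4b + 6c + 4d + e) >> 4 per lane, i.e. exactly the integer filter from the question.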

Works best on ARM and on 64-bit IA with 16 registers... It needs heavy assembler optimization to overcome the register shortage on 32-bit IA (e.g. using ebp as a GPR). And just because of that it's an in-place algorithm...

There are just 2 guard bits between every 8 bits of data, which is just enough (the largest intermediate sum per lane is 255*4 = 1020 < 1024) to get exactly the same result as in the plain integer calculation.

And BTW: it's faster to just run through the array byte by byte than by r, g, b elements, since the filter treats every channel the same:

unsigned char *s = (unsigned char *) source_array;
unsigned char *d = (unsigned char *) dest_array;
// 3 bytes per pixel, so the horizontal neighbours are 3, 6, 9 and 12 bytes away
for (int j = 0; j < 3*N; j++)
    d[j] = (s[j] + s[j+12] + s[j+6]*6 + (s[j+3] + s[j+9])*4) >> 4;

Upvotes: 1

paddy

Reputation: 63481

I'm not sure how your compiler would optimize all this, but I tend to work with pointers. Assuming your struct is 3 bytes... you can start with pointers in the right places (the edge of the filter for the source, and the destination for the target) and just move them through using constant array offsets. I've also put in an optional OpenMP directive on the outer loop, as this can also improve things.

#pragma omp parallel for
for(unsigned y = 0; y < height;  y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex+2];
    const unsigned char * spos = (const unsigned char*)&src[rowindex];
    const unsigned char * end = (const unsigned char*)&src[rowindex+width-4];

    // 3 bytes per pixel: the horizontal neighbours sit 3, 6, 9 and 12 bytes away
    for( ; spos != end;  spos++, dpos++) {
        *dpos = (spos[0] + spos[12] + ((spos[3] + spos[9])<<2) + spos[6]*6) >> 4;
    }
}

Similarly for the vertical loop.

const int scanwidth = width * 3;
const int row1 = scanwidth;
const int row2 = row1+scanwidth;
const int row3 = row2+scanwidth;
const int row4 = row3+scanwidth;

#pragma omp parallel for
for(unsigned y = 2;  y < height-2;  y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex];
    const unsigned char * spos = (const unsigned char*)&src[rowindex] - row2; // start two rows up
    const unsigned char * end = spos + scanwidth;

    for( ; spos != end;  spos++, dpos++) {
        *dpos = (spos[0] + spos[row4] + ((spos[row1] + spos[row3])<<2) + spos[row2]*6) >> 4;
    }
}

This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference. I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP directive definitely will if you have a multicore machine, and the pointer arithmetic might.

I like the comment about using SSE for these operations.
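Since the target is ARM, NEON intrinsics are probably worth a try before giving up on SIMD. Here is a rough sketch, not tested code, for a single deinterleaved channel; the function name and layout are my own assumptions:

#include <arm_neon.h>

void blur_row_neon(const uint8_t *src, uint8_t *dst, int width)
{
    // 8 output pixels per iteration; sums fit in 16 bits (max 16*255)
    for (int x = 2; x + 8 <= width - 2; x += 8) {
        uint16x8_t m2 = vmovl_u8(vld1_u8(src + x - 2));
        uint16x8_t m1 = vmovl_u8(vld1_u8(src + x - 1));
        uint16x8_t c  = vmovl_u8(vld1_u8(src + x));
        uint16x8_t p1 = vmovl_u8(vld1_u8(src + x + 1));
        uint16x8_t p2 = vmovl_u8(vld1_u8(src + x + 2));

        uint16x8_t sum = vaddq_u16(m2, p2);
        sum = vaddq_u16(sum, vshlq_n_u16(vaddq_u16(m1, p1), 2));
        sum = vmlaq_n_u16(sum, c, 6);

        vst1_u8(dst + x, vshrn_n_u16(sum, 4)); // narrowing >>4 back to bytes
    }
    // remaining pixels and the 2-pixel borders go through the scalar path
}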

Upvotes: 2
