Reputation: 19

image proccessing further optimization

I'm new to optimization and was given a task to optimize a function that processes an image as much as possible. it takes an image, blurs it and then saves the blurred image, and then continues and sharpens the image, and saves also the sharpened image.

Here is my code:

typedef struct {
   unsigned char red;
   unsigned char green;
   unsigned char blue;
} pixel;

// I delete the other struct because we can do the same operations with     use of only addresses

//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))


// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName,    char* sharpRsltImgName) {
    // use variable from type 'register int' is much more efficient from 'int'
    register int i,j, ii, jj, sum_red, sum_green, sum_blue; 
    //using local variable is much more efficient than using pointer to   pixels from the original image,and updat its value in each iteration
    pixel current_pixel , p;

    //dst will point on the first pixel in the image
    pixel* dst = (pixel*)image->data;

    int squareN = n*n;
    //instead of multiply by 3 - I used shift 
    register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
    pixel* src = malloc(sizeToAllocate);

    register int index;

    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);

    ///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////


    /**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable.
*/

    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    index = calculateIndex(1, 1, n);
    for (i = 1 ; i < n - 1; ++i) {
        for (j =  1 ; j < n - 1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            for(ii = i-1; ii <= i+1; ++ii) {
                for(jj =j-1; jj <= j+1; ++jj) {
                    //take care of the [ii,jj] pixel in the matrix
                    //calculate the adrees of the current pixel
                    pixel p = src[calculateIndex(ii, jj, n)];       
                    //sum the colors' values of the neighbors of the current pixel
                    sum_red += p.red;
                    sum_green +=  p.green;
                    sum_blue += p.blue;
                }
            }
            //calculate the avarage of the colors' values around the current pixel - as written in the instructions
            sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient

            current_pixel.red = (unsigned char)sum_red;
            current_pixel.green = (unsigned char)sum_green;
            current_pixel.blue = (unsigned char)sum_blue;
            dst[index++] = current_pixel;
        }
    }
    // write result image to file
    writeBMP(image, srcImgpName, blurRsltImgName);
    
    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);


    ///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////



    /** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/

    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    for (i = 1 ; i < n-1; ++i) {
        for (j =  1 ; j < n-1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            // Do central pixel first
            p=src[calculateIndex(i,j,n)];
            sum_red   = 10*p.red;
            sum_green = 10*p.green;
            sum_blue  = 10*p.blue;

            for(ii =i-1; ii <= i + 1; ++ii) {
                for(jj = j-1; jj <= j + 1; ++jj) {
                    p = src[calculateIndex(ii, jj, n)];
                    //operate according to the instructions
                    sum_red -= p.red;
                    sum_green -= p.green;
                    sum_blue -= p.blue;
                }
            }

            //each pixel's colors' values must match the range [0,255] - I used the idea from the original code

            //the red value must be in the range [0,255]
            if (sum_red < 0) {
                sum_red = 0;
            } else if (sum_red > 255 ) {
                sum_red = 255;
            }
            current_pixel.red = (unsigned char)sum_red;


            //the green value must be in the range [0,255]
            if (sum_green < 0) {
            sum_green = 0;
            } else if (sum_green > 255 ) {
            sum_green = 255;
            }
            current_pixel.green = (unsigned char)sum_green;


            //the blue value must be in the range [0,255]
            if (sum_blue < 0) {
                sum_blue = 0;
            } else if (sum_blue > 255 ) {
                sum_blue = 255;
            }
            current_pixel.blue = (unsigned char)sum_blue;


            // put the updated pixel in [i,j] in the image
            dst[calculateIndex(i, j, n)] = current_pixel;
        }
    }

    //free the allocated space to prevent memory leaks
    free(src);

    // write result image to file
    writeBMP(image, srcImgpName, sharpRsltImgName);
}

I wanted to ask about the if statements, is there anything better that can replace those? And also more generally speaking can anyone spot an optimization mistakes here, or can offer his inputs?

Thanks a lot!

updated code:

typedef struct {
   unsigned char red;
   unsigned char green;
   unsigned char blue;
} pixel;

// I delete the other struct because we can do the same operations with use of only addresses

//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))


// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
    // use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue; 
    //using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
    pixel current_pixel , p;

    //dst will point on the first pixel in the image
    pixel* dst = (pixel*)image->data;

    int squareN = n*n;
    //instead of multiply by 3 - I used shift 
    register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
    pixel* src = malloc(sizeToAllocate);

    register int index;

    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);

    ///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////


    /**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable.
*/

    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    index = calculateIndex(1, 1, n);
    for (i = 1 ; i < n - 1; ++i) {
        for (j =  1 ; j < n - 1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            for(ii = i-1; ii <= i+1; ++ii) {
                for(jj =j-1; jj <= j+1; ++jj) {
                    //take care of the [ii,jj] pixel in the matrix
                    //calculate the adrees of the current pixel
                    pixel p = src[calculateIndex(ii, jj, n)];       
                    //sum the colors' values of the neighbors of the current pixel
                    sum_red += p.red;
                    sum_green +=  p.green;
                    sum_blue += p.blue;
                }
            }
            //calculate the avarage of the colors' values around the current pixel - as written in the instructions
            sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient

            current_pixel.red = (unsigned char)sum_red;
            current_pixel.green = (unsigned char)sum_green;
            current_pixel.blue = (unsigned char)sum_blue;
            dst[index++] = current_pixel;
        }
        index += 2;
    }
    // write result image to file
    writeBMP(image, srcImgpName, blurRsltImgName);
    
    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);


    ///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////



    /** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/

    index = calculateIndex(1,1,n);
    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    for (i = 1 ; i < n-1; ++i) {
        for (j =  1 ; j < n-1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            // Do central pixel first
            p=src[index];
            sum_red   = 10*p.red;
            sum_green = 10*p.green;
            sum_blue  = 10*p.blue;

            for(ii =i-1; ii <= i + 1; ++ii) {
                for(jj = j-1; jj <= j + 1; ++jj) {
                    p = src[calculateIndex(ii, jj, n)];
                    //operate according to the instructions
                    sum_red -= p.red;
                    sum_green -= p.green;
                    sum_blue -= p.blue;
                }
                index += 2;
            }

            //each pixel's colors' values must match the range [0,255] - I used the idea from the original code

            //the red value must be in the range [0,255]
            if (sum_red < 0) {
                sum_red = 0;
            } else if (sum_red > 255 ) {
                sum_red = 255;
            }
            current_pixel.red = (unsigned char)sum_red;


            //the green value must be in the range [0,255]
            if (sum_green < 0) {
                sum_green = 0;
            } else if (sum_green > 255 ) {
                sum_green = 255;
            }
            current_pixel.green = (unsigned char)sum_green;


            //the blue value must be in the range [0,255]
            if (sum_blue < 0) {
                sum_blue = 0;
            } else if (sum_blue > 255 ) {
                sum_blue = 255;
            }
            current_pixel.blue = (unsigned char)sum_blue;


            // put the updated pixel in [i,j] in the image
            dst[calculateIndex(i, j, n)] = current_pixel;
        }
    }

    //free the allocated space to prevent memory leaks
    free(src);

    // write result image to file
    writeBMP(image, srcImgpName, sharpRsltImgName);
}

------------------------------------------------------------------------------updated code:

typedef struct {
   unsigned char red;
   unsigned char green;
   unsigned char blue;
} pixel;

// I delete the other struct because we can do the same operations with use of only addresses

//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))


// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
    // use variable from type 'register int' is much more efficient from 'int'
    register int i,j, ii, jj, sum_red, sum_green, sum_blue; 
    //using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
    pixel current_pixel , p;

    //dst will point on the first pixel in the image
    pixel* dst = (pixel*)image->data;

    int squareN = n*n;
    //instead of multiply by 3 - I used shift 
    register int sizeToAllocate = ((squareN)<<1)+(squareN); // use    variable from type 'register int' is much more efficient from 'int'
    pixel* src = malloc(sizeToAllocate);

    register int index;

    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);

    ///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////


    /**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable.
*/

    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    index = n + 1;
    for (i = 1 ; i < n - 1; ++i) {
        for (j =  1 ; j < n - 1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            for(ii = i-1; ii <= i+1; ++ii) {
                for(jj =j-1; jj <= j+1; ++jj) {
                    //take care of the [ii,jj] pixel in the matrix
                    //calculate the adrees of the current pixel
                    pixel p = src[calculateIndex(ii, jj, n)];       
                    //sum the colors' values of the neighbors of the current pixel
                    sum_red += p.red;
                    sum_green +=  p.green;
                    sum_blue += p.blue;
                }
            }
            //calculate the avarage of the colors' values around the current pixel - as written in the instructions
            sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
            sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient

            current_pixel.red = (unsigned char)sum_red;
            current_pixel.green = (unsigned char)sum_green;
            current_pixel.blue = (unsigned char)sum_blue;
            dst[index++] = current_pixel;
        }
        index += 2;
    }
    // write result image to file
    writeBMP(image, srcImgpName, blurRsltImgName);
    
    //memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
    memcpy(src, dst, sizeToAllocate);


    ///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////



    /** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used 
*primitive variable. I operate like that : instead of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/

    index = calculateIndex(1,1,n);
    //the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
    for (i = 1 ; i < n-1; ++i) {
        for (j =  1 ; j < n-1 ; ++j) {
            // I used this variables as counters to the colors' values around a specific pixel
            sum_red = 0;
            sum_green = 0;
            sum_blue = 0;

            // Do central pixel first
            p=src[index];
            sum_red   = 10*p.red;
            sum_green = 10*p.green;
            sum_blue  = 10*p.blue;

            for(ii =i-1; ii <= i + 1; ++ii) {
                for(jj = j-1; jj <= j + 1; ++jj) {
                    p = src[calculateIndex(ii, jj, n)];
                    //operate according to the instructions
                    sum_red -= p.red;
                    sum_green -= p.green;
                    sum_blue -= p.blue;
                }
            }

            //each pixel's colors' values must match the range [0,255] - I used the idea from the original code

            //the red value must be in the range [0,255]
            if (sum_red < 0) {
                sum_red = 0;
            } else if (sum_red > 255 ) {
                sum_red = 255;
            }
            current_pixel.red = (unsigned char)sum_red;


            //the green value must be in the range [0,255]
            if (sum_green < 0) {
                sum_green = 0;
            } else if (sum_green > 255 ) {
                sum_green = 255;
            }
            current_pixel.green = (unsigned char)sum_green;


            //the blue value must be in the range [0,255]
            if (sum_blue < 0) {
                sum_blue = 0;
            } else if (sum_blue > 255 ) {
                sum_blue = 255;
            }
            current_pixel.blue = (unsigned char)sum_blue;


            // put the updated pixel in [i,j] in the image
            dst[calculateIndex(i, j, n)] = current_pixel;
        }
        index += 2;
    }

    //free the allocated space to prevent memory leaks
    free(src);

    // write result image to file
    writeBMP(image, srcImgpName, sharpRsltImgName);
}

Upvotes: 0

Answers (2)

Andrew Henle

Reputation: 1

Some general optimization guidelines:

If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so actually know where the performance bottlenecks are. Any guesses about the location of any performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.

Some specific ideas for your code:

Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.

Those would convert your four-deep nested loops into a single loop with very little index calculations needed.

Upvotes: 1

Mark Setchell

Reputation: 207670

A few ideas - untested.

You have if(ii==i && jj=j) to test for the central pixel in your sharpening loop which you do 9x for every pixel. I think it would be faster to remove that if and do exactly the same for every pixel but then make a correction, outside the loop by adding 10x the central pixel.

    // Do central pixel first
    p=src[calculateIndex(i,j,n)];
    sum_red   = 10*p.red;
    sum_green = 10*p.green;
    sum_blue  = 10*p.blue;

    for(ii =i-1; ii <= i + 1; ++ii) {
        for(jj = j-1; jj <= j + 1; ++jj) {
            p = src[calculateIndex(ii, jj, n)];
            //operate according to the instructions
            sum_red -= p.red;
            sum_green -= p.green;
            sum_blue -= p.blue;
        }
    }

Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.

index=calculateIndex(1,1,n)
for (i = 1 ; i < n - 1; ++i) {
    for (j =  1 ; j < n - 1 ; ++j) {
        ...
        dst[index++] = current_pixel;
    }
    index+=2; // skip over last pixel of this line and first pixel of next line
}

As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.

Upvotes: 0

image proccessing further optimization

Answers (2)

Related Questions