CrazyDogLady

Reputation: 83

C++ GDI+ bitmap manipulation needs speed up on byte operations

I'm using GDI+ in C++ to manipulate some Bitmap images, changing the colour and resizing them. My code is very slow at one particular point, and I'm looking for ways to speed up the line that the VS2013 Profiler highlights:

for (UINT y = 0; y < 3000; ++y)
    {
        //one scanline at a time because bitmaps are stored wrong way up
        byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
        for (UINT x = 0; x < 4000; ++x)
        {
            //get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
            byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE

            //rest of manipulation code
        }
    }

Any handy hints on how to handle this arithmetic line better? It's causing massive slowdowns in my code.

Thanks in advance!

Upvotes: 0

Views: 398

Answers (4)

Jason Newton

Reputation: 1211

In general, I've found that the usual optimum, short of writing assembly directly, comes from more direct pointer management, intermediate instructions, fewer instructions (on most CPUs they all cost about the same these days), and fewer memory fetches; for that last reason, lookup tables are the wrong answer more often than they are the right one. Vectorization, especially explicit vectorization, also helps, as does dumping the assembly of the function and confirming that the inner loop matches your expectations. Try this:

for (UINT y = 0; y < 3000; ++y)
{
    //one scanline at a time because bitmaps are stored wrong way up
    byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
    byte *p = oRow;
    byte *pend = p + 4000 * 3;
    for(; p != pend; p+=3){
        const float grey = p[0] * .114f + p[1] * .587f + p[2] * .299f;
    }
    //alternatively (use one loop or the other, not both), with an autovectorizing compiler
    //make sure vectorization and relevant instruction sets are enabled - this is effectively a dot product so the following intrinsic fits the bill:
    //https://msdn.microsoft.com/en-us/library/bb514054.aspx
    //vector types or compiler intrinsics are often more reliable too... but get compiler specific or architecture dependent respectively.
    const float w[3] = {.114f, .587f, .299f};
    for(p = oRow; p != pend; p+=3){ //restart from the beginning of the row
        float grey = 0;
        #pragma unroll //compiler specific; or use a compiler option to unroll loops
        for(int c = 0; c < 3; ++c){
            grey += w[c] * p[c];
        }
    }
}

Consider experimenting with OpenCL, targeting your CPU, to see how fast this can run with CPU-specific optimizations and multiple cores. OpenCL handles most of that for you and provides built-in vector operations and a dot product.

Upvotes: 0

Martin Schlott

Reputation: 4557

Optimization depends heavily on the compiler used and the target system, but here are some hints that may be useful. Avoid the index multiplications (x * 3):

Instead of:

byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE

use...

 //get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
 byte grey = (*oRow) * .114;
 oRow++;
 grey += (*oRow) * .587;
 oRow++;
 grey += (*oRow) * .299;
 oRow++;

You can put the pointer increment on the same line; I put it on separate lines for readability.

Also, instead of multiplying by a float you can use a lookup table, which can be faster than the arithmetic. This depends on the CPU and the table size, but you can give it a shot:

// somewhere global or as class attributes
byte tblue[256];
byte tgreen[256];
byte tred[256];

...at startup...

// Only init once at startup
// I am ignoring the warnings, you should not :-)
for(int i=0;i<256;i++)
{
  tblue[i]=i*.114;
  tgreen[i]=i*.587;
  tred[i]=i*.299;
}

...in the loop...

 //bytes are stored in blue, green, red order
 byte grey = tblue[*oRow];
 oRow++;
 grey += tgreen[*oRow];
 oRow++;
 grey += tred[*oRow];
 oRow++;

Also, 256*256*256 entries is not such a huge size, so you could build one big table. As that table would be larger than the usual CPU cache, though, I would not expect it to be much faster.

Upvotes: 1

marcinj

Reputation: 50046

You could premultiply values such as oRow[x * 3] * .114 and store them in an array. oRow[x * 3] can take 256 values, so you can easily create an array aMul1 holding the values 0 to 255, each multiplied by .114, and then use aMul1[oRow[x * 3]] to look up the product. Do the same for the other components.

Actually, you could even create such an array for whole RGB values: your pixel is 888, so you would need an array of 256*256*256 = 16777216 entries, about 16 MB at one byte each. Whether this would speed up your process, you would have to check yourself with a profiler.
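The per-component tables described above could look roughly like the sketch below. The names GreyTables, makeGreyTables and greyFromTables are made up for illustration, and the code assumes the 24-bit blue, green, red byte order used in the question. The tables hold floats so that rounding happens once per pixel instead of once per component; whether this beats the plain multiplications is something only a profiler run on your machine can tell.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; only the weights come from the question.
struct GreyTables {
    std::array<float, 256> blue{}, green{}, red{};
};

// Build the three premultiplied tables once, e.g. at startup.
inline GreyTables makeGreyTables() {
    GreyTables t;
    for (int i = 0; i < 256; ++i) {
        t.blue[i]  = i * 0.114f;
        t.green[i] = i * 0.587f;
        t.red[i]   = i * 0.299f;
    }
    return t;
}

// Grey value of one pixel stored as B, G, R bytes.
inline std::uint8_t greyFromTables(const GreyTables& t, const std::uint8_t* bgr) {
    assert(bgr != nullptr);
    const float sum = t.blue[bgr[0]] + t.green[bgr[1]] + t.red[bgr[2]];
    return static_cast<std::uint8_t>(sum + 0.5f); // round to nearest
}
```

In the original loop this would be called as greyFromTables(tables, &oRow[x * 3]), with tables built once before the outer loop.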

Upvotes: 0

cmaughan

Reputation: 2634

  • As suggested, you could do math in integer, but you could also try floats instead of doubles (.114f instead of .114), which are usually quicker and you don't need the precision.

  • Do the loop like this, instead, to save on pointer math. Creating a temporary pointer like this won't cost because the compiler will understand what you're up to.

    for(UINT x = 0; x < 12000; x+=3) { byte* pVal = &oRow[x]; .... }

  • This code is also easily threadable, and the work can be split across cores for you in various ways; here's one, using parallel for: https://msdn.microsoft.com/en-us/library/dd728073.aspx If you have 4 cores, that's close to a 4x speedup at best.

  • Also be sure to check release vs debug build - you don't know the perf until you run it in release/optimized mode.
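As a rough sketch of the "math in integer" option from the first bullet, the weights can be scaled by 2^16 and rounded, so that each pixel needs three integer multiplies, two adds and a shift. The helper name greyRow and the constant names are invented for this example, and the blue, green, red byte order is taken from the question; treat it as one possible approach, to be measured in a release build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Weights 0.114, 0.587, 0.299 scaled by 2^16 and rounded; they sum to exactly 65536.
constexpr std::uint32_t kBlue = 7471, kGreen = 38470, kRed = 19595;
static_assert(kBlue + kGreen + kRed == 65536, "weights must sum to 2^16");

// Convert one row of B, G, R pixels to grey bytes (hypothetical helper).
inline void greyRow(const std::uint8_t* bgr, std::uint8_t* out, std::size_t pixels) {
    assert(bgr != nullptr && out != nullptr);
    for (std::size_t i = 0; i < pixels; ++i, bgr += 3) {
        const std::uint32_t sum = kBlue * bgr[0] + kGreen * bgr[1] + kRed * bgr[2];
        // Adding half of 2^16 before the shift rounds to nearest.
        out[i] = static_cast<std::uint8_t>((sum + 32768u) >> 16);
    }
}
```

Because the scaled weights sum to exactly 2^16, pure white still maps to 255 and the result can never overflow a byte.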

Upvotes: 0
