Jean Azzopardi

Reputation: 2289

Speed up Matrix Addition in C#

I'd like to optimize this piece of code:

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            Byte pixelValue = image.GetPixel(x, y).B;
            this.sumOfPixelValues[x, y] += pixelValue;
            this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
        }
    }
}

This is to be used for image processing, and we're currently running this for about 200 images. We've optimized the GetPixel method to use unsafe code, and we're not using image.Width or image.Height, as those properties were adding to our runtime costs.

However, we're still stuck at a low speed. The problem is that our images are 640x480, so the body of the inner loop is executed about 640 x 480 x 200 times. I'd like to ask if there's a way to speed it up somehow, or to be convinced that it's fast enough as it is. Perhaps the answer is some fast form of matrix addition, or is matrix addition inherently an n^2 operation with no way to speed it up?

Perhaps doing array accesses via unsafe code would speed it up, but I'm not sure how to go about doing it, and whether it would be worth the time. Probably not. Thanks.

EDIT : Thank you for all your answers.

This is the GetPixel method we're using:

public Color GetPixel(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return Color.FromArgb(this.imagePtr[offsetFromOrigin + 2], this.imagePtr[offsetFromOrigin + 1], this.imagePtr[offsetFromOrigin]);
    }
}

Upvotes: 12

Views: 3903

Answers (15)

Ben Voigt

Reputation: 283624

System.Drawing.Color is a structure, which on current versions of .NET kills most optimizations. Since you're only interested in the blue component anyway, use a method that only gets the data you need.

public byte GetPixelBlue(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return this.imagePtr[offsetFromOrigin];
    }
}

Now, exchange the order of iteration of x and y:

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte pixelValue = image.GetPixelBlue(x, y);
            this.sumOfPixelValues[y, x] += pixelValue;
            this.sumOfPixelValuesSquared[y, x] += pixelValue * pixelValue;
        }
    }
}

Now you're accessing all values within a scan line sequentially, which will make much better use of the CPU cache for all three matrices involved (image.imagePtr, sumOfPixelValues, and sumOfPixelValuesSquared). [Thanks to Jon for noticing that when I fixed access to image.imagePtr, I broke the other two. Now the output array indexing is swapped to keep it optimal.]

Next, get rid of the member references. Another thread could theoretically be setting sumOfPixelValues to another array midway through, which does horrible horrible things to optimizations.

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    uint [,] sums = this.sumOfPixelValues;
    ulong [,] squares = this.sumOfPixelValuesSquared;
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte pixelValue = image.GetPixelBlue(x, y);
            sums[y, x] += pixelValue;
            squares[y, x] += (ulong)pixelValue * pixelValue;
        }
    }
}

Now the compiler can generate optimal code for moving through the two output arrays, and after inlining and optimization, the inner loop can step through the image.imagePtr array with a stride of 3 instead of recalculating the offset all the time. Now an unsafe version for good measure, doing the optimizations that I think .NET ought to be smart enough to do but probably isn't:

unsafe public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    byte* scanline = image.imagePtr;
    fixed (uint* sumsBase = &this.sumOfPixelValues[0, 0])
    fixed (ulong* squaresBase = &this.sumOfPixelValuesSquared[0, 0])
    {
        // pointers declared in a fixed statement are read-only, so copy them
        // into locals that can be advanced through the arrays
        uint* sums = sumsBase;
        ulong* squares = squaresBase;
        for (int y = 0; y < Height; y++)
        {
            byte* blue = scanline;
            for (int x = 0; x < Width; x++)
            {
                byte pixelValue = *blue;
                *sums += pixelValue;
                *squares += (ulong)pixelValue * pixelValue;
                blue += 3;
                sums++;
                squares++;
            }
            scanline += image.stride;
        }
    }
}

Upvotes: 3

Skizz

Reputation: 71060

This is a classic case of micro-optimisation failing horribly. You're not going to get anything from looking at that loop. To get real speed benefits you need to start off by looking at the big picture:

  • Can you asynchronously preload image[n+1] whilst processing image[n]?
  • Can you load just the B channel from the image? This will decrease memory bandwidth.
  • Can you load the B value and update the sumOfPixelValues(Squared) arrays directly, i.e. read the file and update instead of read file, store, read, update? Again, this decreases memory bandwidth.
  • Can you use one-dimensional arrays instead of two-dimensional? Maybe create your own array class that works either way (see the sketch after this list).
  • Perhaps you could look into using Mono and the SIMD extensions?
  • Can you process the image in chunks and assign them to idle CPUs in a multi-CPU environment?
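
For the one-dimensional array idea, a minimal sketch of such a wrapper might look like the following. The class name, element type and row-major layout are assumptions for illustration, not anything from the question:

public sealed class FlatMatrix
{
    // 2-D "matrix" backed by a flat 1-D array, laid out row by row.
    private readonly uint[] data;
    private readonly int width;

    public FlatMatrix(int width, int height)
    {
        this.width = width;
        this.data = new uint[width * height];
    }

    // Two-dimensional indexer over the flat storage.
    public uint this[int y, int x]
    {
        get { return data[y * width + x]; }
        set { data[y * width + x] = value; }
    }

    // Direct access to the flat array for hot loops that keep a running index.
    public uint[] Raw
    {
        get { return data; }
    }
}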

EDIT:

Try having specialised image accessors so you're not wasting memory bandwidth:

public byte GetBPixel (int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return this.imagePtr [offsetFromOrigin];
    }
}

or, better still:

public byte GetBPixel (int offset)
{
    unsafe
    {
        return this.imagePtr [offset];
    }
}

and use the above in a loop like:

for (int start_offset = 0, y = 0 ; y < Height ; start_offset += stride, ++y)
{
    for (int x = 0, offset = start_offset ; x < Width ; offset += 3, ++x)
    {
        byte pixel = GetBPixel (offset);
        // do stuff
    }
}

Upvotes: 0

dfa

Reputation: 116304

Matrix addition is O(n^2) in the number of additions.

However, since there are no intermediate results, you can parallelize the additions using threads (a sketch follows the list):

  1. it is easy to prove that the resulting algorithm is lock-free
  2. you can tune the optimal number of threads to use
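
One way to do that, assuming the Task Parallel Library (.NET 4 or later) is available, is to parallelise over the outer dimension so each task writes to a disjoint slice of both output arrays and no locking is needed. A sketch, reusing the method and fields from the question:

// Sketch only: each value of x is handled by at most one task at a time,
// so no two tasks ever touch the same elements of the output arrays.
System.Threading.Tasks.Parallel.For(0, Width, x =>
{
    for (int y = 0; y < Height; y++)
    {
        byte pixelValue = image.GetPixel(x, y).B;
        sumOfPixelValues[x, y] += pixelValue;
        sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
    }
});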

Upvotes: -1

Although it's a micro-optimization and thus may not add much, you might want to study how likely you are to get a zero when you do

Byte  pixelValue = image.GetPixel(x, y).B;

Clearly, if pixelValue is 0 then there's no reason to do the summations, so your routine might become

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            Byte pixelValue = image.GetPixel(x, y).B;

            if (pixelValue != 0)
            {
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
    }
}

However, the question is how often you're going to see pixelValue=0, and whether the saving on the compute-and-store will offset the cost of the test.

Upvotes: 0

anirudhgarg

Reputation: 238

Read this article, which also has some code and discusses the slowness of GetPixel.

link text

From the article, this is code that simply inverts the bits. It shows the usage of LockBits as well.

It is important to note that unsafe code will not allow you to run your code remotely.

public static bool Invert(Bitmap b)
{
    BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height),
                                   ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);

    int stride = bmData.Stride;
    System.IntPtr Scan0 = bmData.Scan0;
    unsafe
    {
        byte* p = (byte*)(void*)Scan0;
        int nOffset = stride - b.Width * 3;
        int nWidth = b.Width * 3;
        for (int y = 0; y < b.Height; ++y)
        {
            for (int x = 0; x < nWidth; ++x)
            {
                p[0] = (byte)(255 - p[0]);
                ++p;
            }
            p += nOffset;
        }
    }

    b.UnlockBits(bmData);

    return true;
}
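
Applied to the question, the same LockBits pattern could be used to accumulate the blue channel directly. This is only a rough sketch: it assumes the source is a System.Drawing.Bitmap in 24bpp format and that the sum arrays are fields indexed [x, y] as in the question.

public void PopulatePixelValueMatrices(Bitmap b)
{
    BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height),
                                   ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb);
    int stride = bmData.Stride;
    unsafe
    {
        byte* row = (byte*)(void*)bmData.Scan0;
        for (int y = 0; y < b.Height; y++)
        {
            byte* p = row;
            for (int x = 0; x < b.Width; x++)
            {
                byte blue = p[0];                      // 24bpp layout is B, G, R
                sumOfPixelValues[x, y] += blue;
                sumOfPixelValuesSquared[x, y] += blue * blue;
                p += 3;                                // advance to the next pixel
            }
            row += stride;                             // advance to the next scan line
        }
    }
    b.UnlockBits(bmData);
}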

Upvotes: 6

Bytemaster

Reputation: 176

Sometimes doing things in native C#, even unsafe calls, is just slower than using methods that have already been optimized.

No results guaranteed, but you may want to investigate the System.Windows.Media.Imaging namespace and look at your whole problem in a different way.

Upvotes: 0

Paul

Reputation: 5406

Code profiling is the best place to start.

Matrix addition is a highly parallel operation and can be sped up by parallelizing the operation with multiple threads.

I would recommend using Intel's IPP library, which contains a threaded, highly optimized API for this sort of operation. Perhaps surprisingly, it's only about $100, but it would add significant complexity to your project.

If you don't want to trouble yourself with mixed-language programming and IPP, you could try out CenterSpace's C# math libraries. The NMath API contains easy-to-use, forward-scaling matrix operations.

Paul

Upvotes: 3

Yin Zhu

Reputation: 17119

If you only do matrix addition, you might want to consider using multiple threads to take advantage of multi-core processors. Also, use a one-dimensional index instead of two.

If you want to do more complicated operations, you need to use a highly optimized math library, like NMath.Net, which uses native code rather than .NET.

Upvotes: 0

selametonur

Reputation: 132

I'm not sure if it's faster, but you could write something like:

public void PopulatePixelValueMatrices(GenericImage image, int Width, int Height)
{
    Byte pixelValue;
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            pixelValue = image.GetPixel(x, y).B;
            this.sumOfPixelValues[x, y] += pixelValue;
            this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
        }
    }
}

Upvotes: 0

Jon Skeet

Reputation: 1499840

Despite using unsafe code, GetPixel may well be the bottleneck here. Have you looked at ways of getting all the pixels in the image in one call rather than once per pixel? For instance, Bitmap.LockBits may be your friend...

On my netbook, a very simple loop iterating 640 * 480 * 200 times only takes about 100 milliseconds - so if you're finding it's all going slowly, you should take another look at the bit inside the loop.

Another optimisation you might want to look at: avoid multi-dimensional arrays. They're significantly slower than single-dimensional arrays.

In particular, you can have a single-dimensional array of size Width * Height and just keep an index:

int index = 0;
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        Byte pixelValue = image.GetPixel(x, y).B;
        this.sumOfPixelValues[index] += pixelValue;
        this.sumOfPixelValuesSquared[index] += pixelValue * pixelValue;
        index++;
    }
}

Using the same simple test harness, adding a write to a 2-D rectangular array took the total time of looping over 200 * 640 * 480 up to around 850 ms; using a 1-D array took it back down to around 340 ms - so it's somewhat significant, and currently you've got two of those per loop iteration.
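
A rough harness for reproducing that comparison could look something like the following. It is only a sketch (not the actual test code used above), and the timings will vary from machine to machine:

using System;
using System.Diagnostics;

class ArrayBenchmark
{
    const int Width = 640, Height = 480, Images = 200;

    static void Main()
    {
        long[,] rect = new long[Width, Height];
        long[] flat = new long[Width * Height];

        // Write to a 2-D rectangular array once per "pixel" per "image".
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Images; i++)
            for (int x = 0; x < Width; x++)
                for (int y = 0; y < Height; y++)
                    rect[x, y] += 1;
        Console.WriteLine("2-D rectangular: {0} ms", sw.ElapsedMilliseconds);

        // Same work against a flat 1-D array with a running index.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Images; i++)
        {
            int index = 0;
            for (int x = 0; x < Width; x++)
                for (int y = 0; y < Height; y++)
                    flat[index++] += 1;
        }
        Console.WriteLine("1-D flat:        {0} ms", sw.ElapsedMilliseconds);
    }
}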

Upvotes: 18

monksy

Reputation: 14234

About the only way to effectively speed up your matrix multiplication is to use the right algorithm. There are more efficient ways to speed up matrix multiplication. Take a look at the Strassen and Coppersmith-Winograd algorithms. It has also been noted (in the previous replies) that you can parallelize the code, which helps quite a bit.

Upvotes: 0

Yuriy Faktorovich

Reputation: 68667

The only possible way I can think of to speed it up would be to try doing some of the additions in parallel, which at your sizes might be beneficial despite the threading overhead.

Upvotes: 1

Charles Bretana

Reputation: 146429

Where are the images stored? If each is on disk, then a bit of your processing time may be spent fetching them from disk. You might examine this to see if it is an issue, and if so, rewrite the code to pre-fetch the image data so that the array processing code does not have to wait for the data (see the sketch below)...

If the overall application logic allows it (is each matrix addition independent, or dependent on the output of a previous matrix addition?), and they are independent, I'd look at executing them all on separate threads, or in parallel.
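
For the pre-fetching idea, a minimal sketch of overlapping the disk read of the next image with processing of the current one is below. LoadImage, ProcessImage and paths are hypothetical stand-ins for the real members, and it assumes the Task Parallel Library (.NET 4 or later) is available:

// Start loading image n+1 while image n is still being processed.
Task<GenericImage> pending = Task.Factory.StartNew(() => LoadImage(paths[0]));
for (int i = 0; i < paths.Count; i++)
{
    GenericImage current = pending.Result;       // wait for the prefetched image
    if (i + 1 < paths.Count)
    {
        int next = i + 1;                        // avoid capturing the loop variable
        pending = Task.Factory.StartNew(() => LoadImage(paths[next]));
    }
    ProcessImage(current);                       // CPU work overlaps the next disk read
}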

Upvotes: 1

John Saunders

Reputation: 161773

I recommend that you profile this code and find out what's taking the most time.

You may find that it's the subscripting operation, in which case you might want to change your data structures from:

long[,] sumOfPixelValues = new long[n, m];
long[,] sumOfPixelValuesSquared = new long[n, m];

to

struct Sums
{
    public long sumOfPixelValues;
    public long sumOfPixelValuesSquared;
}

Sums[,] sums = new Sums[n, m];

This would depend on what you find once you profile the code.
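
If you do go that way, the update loop would then touch a single array, and both running totals for a pixel sit next to each other in memory, so one cache line serves both additions. A sketch, assuming a field holding the Sums array declared above (the field name is illustrative):

Sums[,] sums = this.sums;   // hypothetical field of type Sums[n, m]
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        byte pixelValue = image.GetPixel(x, y).B;
        sums[x, y].sumOfPixelValues += pixelValue;
        sums[x, y].sumOfPixelValuesSquared += pixelValue * pixelValue;
    }
}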

Upvotes: 3

Henrik

Reputation: 23324

Matrix addition is of course an O(n^2) operation, but you can speed it up by using unsafe code, or at least by using jagged arrays instead of multidimensional ones.
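
For example, a jagged-array version of the accumulation might look like this (a sketch; the sizes and the GetPixel call are taken from the question, and the element types are assumptions):

// Jagged arrays: each row is its own 1-D array, which avoids the slower
// bounds/index arithmetic the CLR generates for rectangular [,] access.
int[][] sums = new int[Height][];
int[][] squares = new int[Height][];
for (int y = 0; y < Height; y++)
{
    sums[y] = new int[Width];
    squares[y] = new int[Width];
}

// Hot loop: hoist the row references once per scan line.
for (int y = 0; y < Height; y++)
{
    int[] sumRow = sums[y];
    int[] squareRow = squares[y];
    for (int x = 0; x < Width; x++)
    {
        byte pixelValue = image.GetPixel(x, y).B;
        sumRow[x] += pixelValue;
        squareRow[x] += pixelValue * pixelValue;
    }
}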

Upvotes: 0
