Faster way to combine bitmaps with custom algorithm?

I'm trying to combine two images with a custom algorithm, but in its current state it is too slow. It takes about 70 ms to combine two 512x512 images. That is acceptable, but the time grows as soon as the images get bigger.

This is the code in C# (based on "Fast work with Bitmaps in C#"):

var t = new Vec3f(0);
var u = new Vec3f(0);
var r = new Vec3f(0);

for (int i = 0; i < bData1.Height; ++i)
{
    for (int j = 0; j < bData1.Width; ++j)
    {
        byte* dataBase = bData1Scan0Ptr + i * bData1.Stride + j * m_BitsPerPixel / 8;
        byte* dataDetail = bData2Scan0Ptr + i * bData2.Stride + j * m_BitsPerPixel / 8;

        byte* dataCombined = bDataCombinedScan0Ptr + i * bDataCombined.Stride + j * m_BitsPerPixel / 8;

        t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;
        t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
        t.z = (dataBase[0] / 255.0f) * 2.0f;

        u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;
        u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
        u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;

        r = t * t.Dot(u) - u * t.z;

        //Write data to our new bitmap
        dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);
        dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
        dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);

        m_VectorImageArray[index, i, j] = t;    //base
        m_VectorImageArray[index + 1, i, j] = u;  //detail
        m_VectorImageArray[index + 2, i, j] = r;  //Combined
    }
}


Because I wanted to speed this up, I also tried writing a C++ DLL and loading it in my C# project with DllImport. I implemented this vector class, thinking it would result in a significant speed gain, but unfortunately it turned out to be only ~10 ms faster.

I want to make this faster because I’d like to update the image real-time (looping over the vectors which are stored in m_VectorImageArray).

The problem isn't related to reading/writing the bitmap but rather to the algorithm itself. I don't think I can use Parallel.For because the pixels need to stay in exactly the same order, or is this possible after all?

Upvotes: 0

I have just added another answer to the Stack Overflow post you mentioned in your question (Fast work with Bitmaps).

It tells you how to work directly with the Bitmap data in an int array or a byte array without copying anything, which should save you quite some time. You can actually save time by working with an int array instead of bytes because it takes fewer operations to read and write. All you need is some bit-shifting magic, which you will also find in the post I linked to.
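
As a rough illustration of the bit shifting mentioned above, here is a minimal sketch that assumes 32bpp ARGB pixels stored as 0xAARRGGBB in an int array. The class and method names are made up for this example and do not come from the linked post; 24bpp data, as used in the question, would not map onto an int array this directly.

```csharp
using System;

// Hypothetical helper (names are illustrative only): unpack and repack
// one 32-bit ARGB pixel laid out as 0xAARRGGBB using shifts and masks.
public static class PixelBits
{
    public static void Unpack(int argb, out byte a, out byte r, out byte g, out byte b)
    {
        // The mask removes the sign extension that >> produces for negative ints.
        a = (byte)((argb >> 24) & 0xFF);
        r = (byte)((argb >> 16) & 0xFF);
        g = (byte)((argb >> 8) & 0xFF);
        b = (byte)(argb & 0xFF);
    }

    public static int Pack(byte a, byte r, byte g, byte b)
    {
        // Bytes are promoted to int before shifting, so no casts are needed.
        return (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

One read and one write per pixel replace four byte accesses each; whether that is measurably faster for a given image is something to profile.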

Make sure to do as few type conversions as possible inside the loop as they are quite expensive.

I also agree with Fredou that you should look more carefully at the two lines:

r = t * t.Dot(u) - u * t.z;


You could try to inline the vector functions by hand to save some time. Create variables outside the loop:

float rx, ry, rz;
float tx, ty, tz;
float ux, uy, uz;
float dot, len;

And then in the loop:

dot = tx*ux + ty*uy + tz*uz;
rx = tx * dot - ux*tz;
ry = ty * dot - uy*tz;
rz = tz * dot - uz*tz;

len = (float)Math.Sqrt(rx*rx + ry*ry + rz*rz);
rx /= len;
ry /= len;
rz /= len;

If you really need performance and can afford to lose some accuracy, then replace your Math.Sqrt() with one from this page. It basically says that you can convert between int and float by making a struct with LayoutKind.Explicit like this:

[StructLayout(LayoutKind.Explicit)]
private struct FloatIntUnion
{
    [FieldOffset(0)] public float f;
    [FieldOffset(0)] public int tmp;
}

Note that this will not give you the same value in the int and the float, as that would require a calculation for the conversion. It only lets you use the same storage bits and treat them as either int or float. You can then save about half the time by calculating the square root like this:

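public static float QuickSqrt(float z){
    if (z == 0) return 0;
    FloatIntUnion u;
    u.tmp = 0;
    float xhalf = 0.5f * z;
    u.f = z;
    // Initial guess for 1/sqrt(z), made by manipulating the float's bits as an int
    u.tmp = 0x5f375a86 - (u.tmp >> 1);
    // One Newton-Raphson step to refine the guess
    u.f = u.f * (1.5f - xhalf * u.f * u.f);
    // z * (1/sqrt(z)) == sqrt(z)
    return u.f * z;
}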

The article mentions that Quake 3 used this method :)
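
To get a feel for how much accuracy is traded away, the sketch below wraps the same routine in a self-contained class and compares it with Math.Sqrt. The class name SqrtDemo and the RelativeError helper are assumptions added for this example; only QuickSqrt itself follows the code above.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper class, used only to make the comparison runnable.
public static class SqrtDemo
{
    [StructLayout(LayoutKind.Explicit)]
    private struct FloatIntUnion
    {
        [FieldOffset(0)] public float f;
        [FieldOffset(0)] public int tmp;
    }

    public static float QuickSqrt(float z)
    {
        if (z == 0) return 0;
        var u = new FloatIntUnion();
        float xhalf = 0.5f * z;
        u.f = z;
        u.tmp = 0x5f375a86 - (u.tmp >> 1);       // bit-level guess for 1/sqrt(z)
        u.f = u.f * (1.5f - xhalf * u.f * u.f);  // one Newton-Raphson step
        return u.f * z;                          // z / sqrt(z) == sqrt(z)
    }

    // Relative deviation of QuickSqrt from Math.Sqrt for a positive input.
    public static double RelativeError(float z)
    {
        double exact = Math.Sqrt(z);
        return Math.Abs(QuickSqrt(z) - exact) / exact;
    }
}
```

With a single Newton step the deviation should stay well below one percent, which is usually invisible after quantizing to 8 bits per channel. It is still worth benchmarking against Math.Sqrt before adopting it, since current CPUs compute square roots in hardware.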

Upvotes: 0


I'm not sure whether this makes sense, but what I did was simply create a dictionary of previously calculated values (plus some cleanup). The main reason is that, after some profiling, 60 to 70% of the CPU time is spent in these 2 lines:

    r = t * t.Dot(u) - u * t.z;


So here it is:

    private static unsafe void CombineImage(Bitmap image1, Bitmap image2, int index)
    {
        Dictionary<long, int> testDict = new Dictionary<long, int>(); //the magic is with this dictionary

        var combinedBitmap = new Bitmap(image1.Width, image1.Height, image1.PixelFormat);

        BitmapData bData1 = image1.LockBits(new Rectangle(0, 0, image1.Width, image1.Height), ImageLockMode.ReadOnly, image1.PixelFormat);
        BitmapData bData2 = image2.LockBits(new Rectangle(0, 0, image2.Width, image2.Height), ImageLockMode.ReadOnly, image2.PixelFormat);
        BitmapData bDataCombined = combinedBitmap.LockBits(new Rectangle(0, 0, combinedBitmap.Width, combinedBitmap.Height), ImageLockMode.WriteOnly, combinedBitmap.PixelFormat);

        byte* dataBase = (byte*)bData1.Scan0.ToPointer();
        byte* dataDetail = (byte*)bData2.Scan0.ToPointer();
        byte* dataCombined = (byte*)bDataCombined.Scan0.ToPointer();

        const int bitsPerPixel = 24;
        const int xIncr = bitsPerPixel / 8;

        var t = new Vec3f(0);
        var u = new Vec3f(0);
        var r = new Vec3f(0);

        int h = bData1.Height, w = bData1.Width;
        long key;
        int value;

        Stopwatch combineStopwatch = Stopwatch.StartNew();
        for (int y = 0; y < h; ++y)
        {
            for (int x = 0; x < w; ++x)
            {
                //real magic! The casts to long are required: an int shifted by 32 or more wraps around,
                //and a byte >= 128 shifted by 24 becomes a negative int that would smear into the upper bits.
                key = dataBase[0] | (dataBase[1] << 8) | (dataBase[2] << 16) | ((long)dataDetail[0] << 24) | ((long)dataDetail[1] << 32) | ((long)dataDetail[2] << 40);
                if (testDict.TryGetValue(key, out value))
                {
                    //Seen this pair of pixels before: reuse the cached result
                    dataCombined[0] = (byte)(value & 255);
                    dataCombined[1] = (byte)((value >> 8) & 255);
                    dataCombined[2] = (byte)((value >> 16) & 255);
                }
                else
                {
                    t.z = (dataBase[0] / 255.0f) * 2.0f;
                    t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
                    t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;

                    u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;
                    u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
                    u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;

                    r = t * t.Dot(u) - u * t.z;

                    //Write data to our new bitmap
                    dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);
                    dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
                    dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);

                    value = dataCombined[0] | (dataCombined[1] << 8) | (dataCombined[2] << 16);
                    testDict.Add(key, value);
                }

                //Note: stepping straight through like this assumes the stride has no row padding
                dataBase += xIncr;
                dataDetail += xIncr;
                dataCombined += xIncr;
            }
        }

        //Release the locked bits before the bitmaps are used again
        image1.UnlockBits(bData1);
        image2.UnlockBits(bData2);
        combinedBitmap.UnlockBits(bDataCombined);

        //combinedBitmap.Save("helloyou.png", ImageFormat.Png);

        Console.Write(combineStopwatch.ElapsedMilliseconds + "\n");
    }

Upvotes: 1


I would suggest removing the many divisions by 255 and rescaling the math so that the multiplications by 255 disappear as well. You could probably convert the whole thing to integer math, too.
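
One way to get rid of the per-pixel divisions without going all the way to integer math is a set of 256-entry lookup tables. This is a swapped-in technique rather than the rescaling described above, and the class and field names below are hypothetical. The three tables correspond to the three byte-to-float mappings used in the question's loop.

```csharp
using System;

// Hypothetical lookup tables: every possible byte value is converted once,
// so the pixel loop only performs array reads instead of divisions.
public static class NormalLut
{
    // value / 255 * 2 - 1, used for t.x, t.y and u.z (range [-1, 1])
    public static readonly float[] Signed = BuildTable(2.0f, -1.0f);

    // value / 255 * 2, used for t.z (range [0, 2])
    public static readonly float[] BaseZ = BuildTable(2.0f, 0.0f);

    // value / 255 * -2 + 1, used for u.x and u.y (range [1, -1])
    public static readonly float[] Flipped = BuildTable(-2.0f, 1.0f);

    private static float[] BuildTable(float scale, float offset)
    {
        var table = new float[256];
        for (int i = 0; i < 256; ++i)
        {
            table[i] = (i / 255.0f) * scale + offset;
        }
        return table;
    }
}
```

Inside the loop, a line such as t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f; would then become t.x = NormalLut.Signed[dataBase[2]];. The results are identical to the original expressions because the tables are built with the same formula.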

The other thing to look at is your memory access pattern or method calls for m_VectorImageArray -- are they slowing this down? Comment that out to find out. Where is the declaration of that object?

Upvotes: 1


I reduced the number of multiplications and divisions performed in every iteration, so I guess it should go a little faster. Not tested.

var t = new Vec3f(0);
var u = new Vec3f(0);
var r = new Vec3f(0);

int xIncr = m_BitsPerPixel / 8;
byte* dataBase = bData1Scan0Ptr;
byte* dataDetail = bData2Scan0Ptr;
byte* nextBase = dataBase + bData1.Stride;
byte* nextDetail = dataDetail + bData2.Stride;

byte* dataCombined = bDataCombinedScan0Ptr;
byte* nextCombined = dataCombined + bDataCombined.Stride;

for (int y = 0; y < bData1.Height; ++y)
{
    for (int x = 0; x < bData1.Width; ++x)
    {
        t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;
        t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
        t.z = (dataBase[0] / 255.0f) * 2.0f;

        u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;
        u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
        u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;

        r = t * t.Dot(u) - u * t.z;

        //Write data to our new bitmap
        dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);
        dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
        dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);

        m_VectorImageArray[index, y, x] = t;    //base
        m_VectorImageArray[index + 1, y, x] = u;  //detail
        m_VectorImageArray[index + 2, y, x] = r;  //Combined

        dataBase += xIncr;
        dataDetail += xIncr;
        dataCombined += xIncr;
    }

    //Jump to the start of the next row (the stride may include padding)
    dataBase = nextBase;
    nextBase += bData1.Stride;
    dataDetail = nextDetail;
    nextDetail += bData2.Stride;
    dataCombined = nextCombined;
    nextCombined += bDataCombined.Stride;
}
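
On the Parallel.For question: in the loop above, each output pixel depends only on the two input pixels at the same position, and each row has its own start offset, so rows can be processed in any order without changing the result. The sketch below illustrates that under some assumptions that are not in the original code: the 24bpp BGR data has already been copied into managed byte arrays that share one stride, the vector math is written out as scalars, and the result is clamped to 0..255 before the cast. The class name is hypothetical.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical row-parallel variant operating on managed arrays
// (assumed: 24bpp BGR, same stride for all three buffers).
public static class ParallelCombine
{
    public static void Combine(byte[] baseData, byte[] detailData, byte[] combined,
                               int width, int height, int stride)
    {
        // Every row writes to its own slice of 'combined', so no locking is needed.
        Parallel.For(0, height, y =>
        {
            int p = y * stride; // byte offset of the first pixel in row y
            for (int x = 0; x < width; ++x, p += 3)
            {
                float tx = baseData[p + 2] / 255.0f * 2.0f - 1.0f;
                float ty = baseData[p + 1] / 255.0f * 2.0f - 1.0f;
                float tz = baseData[p] / 255.0f * 2.0f;

                float ux = detailData[p + 2] / 255.0f * -2.0f + 1.0f;
                float uy = detailData[p + 1] / 255.0f * -2.0f + 1.0f;
                float uz = detailData[p] / 255.0f * 2.0f - 1.0f;

                // r = t * t.Dot(u) - u * t.z, written out per component
                float dot = tx * ux + ty * uy + tz * uz;
                combined[p + 2] = ToByte(tx * dot - ux * tz);
                combined[p + 1] = ToByte(ty * dot - uy * tz);
                combined[p] = ToByte(tz * dot - uz * tz);
            }
        });
    }

    // Maps [-1, 1] to [0, 255]; the clamp is an addition made for this sketch.
    private static byte ToByte(float v)
    {
        float scaled = (v * 0.5f + 0.5f) * 255.0f;
        return (byte)Math.Round(Math.Max(0f, Math.Min(255f, scaled)));
    }
}
```

A call such as ParallelCombine.Combine(baseBytes, detailBytes, outBytes, 512, 512, 512 * 3) would process a 512x512 image. Writes to m_VectorImageArray could be added inside the loop in the same way, since each (y, x) slot is touched by exactly one iteration.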


Upvotes: 3

