jdm555

Reputation: 107

C# - Fastest way of Interpolating a large byte array (RGB to RGBA)

I am uploading frames from a camera to a texture on the GPU for processing (using SharpDX). My problem is that the frames arrive as 24-bit RGB, but DX11 no longer has a 24-bit RGB texture format, only 32-bit RGBA. After every 3 bytes I need to insert another byte with the value 255 (no transparency). I've tried iterating through the byte array to add it, as in the loop below, but it's too expensive. Using GDI bitmaps to convert is also very expensive.

    int count = 0;
    for (int i = 0; i < frameDataBGRA.Length - 3; i += 4)
    {
        frameDataBGRA[i] = frameData[i - count];
        frameDataBGRA[i + 1] = frameData[(i + 1) - count];
        frameDataBGRA[i + 2] = frameData[(i + 2) - count];
        frameDataBGRA[i + 3] = 255;
        count++;
    }

Upvotes: 1

Views: 2580

Answers (2)

MaxKlaxx

Reputation: 763

@catflier: good work, but it can go a little faster. ;-)

Reproduced times on my hardware:

  • Base version: 5.48ms
  • Process_Pointer_PerChannel: 2.84ms
  • Process_Pointer_Cast: 2.16ms
  • Process_Pointer_Cast_NoAlpha: 1.60ms

My experiments:

  • FastConvert: 1.45ms
  • FastConvert4: 1.13ms (the pixel count must be divisible by 4, which is usually not a problem)

Things that have improved speed:

  • your RGB struct always reads 3 single bytes per pixel; it is faster to read a whole uint (4 bytes) and simply ignore the last byte
  • the alpha value can then be added directly with a bitwise OR on that uint
  • modern processors can often address a fixed pointer plus an offset faster than a pointer that is itself incremented
  • in x64 mode the offset variables should be 64-bit values (long instead of int), which reduces the overhead of the accesses
  • partially unrolling the inner loop gains some more performance
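
To make the first two points concrete: on a little-endian machine, the four bytes starting at a pixel (R, G, B, then the next pixel's first byte) read as one uint with that stray byte in the top 8 bits, so a single OR with 0xff000000 turns it into an opaque alpha. A small safe-code sketch of the arithmetic (PackDemo, WordAt and ToOpaque are names made up for this illustration, not part of the answer's code):

```csharp
using System;

static class PackDemo
{
    // Assemble the 4 bytes starting at 'offset' as a little-endian uint,
    // which is the value *(uint*)(ptr + offset) would read on x86/x64.
    public static uint WordAt(byte[] data, int offset)
    {
        return (uint)(data[offset]
                    | data[offset + 1] << 8
                    | data[offset + 2] << 16
                    | data[offset + 3] << 24);
    }

    // The top byte belongs to the next pixel; OR-ing with 0xff000000
    // forces it to 255 whatever its value was.
    public static uint ToOpaque(uint word)
    {
        return word | 0xff000000u;
    }
}
```

For the bytes 0x11, 0x22, 0x33, 0x44, `WordAt` returns 0x44332211 and `ToOpaque` turns that into 0xFF332211, i.e. R, G, B, 255 in memory order.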

The Code:

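static unsafe void FastConvert(int pixelCount, byte[] rgbData, byte[] rgbaData)
{
  fixed (byte* rgbP = &rgbData[0], rgbaP = &rgbaData[0])
  {
    for (long i = 0, offsetRgb = 0; i < pixelCount; i++, offsetRgb += 3)
    {
      // Read 4 bytes (R, G, B plus the next pixel's first byte) and force the top byte to alpha 255.
      // Caveat: for the last pixel this read extends one byte past a tightly sized rgbData.
      ((uint*)rgbaP)[i] = *(uint*)(rgbP + offsetRgb) | 0xff000000;
    }
  }
}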

static unsafe void FastConvert4Loop(long pixelCount, byte* rgbP, byte* rgbaP)
{
  for (long i = 0, offsetRgb = 0; i < pixelCount; i += 4, offsetRgb += 12)
  {
    uint c1 = *(uint*)(rgbP + offsetRgb);
    uint c2 = *(uint*)(rgbP + offsetRgb + 3);
    uint c3 = *(uint*)(rgbP + offsetRgb + 6);
    uint c4 = *(uint*)(rgbP + offsetRgb + 9);
    ((uint*)rgbaP)[i] = c1 | 0xff000000;
    ((uint*)rgbaP)[i + 1] = c2 | 0xff000000;
    ((uint*)rgbaP)[i + 2] = c3 | 0xff000000;
    ((uint*)rgbaP)[i + 3] = c4 | 0xff000000;
  }
}

static unsafe void FastConvert4(int pixelCount, byte[] rgbData, byte[] rgbaData)
{
  if ((pixelCount & 3) != 0) throw new ArgumentException();
  fixed (byte* rgbP = &rgbData[0], rgbaP = &rgbaData[0])
  {
    FastConvert4Loop(pixelCount, rgbP, rgbaP);
  }
}
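
One caveat: both loops read 4 bytes per pixel, so the read for the final pixel touches one byte beyond rgbData when the array is exactly pixelCount * 3 bytes long. A minimal sketch of a variant that avoids this, assuming tightly packed data and a little-endian CPU; FastConvertSafeTail is a name introduced here, and I have not timed it:

```csharp
using System;

static class RgbConvert
{
    // Hypothetical variant: uses the uint trick for every pixel except the last,
    // which is assembled from exactly three bytes so no read leaves rgbData.
    public static unsafe void FastConvertSafeTail(int pixelCount, byte[] rgbData, byte[] rgbaData)
    {
        if (pixelCount <= 0) return;
        if (rgbData.Length < (long)pixelCount * 3 || rgbaData.Length < (long)pixelCount * 4)
            throw new ArgumentException("Buffers are too small for pixelCount.");

        fixed (byte* rgbP = rgbData, rgbaP = rgbaData)
        {
            uint* dst = (uint*)rgbaP;
            long last = pixelCount - 1;
            for (long i = 0, offsetRgb = 0; i < last; i++, offsetRgb += 3)
            {
                // R, G, B plus the next pixel's first byte; the OR forces that byte to 255.
                dst[i] = *(uint*)(rgbP + offsetRgb) | 0xff000000;
            }

            // Final pixel: build the word from its three bytes only.
            byte* p = rgbP + last * 3;
            dst[last] = (uint)(p[0] | p[1] << 8 | p[2] << 16) | 0xff000000;
        }
    }
}
```

Unlike FastConvert4 this accepts any pixel count; the same tail handling could be added after the unrolled loop if the extra speed matters.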

Upvotes: 1

mrvux

Reputation: 8953

Assuming you can compile with unsafe, using pointers will give you a significant boost in this case.

First create two structs to hold the data in a packed way (StructLayout lives in System.Runtime.InteropServices):

[StructLayout(LayoutKind.Sequential)]
public struct RGBA
{
    public byte r;
    public byte g;
    public byte b;
    public byte a;
}

[StructLayout(LayoutKind.Sequential)]
public struct RGB
{
    public byte r;
    public byte g;
    public byte b;
}

First version:

    static unsafe void Process_Pointer_PerChannel(int pixelCount, byte[] rgbData, byte[] rgbaData)
    {
        fixed (byte* rgbPtr = &rgbData[0])
        {
            fixed (byte* rgbaPtr = &rgbaData[0])
            {
                RGB* rgb = (RGB*)rgbPtr;
                RGBA* rgba = (RGBA*)rgbaPtr;
                for (int i = 0; i < pixelCount; i++)
                {
                    rgba->r = rgb->r;
                    rgba->g = rgb->g;
                    rgba->b = rgb->b;
                    rgba->a = 255;
                    rgb++;
                    rgba++;
                }
            }
        }
    }

This avoids a lot of array indexing (and its bounds checks) and copies the data directly through pointers.

Another version, which is slightly faster, copies all three channels at once through a pointer cast:

    static unsafe void Process_Pointer_Cast(int pixelCount, byte[] rgbData, byte[] rgbaData)
    {
        fixed (byte* rgbPtr = &rgbData[0])
        {
            fixed (byte* rgbaPtr = &rgbaData[0])
            {
                RGB* rgb = (RGB*)rgbPtr;
                RGBA* rgba = (RGBA*)rgbaPtr;
                for (int i = 0; i < pixelCount; i++)
                {
                    RGB* cp = (RGB*)rgba;
                    *cp = *rgb;
                    rgba->a = 255;
                    rgb++;
                    rgba++;
                }
            }
        }
    }

One small extra optimization (the gain is marginal): if you keep the same output array and reuse it for every frame, you can initialize it once with alpha set to 255, e.g.:

    static void InitRGBA_Alpha(int pixelCount, byte[] rgbaData)
    {
        for (int i = 0; i < pixelCount; i++)
        {
            rgbaData[i * 4 + 3] = 255;
        }
    }

Then as you will never change this channel, other functions do not need to write into it anymore:

    static unsafe void Process_Pointer_Cast_NoAlpha(int pixelCount, byte[] rgbData, byte[] rgbaData)
    {
        fixed (byte* rgbPtr = &rgbData[0])
        {
            fixed (byte* rgbaPtr = &rgbaData[0])
            {
                RGB* rgb = (RGB*)rgbPtr;
                RGBA* rgba = (RGBA*)rgbaPtr;
                for (int i = 0; i < pixelCount; i++)
                {
                    RGB* cp = (RGB*)rgba;
                    *cp = *rgb;
                    rgb++;
                    rgba++;
                }
            }
        }
    }

In my test (a 1920*1080 image, 100 iterations), I get the following average running times (i7, x64 release build):

  • Your version: 6.81ms
  • Process_Pointer_PerChannel: 4.3ms
  • Process_Pointer_Cast: 3.8ms
  • Process_Pointer_Cast_NoAlpha: 3.5ms
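
For anyone who wants to reproduce this kind of measurement, a timing harness along these lines is one way to do it. This is a sketch, not the code behind the figures above: the Benchmark class, the Measure helper and the warm-up call are assumptions, and a naive safe converter stands in for the functions being compared.

```csharp
using System;
using System.Diagnostics;

static class Benchmark
{
    // Naive safe baseline; produces the same output as the loop in the question.
    public static void ConvertNaive(int pixelCount, byte[] rgb, byte[] rgba)
    {
        for (int i = 0, s = 0, d = 0; i < pixelCount; i++, s += 3, d += 4)
        {
            rgba[d] = rgb[s];
            rgba[d + 1] = rgb[s + 1];
            rgba[d + 2] = rgb[s + 2];
            rgba[d + 3] = 255;
        }
    }

    // Returns the average time per call in milliseconds.
    public static double Measure(Action<int, byte[], byte[]> convert, int width, int height, int iterations)
    {
        int pixelCount = width * height;
        byte[] rgb = new byte[pixelCount * 3];
        byte[] rgba = new byte[pixelCount * 4];
        new Random(42).NextBytes(rgb); // arbitrary but repeatable input

        convert(pixelCount, rgb, rgba); // warm-up so JIT compilation is not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            convert(pixelCount, rgb, rgba);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

A call such as `Benchmark.Measure(Benchmark.ConvertNaive, 1920, 1080, 100)` matches the frame size and iteration count used here; any converter with the same (pixelCount, rgb, rgba) signature can be passed instead.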

Note that all of these functions can easily be split into chunks, with each chunk processed on a separate thread.
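
As a sketch of that chunking idea, the frame can be split by rows and each row converted as a separate Parallel.For work item. The ConvertParallel name and the per-row split are assumptions for illustration, and the inner loop is a plain safe one; any of the pointer versions could be used instead by offsetting the pointers per chunk. Whether this helps depends largely on memory bandwidth, so measure it.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelConvert
{
    // Converts tightly packed RGB rows to RGBA, one row per Parallel.For work item.
    // Rows write to disjoint ranges of 'rgba', so no locking is needed.
    public static void ConvertParallel(int width, int height, byte[] rgb, byte[] rgba)
    {
        if (rgb.Length < (long)width * height * 3 || rgba.Length < (long)width * height * 4)
            throw new ArgumentException("Buffers are too small for the given dimensions.");

        Parallel.For(0, height, y =>
        {
            int s = y * width * 3; // first source byte of row y
            int d = y * width * 4; // first destination byte of row y
            for (int x = 0; x < width; x++, s += 3, d += 4)
            {
                rgba[d] = rgb[s];
                rgba[d + 1] = rgb[s + 1];
                rgba[d + 2] = rgb[s + 2];
                rgba[d + 3] = 255;
            }
        });
    }
}
```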

If you need higher performance, you have two options (a bit out of scope for the question):

  • upload your image to a byte address buffer (as RGB), and perform the conversion to a texture in a compute shader. That involves some bit shifting and a bit of fiddling with formats, but is reasonably straightforward to achieve.
  • Generally camera images come in YUV format (with U and V downsampled), so it's much faster to upload the image in that color space and convert to RGBA in either a pixel shader or a compute shader. If your camera SDK lets you get pixel data in that native format, that's the way to go.
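
To give an idea of the bit shifting involved in the first option: a byte address buffer is read in 4-byte words, while each pixel starts at a multiple of 3 bytes, so a pixel may straddle two words. The real code would be HLSL in the compute shader; the C# below is only a CPU reference for the same arithmetic, handy for checking a shader's output. PackedRgb and UnpackRgb are hypothetical names, and the layout assumed is tightly packed RGB viewed as little-endian words.

```csharp
using System;

static class PackedRgb
{
    // 'words' is the packed RGB byte stream viewed as little-endian 32-bit words.
    // Returns the pixel as a uint whose bytes in memory order are R, G, B, 255.
    public static uint UnpackRgb(uint[] words, int pixelIndex)
    {
        int byteOffset = pixelIndex * 3;
        int w = byteOffset >> 2;          // word containing the pixel's first byte
        int shift = (byteOffset & 3) * 8; // bit position of that byte inside the word

        uint value = words[w] >> shift;
        if (shift > 8)
        {
            // Fewer than 24 bits remain in this word; take the rest from the next one.
            value |= words[w + 1] << (32 - shift);
        }
        return (value & 0x00ffffffu) | 0xff000000u;
    }
}
```

For the byte sequence 1 to 12 (four pixels), the words are 0x04030201, 0x08070605 and 0x0C0B0A09, and `UnpackRgb(words, 1)` returns 0xFF060504.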

Upvotes: 1
