Mweidy
Mweidy

Reputation: 31

Why SIMD only improves performance by only a little bit for RGB to Grayscale, with SIMD multiply but scalar add of vector elements?

I am learning how to use SIMD for image processing. However, I wonder why I have not seen much improvement in the performance after using SIMD.

  1. Normal way with unsafe & Parallel.For: 48ms in average
static void GrayViaParallel(BitmapData org, BitmapData des)
{
    int width = org.Width;
    int height = org.Height;

    var orgp = (byte*)org.Scan0.ToPointer();
    var desp = (byte*)des.Scan0.ToPointer();

    Parallel.For(0, height, i =>
    {
        int orgSd = i * org.Stride;
        int desSd = i * des.Stride;
        for (int j = 0; j < width; j++)
        {
            //                              Red                     Green                  Blue
            desp[desSd] = (byte)((orgp[orgSd + 2] * 19595 + orgp[orgSd + 1] * 38469 + orgp[orgSd] * 7472) >> 16);
            desSd++;
            orgSd += 3;
        }
    });
}
  1. Implemented SIMD: 32ms in average
static void GrayViaParallelAndSIMD(byte* src, byte* dst, int count)
{
    var Coeleft = Vector128.Create(mulBlue, mulGreen, mulRed, mulBlue, mulGreen, mulRed, mulBlue, mulGreen);
    var CoeRight = Vector128.Create(mulRed, mulBlue, mulGreen, mulRed, mulBlue, mulGreen, mulRed, 0);

    int allPixels = count * 3;
    byte* srcEnd = src + allPixels; //Is it wrong?
    int stride = 15; //Proceed 15 bytes per step
    int loopCount = (int)((srcEnd - src) / stride);

    Parallel.For(0, loopCount, i =>
    {
        int curPos = (i + 1) * stride;
        if (curPos < allPixels) //If not added,  it will exceed the image data
        {
            // Load the first 16 bytes of the pixels
            var _1st16bytes = Sse2.LoadVector128(src + i * stride);

            // Get the first 8 bytes
            var low = Sse2.UnpackLow(_1st16bytes, Vector128<byte>.Zero).AsUInt16();
            //Get the next 8 bytes
            var high = Sse2.UnpackHigh(_1st16bytes, Vector128<byte>.Zero).AsUInt16();

            // Calculate the first 8 bytes
            var lowMul = Sse2.MultiplyHigh(Coeleft, low);
            // Calculate the next 8 bytes
            var highMul = Sse2.MultiplyHigh(CoeRight, high);

            //               Blue                     Green                   Red
            var px1 = lowMul.GetElement(0)  + lowMul.GetElement(1)  + lowMul.GetElement(2);
            var px2 = lowMul.GetElement(3)  + lowMul.GetElement(4)  + lowMul.GetElement(5);
            var px3 = lowMul.GetElement(6)  + lowMul.GetElement(7)  + highMul.GetElement(0);
            var px4 = highMul.GetElement(1) + highMul.GetElement(2) + highMul.GetElement(3);
            var px5 = highMul.GetElement(4) + highMul.GetElement(5) + highMul.GetElement(6);

            //15 bytes for 5 pixels 
            var i5 = i * 5;

            dst[i5    ] = (byte)px1;
            dst[i5 + 1] = (byte)px2;
            dst[i5 + 2] = (byte)px3;
            dst[i5 + 3] = (byte)px4;
            dst[i5 + 4] = (byte)px5;
        }
    });
}

Is there a better way to do this, or how can I enhance it, please? I would be very grateful for any advice.

I have experimented with different SIMD approaches, but none of them seem to work well. I am looking for a substantial boost in efficiency.

Upvotes: 2

Views: 231

Answers (1)

zyl910
zyl910

Reputation: 321

The GetElement method is a scalar operation and does not have SIMD hardware acceleration. To get better SIMD hardware acceleration, you need to avoid scalar operations as much as possible.

For 24-bit to 8-bit grayscale conversion, this approach can be used: read 3 vectors at a time from the source bitmap, perform a 3-element group deinterleave operation, and obtain the R,G,B plane data. Subsequently, vectorized multiplication and addition are used to calculate the grayscale values. The result is a 1-vector that stores the grayscale values, which can be stored to the destination bitmap.

For example, the SSE instruction set uses 128-bit vector, where 1 vector is 16 bytes. Reading 3 vectors at a time from the source bitmap is reading 48 bytes, or 16 RGB pixels. Finally, store one vector into the destination bitmap means writing 16 bytes, which is 16 grayscale pixels.

For deinterleaving 3-element groups, this can be done using instructions of the shuffle category. For example, for a 128 bit vector in the X86 architecture, you can use the _mm_shuffle_epi8 instruction in SSSE3, which corresponds to the Ssse3.Shuffle method in .NET. The source code is as follows.

static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part0 = Vector128.Create((sbyte)1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part0 = Vector128.Create((sbyte)2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15).AsByte();

public static Vector128<byte> YGroup3Unzip(Vector128<byte> data0, Vector128<byte> data1, Vector128<byte> data2, out Vector128<byte> y, out Vector128<byte> z) {
    var f0A = YGroup3Unzip_Shuffle_Byte_X_Part0;
    var f0B = YGroup3Unzip_Shuffle_Byte_X_Part1;
    var f0C = YGroup3Unzip_Shuffle_Byte_X_Part2;
    var f1A = YGroup3Unzip_Shuffle_Byte_Y_Part0;
    var f1B = YGroup3Unzip_Shuffle_Byte_Y_Part1;
    var f1C = YGroup3Unzip_Shuffle_Byte_Y_Part2;
    var f2A = YGroup3Unzip_Shuffle_Byte_Z_Part0;
    var f2B = YGroup3Unzip_Shuffle_Byte_Z_Part1;
    var f2C = YGroup3Unzip_Shuffle_Byte_Z_Part2;
    var rt0 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f0A), Ssse3.Shuffle(data1, f0B)), Ssse3.Shuffle(data2, f0C));
    var rt1 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f1A), Ssse3.Shuffle(data1, f1B)), Ssse3.Shuffle(data2, f1C));
    var rt2 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f2A), Ssse3.Shuffle(data1, f2B)), Ssse3.Shuffle(data2, f2C));
    y = rt1;
    z = rt2;
    return rt0;
}

To make it easier to write vector algorithms across platforms, I have developed the VectorTraits library, which already integrates the above algorithms. The library provides the method Vectors.YGroup3Unzip method. The method is cross-platform and uses the shuffle instructions of each platform.

  • X86 256-bit: Use _mm256_shuffle_epi8 and other instructions.
  • Arm: Use vqvtbl1q_u8 instructions.
  • Wasm: Use i8x16.swizzle instructions.

With the Vectors.YGroup3Unzip method, it is easy to write a 24-bit to 8-bit grayscale algorithm. The grayscale coefficients have 8-bit precision, so you need to widen the 8-bit data to 16-bit and then compute the multiplication and addition. Finally, the 16-bit data is narrowed to 8 bits. The source code is as follows.

public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 3; // Bgr24
    const int shiftPoint = 8;
    const int mulPoint = 1 << shiftPoint; // 0x100
    const ushort mulRed = (ushort)(0.299 * mulPoint + 0.5); // 77
    const ushort mulGreen = (ushort)(0.587 * mulPoint + 0.5); // 150
    const ushort mulBlue = mulPoint - mulRed - mulGreen; // 29
    Vector<ushort> vmulRed = new Vector<ushort>(mulRed);
    Vector<ushort> vmulGreen = new Vector<ushort>(mulGreen);
    Vector<ushort> vmulBlue = new Vector<ushort>(mulBlue);
    int vectorWidth = Vector<byte>.Count;
    int maxX = width - vectorWidth;
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        Vector<byte>* pLast = (Vector<byte>*)(pRow + maxX * cbPixel);
        Vector<byte>* qLast = (Vector<byte>*)(qRow + maxX * 1);
        Vector<byte>* p = (Vector<byte>*)pRow;
        Vector<byte>* q = (Vector<byte>*)qRow;
        for (; ; ) {
            Vector<byte> r, g, b, gray;
            Vector<ushort> wr0, wr1, wg0, wg1, wb0, wb1;
            // Load.
            b = Vectors.YGroup3Unzip(p[0], p[1], p[2], out g, out r);
            // widen(r) * mulRed + widen(g) * mulGreen + widen(b) * mulBlue
            Vector.Widen(r, out wr0, out wr1);
            Vector.Widen(g, out wg0, out wg1);
            Vector.Widen(b, out wb0, out wb1);
            wr0 = Vectors.Multiply(wr0, vmulRed);
            wr1 = Vectors.Multiply(wr1, vmulRed);
            wg0 = Vectors.Multiply(wg0, vmulGreen);
            wg1 = Vectors.Multiply(wg1, vmulGreen);
            wb0 = Vectors.Multiply(wb0, vmulBlue);
            wb1 = Vectors.Multiply(wb1, vmulBlue);
            wr0 = Vector.Add(wr0, wg0);
            wr1 = Vector.Add(wr1, wg1);
            wr0 = Vector.Add(wr0, wb0);
            wr1 = Vector.Add(wr1, wb1);
            // Shift right and narrow.
            wr0 = Vectors.ShiftRightLogical_Const(wr0, shiftPoint);
            wr1 = Vectors.ShiftRightLogical_Const(wr1, shiftPoint);
            gray = Vector.Narrow(wr0, wr1);
            // Store.
            *q = gray;
            // Next.
            if (p >= pLast) break;
            p += cbPixel;
            ++q;
            if (p > pLast) p = pLast; // The last block is also use vector.
            if (q > qLast) q = qLast;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

The Vectors.ShiftRightLogical_Const in the source code above is a method provided by the VectorTraits library. It replaces the Vector.ShiftRightLogical method that was new in .NET 7.0, and allows earlier versions of .NET to use logical right shift.

The Vectors.Multiply is also a method provided by the VectorTraits library. It avoids the problem that unsigned types are sometimes no hardware accelerated.

Then write benchmark code for the algorithm.

[Benchmark]
public void UseVectors() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, false);
}

[Benchmark]
public void UseVectorsParallel() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, true);
}

public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, bool useParallel = false) {
    int vectorWidth = Vector<byte>.Count;
    int width = src.Width;
    int height = src.Height;
    if (width <= vectorWidth) {
        ScalarDo(src, dst);
        return;
    }
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    int processorCount = Environment.ProcessorCount;
    int batchSize = height / (processorCount * 2);
    bool allowParallel = useParallel && (batchSize > 0) && (processorCount > 1);
    if (allowParallel) {
        int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize)
        Parallel.For(0, batchCount, i => {
            int start = batchSize * i;
            int len = batchSize;
            if (start + len > height) len = height - start;
            byte* pSrc2 = pSrc + start * strideSrc;
            byte* pDst2 = pDst + start * strideDst;
            UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}

The full source code at Bgr24ToGray8Benchmark.cs

The results of the benchmark on the X86 architecture are as follows.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method               | Width | Mean         | Error      | StdDev     | Ratio | RatioSD | Code Size |
|--------------------- |------ |-------------:|-----------:|-----------:|------:|--------:|----------:|
| Scalar               | 1024  |  1,028.55 us |  12.545 us |  11.735 us |  1.00 |    0.02 |     152 B |
| UseVectors           | 1024  |     94.06 us |   0.606 us |   0.537 us |  0.09 |    0.00 |        NA |
| UseVectorsParallel   | 1024  |     24.98 us |   0.390 us |   0.365 us |  0.02 |    0.00 |        NA |
| PeterParallelScalar  | 1024  |    216.47 us |   1.719 us |   1.524 us |  0.21 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 2048  |  4,092.26 us |  21.098 us |  18.703 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 2048  |    507.70 us |   9.626 us |  11.459 us |  0.12 |    0.00 |        NA |
| UseVectorsParallel   | 2048  |    118.98 us |   1.025 us |   0.959 us |  0.03 |    0.00 |        NA |
| PeterParallelScalar  | 2048  |    803.30 us |   9.226 us |   8.630 us |  0.20 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 4096  | 16,391.12 us | 121.643 us | 113.785 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 4096  |  2,472.16 us |  32.452 us |  30.356 us |  0.15 |    0.00 |        NA |
| UseVectorsParallel   | 4096  |  2,034.85 us |  33.074 us |  30.937 us |  0.12 |    0.00 |        NA |
| PeterParallelScalar  | 4096  |  3,139.85 us |  32.657 us |  27.270 us |  0.19 |    0.00 |        NA |
  • Scalar: Scalar algorithm.
  • UseVectors: Vector algorithm.
  • UseVectorsParallel: Vector algorithm with Parallel.
  • PeterParallelScalar: Scalar algorithm with Parallel written by Peter (GrayViaParallel).

The same source code can be run on the Arm architecture. The benchmark results are as follows.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.0.1 (24A348) [Darwin 24.0.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD [AttachedDebugger]
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD


| Method               | Width | Mean         | Error     | StdDev    | Ratio | RatioSD |
|--------------------- |------ |-------------:|----------:|----------:|------:|--------:|
| Scalar               | 1024  |    635.31 us |  0.537 us |  0.448 us |  1.00 |    0.00 |
| UseVectors           | 1024  |    127.04 us |  0.567 us |  0.474 us |  0.20 |    0.00 |
| UseVectorsParallel   | 1024  |     46.37 us |  0.336 us |  0.314 us |  0.07 |    0.00 |
| PeterParallelScalar  | 1024  |    202.19 us |  1.025 us |  0.959 us |  0.32 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 2048  |  2,625.64 us |  1.795 us |  1.402 us |  1.00 |    0.00 |
| UseVectors           | 2048  |    521.40 us |  0.301 us |  0.282 us |  0.20 |    0.00 |
| UseVectorsParallel   | 2048  |    152.11 us |  3.548 us | 10.064 us |  0.06 |    0.00 |
| PeterParallelScalar  | 2048  |    711.00 us |  1.806 us |  1.601 us |  0.27 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 4096  | 10,457.09 us |  5.697 us |  5.051 us |  1.00 |    0.00 |
| UseVectors           | 4096  |  2,058.16 us |  4.110 us |  3.643 us |  0.20 |    0.00 |
| UseVectorsParallel   | 4096  |  1,152.15 us | 21.134 us | 21.703 us |  0.11 |    0.00 |
| PeterParallelScalar  | 4096  |  2,897.94 us | 56.893 us | 91.871 us |  0.28 |    0.01 |

Upvotes: 1

Related Questions