Reputation: 5649
I have an image of size 1920 x 1080. I am transferring it from host to device (H2D), processing it, and transferring it back (D2H) using three CUDA streams, where each stream handles one third of the data. I have been able to choose sensible block dimensions and threads per block by understanding SMs, SPs and warps. The code runs satisfactorily (takes 2 ms) if the kernel only has to do simple calculations. The simple calculation code below reads the R, G and B values from the source image and writes those same values back into the source image.
ptr_source[numChannels* (iw*y + x) + 0] = ptr_source[numChannels* (iw*y + x) + 0];
ptr_source[numChannels* (iw*y + x) + 1] = ptr_source[numChannels* (iw*y + x) + 1];
ptr_source[numChannels* (iw*y + x) + 2] = ptr_source[numChannels* (iw*y + x) + 2];
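For context, my stream setup looks roughly like the sketch below (simplified; the variable names are illustrative, and iw = 1080, ih = 1920 as in the constants further down):

const int nStreams = 3;
const size_t imgBytes = 1920 * 1080 * 3;
const size_t chunk = imgBytes / nStreams;

unsigned char *h_img, *d_img;
cudaMallocHost((void **)&h_img, imgBytes); // pinned host buffer so the async copies can overlap
cudaMalloc((void **)&d_img, imgBytes);

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

dim3 block(16, 16);
dim3 grid((1080 + block.x - 1) / block.x, (1920 / nStreams + block.y - 1) / block.y);

for (int i = 0; i < nStreams; ++i) {
    size_t offset = i * chunk;
    // copy one third of the image in, process it, and copy it back on the same stream
    cudaMemcpyAsync(d_img + offset, h_img + offset, chunk, cudaMemcpyHostToDevice, streams[i]);
    cudaKernel<<<grid, block, 0, streams[i]>>>(d_img + offset, nStreams);
    cudaMemcpyAsync(h_img + offset, d_img + offset, chunk, cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < nStreams; ++i) cudaStreamSynchronize(streams[i]);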
But I have to perform a few more calculations, each independent of all other threads, and then the computation time increases by 6 ms, which is too much for my application. I have already tried declaring the most frequently used constant values in constant memory. The code for these calculations is shown below. In that code I am again reading the R, G and B values, then calculating new R, G and B values by multiplying the old values with some constants, and finally writing the new R, G and B values back into the same source image at their corresponding positions.
__constant__ int iw = 1080;
__constant__ int ih = 1920;
__constant__ int numChannels = 3;
__global__ void cudaKernel(unsigned char *ptr_source, int numCudaStreams)
{
// Calculate our pixel's location
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
// Operate only if we are in the correct boundaries
if (x >= 0 && x < iw && y >= 0 && y < ih / numCudaStreams)
{
const int index_b = numChannels* (iw*y + x) + 0;
const int index_g = numChannels* (iw*y + x) + 1;
const int index_r = numChannels* (iw*y + x) + 2;
//GET VALUES: get the R,G and B values from Source image
unsigned char b_val = ptr_source[index_b];
unsigned char g_val = ptr_source[index_g];
unsigned char r_val = ptr_source[index_r];
float float_r_val = ((1.574090) * (float)r_val + (0.088825) * (float)g_val + (-0.1909) * (float)b_val);
float float_g_val = ((-0.344198) * (float)r_val + (1.579802) * (float)g_val + (-1.677604) * (float)b_val);
float float_b_val = ((-1.012951) * (float)r_val + (-1.781485) * (float)g_val + (2.404436) * (float)b_val);
unsigned char dst_r_val = (float_r_val > 255.0f) ? 255 : static_cast<unsigned char>(float_r_val);
unsigned char dst_g_val = (float_g_val > 255.0f) ? 255 : static_cast<unsigned char>(float_g_val);
unsigned char dst_b_val = (float_b_val > 255.0f) ? 255 : static_cast<unsigned char>(float_b_val);
//PUT VALUES---put the new calculated values of R,G and B
ptr_source[index_b] = dst_b_val;
ptr_source[index_g] = dst_g_val;
ptr_source[index_r] = dst_r_val;
}
}
Problem: I think that transferring the image segment (i.e. ptr_source) to shared memory will help, but I am quite confused about how to do it. I mean, the scope of shared memory is a single block, so how do I manage the transfer of the image segment into shared memory?
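From the examples I have seen, per-block staging looks something like this (my rough sketch for a single channel; TILE_W/TILE_H are made-up tile sizes and I am not sure this is the right approach):

#define TILE_W 16
#define TILE_H 16

__global__ void stageTileSketch(unsigned char *img, int width, int height)
{
    // Each block owns one TILE_W x TILE_H tile in shared memory.
    __shared__ unsigned char tile[TILE_H][TILE_W];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = img[y * width + x]; // every thread stages its own pixel

    __syncthreads(); // the whole block can now see the staged tile

    if (x < width && y < height)
        img[y * width + x] = tile[threadIdx.y][threadIdx.x]; // process the staged value and write it back
}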
PS: My GPU is Quadro K2000, compute 3.0, 2 SM, 192 SP per SM.
Upvotes: 1
Views: 623
Reputation: 1074
Shared memory won't help in your case: your memory accesses are not coalesced.
You can try the following: replacing your char* ptr_source with a uchar3* should help your threads access contiguous data in the array. uchar3 simply means three contiguous unsigned chars.
Since threads within the same warp execute the same instruction at the same time, you currently get this kind of access pattern:
Suppose you access memory starting at address 0x3F0000.
thread 1 reads 0x3F0000, then 0x3F0001, then 0x3F0002
thread 2 reads 0x3F0003, then 0x3F0004, then 0x3F0005
On each of those instructions the warp touches addresses 3 bytes apart (0x3F0000 and 0x3F0003 on the first load), so the accesses are strided rather than contiguous and performance suffers.
With uchar3 you get:
thread 1 : 0x3F0000 to 0x3F0002
thread 2 : 0x3F0003 to 0x3F0005
Because each thread now loads its three bytes in one transaction, the warp as a whole reads one contiguous range and the memory controller can service it quickly.
You can also replace:
(float_r_val > 255.0f) ? 255 : static_cast<unsigned char>(float_r_val);
with
float_r_val = fminf(255.0f, float_r_val);
This should give you a kernel like this:
__global__ void cudaKernel(uchar3 *ptr_source, int numCudaStreams)
{
// Calculate our pixel's location
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
// Operate only if we are in the correct boundaries
if (x >= 0 && x < iw && y >= 0 && y < ih / numCudaStreams)
{
const int index = (iw*y + x);
uchar3 val = ptr_source[index]; // val.x = B, val.y = G, val.z = R, same layout as before
float float_r_val = ((1.574090f) * (float)val.z + (0.088825f) * (float)val.y + (-0.1909f) * (float)val.x);
float float_g_val = ((-0.344198f) * (float)val.z + (1.579802f) * (float)val.y + (-1.677604f) * (float)val.x);
float float_b_val = ((-1.012951f) * (float)val.z + (-1.781485f) * (float)val.y + (2.404436f) * (float)val.x);
ptr_source[index] = make_uchar3((unsigned char)fminf(255.0f, float_b_val), (unsigned char)fminf(255.0f, float_g_val), (unsigned char)fminf(255.0f, float_r_val));
}
}
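On the host side you don't need to change your allocation: since uchar3 is just three bytes with no extra alignment requirement (unlike uchar4), you can reinterpret the existing pointer at the call site. A minimal sketch (the grid/block sizes here are placeholders, adapt them to your stream setup):

unsigned char *d_image; // your existing unsigned char allocation
cudaMalloc((void **)&d_image, 1920 * 1080 * 3);
dim3 block(16, 16);
dim3 grid((1080 + block.x - 1) / block.x, (1920 + block.y - 1) / block.y);
cudaKernel<<<grid, block>>>(reinterpret_cast<uchar3 *>(d_image), 1);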
I hope these updates will improve performance.
Upvotes: 1
Reputation: 72350
I'm going to add this code without too much comment for the moment:
const int iw = 1080;
const int ih = 1920;
const int numChannels = 3;
__global__ void cudaKernel3(unsigned char *ptr_source, int n)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
uchar3 * p = reinterpret_cast<uchar3 *>(ptr_source);
for(; idx < n; idx+=stride) {
uchar3 vin = p[idx];
unsigned char b_val = vin.x;
unsigned char g_val = vin.y;
unsigned char r_val = vin.z;
float float_r_val = ((1.574090f) * (float)r_val + (0.088825f) * (float)g_val + (-0.1909f) * (float)b_val);
float float_g_val = ((-0.344198f) * (float)r_val + (1.579802f) * (float)g_val + (-1.677604f) * (float)b_val);
float float_b_val = ((-1.012951f) * (float)r_val + (-1.781485f) * (float)g_val + (2.404436f) * (float)b_val);
uchar3 vout;
vout.x = (unsigned char)fminf(255.f, float_b_val); // keep the B,G,R order the image was loaded in
vout.y = (unsigned char)fminf(255.f, float_g_val);
vout.z = (unsigned char)fminf(255.f, float_r_val);
p[idx] = vout;
}
}
// Original kernel with a bit of template magic to conditionally correct
// accidental double precision arithmetic removed for brevity
int main()
{
const size_t sz = iw * ih * numChannels;
typedef unsigned char uchar;
uchar * image = new uchar[sz];
uchar v = 0;
for(int i=0; i<sz; i++) {
image[i] = v;
v = (++v > 128) ? 0 : v;
}
uchar * image_;
cudaMalloc((void **)&image_, sz);
cudaMemcpy(image_, image, sz, cudaMemcpyHostToDevice);
dim3 blocksz(32,32);
dim3 gridsz(1+iw/blocksz.x, 1+ih/blocksz.y);
cudaKernel<1><<<gridsz, blocksz>>>(image_, 1);
cudaDeviceSynchronize();
cudaMemcpy(image_, image, sz, cudaMemcpyHostToDevice);
cudaKernel<0><<<gridsz, blocksz>>>(image_, 1);
cudaDeviceSynchronize();
cudaMemcpy(image_, image, sz, cudaMemcpyHostToDevice);
cudaKernel3<<<16, 512>>>(image_, iw * ih);
cudaDeviceSynchronize();
cudaDeviceReset();
return 0;
}
The idea here is to just have as many threads as can be resident on the device, and have them process the whole image, with each thread emitting multiple outputs. Block scheduling is very cheap in CUDA, but it isn't free, and neither are indexing calculations and all the other "setup" code required for one thread to do useful work. So the idea is simply to amortise those costs over many outputs. Because your image is just linear memory and the operations you perform on each entry are completely independent, there is no point in using a 2D grid and 2D indexing; that is simply additional setup code which slows things down. You will also see the use of a vector type (uchar3), which should improve memory throughput by reducing the number of memory transactions per pixel.
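If you would rather not hard-code the <<<16, 512>>> launch, one way to size the grid to roughly fill the device is the occupancy API (available since CUDA 6.5); this is just a sketch of that approach, not what I benchmarked below:

int device = 0, numSMs = 0, blocksPerSM = 0;
const int threadsPerBlock = 512;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
// how many blocks of cudaKernel3 at this block size can be resident per SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, cudaKernel3, threadsPerBlock, 0);
int numBlocks = numSMs * blocksPerSM; // enough blocks to keep every SM busy
cudaKernel3<<<numBlocks, threadsPerBlock>>>(image_, iw * ih);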
Also note that on double precision capable GPUs, double precision constants will cause the compiler to emit 64 bit floating point arithmetic. There is a 2 to 12 times performance penalty for double precision compared to single precision, depending on your GPU. When I compile the kernel you posted and look at the PTX the CUDA 7 release compiler emits for the sm_30 architecture (the same as your GPU), I see this in the pixel computation code:
cvt.f64.f32 %fd1, %f4;
mul.f64 %fd2, %fd1, 0d3FF92F78FEEF5EC8;
ld.global.u8 %rs9, [%rd1+1];
cvt.rn.f32.u16 %f5, %rs9;
cvt.f64.f32 %fd3, %f5;
fma.rn.f64 %fd4, %fd3, 0d3FB6BD3C36113405, %fd2;
ld.global.u8 %rs10, [%rd1];
cvt.rn.f32.u16 %f6, %rs10;
cvt.f64.f32 %fd5, %f6;
fma.rn.f64 %fd6, %fd5, 0dBFC86F694467381D, %fd4;
cvt.rn.f32.f64 %f1, %fd6;
mul.f64 %fd7, %fd1, 0dBFD607570C564F98;
fma.rn.f64 %fd8, %fd3, 0d3FF946DE76427C7C, %fd7;
fma.rn.f64 %fd9, %fd5, 0dBFFAD7774ABA3876, %fd8;
cvt.rn.f32.f64 %f2, %fd9;
mul.f64 %fd10, %fd1, 0dBFF0350C1B97353B;
fma.rn.f64 %fd11, %fd3, 0dBFFC80F66A550870, %fd10;
fma.rn.f64 %fd12, %fd5, 0d40033C48F10A99B7, %fd11;
cvt.rn.f32.f64 %f3, %fd12;
Note there is promotion of everything to 64 bit floating point, the multiplications are all done in 64 bit with the floating point constants in IEEE754 double format, and the results are then demoted back to 32 bit. This is a real performance cost and you should be careful to avoid it by properly defining floating point constants as single precision.
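To make that concrete, here is a minimal illustration (not taken from the code above) of the two ways of writing the same constant:

__device__ float scale(unsigned char r_val)
{
    float slow = 1.574090 * (float)r_val;  // double literal: promoted to 64 bit, then demoted back
    float fast = 1.574090f * (float)r_val; // float literal: stays in 32 bit arithmetic
    return slow + fast; // illustration only
}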
When run on a GT620M (a 2 SM Fermi mobile part, running on batteries), we get the following profile data from nvprof
Time(%) Time Calls Avg Min Max Name
39.44% 17.213ms 1 17.213ms 17.213ms 17.213ms void cudaKernel<int=1>(unsigned char*, int)
35.02% 15.284ms 3 5.0947ms 5.0290ms 5.2022ms [CUDA memcpy HtoD]
18.51% 8.0770ms 1 8.0770ms 8.0770ms 8.0770ms void cudaKernel<int=0>(unsigned char*, int)
7.03% 3.0662ms 1 3.0662ms 3.0662ms 3.0662ms cudaKernel3(unsigned char*, int)
==5504== API calls:
Time(%) Time Calls Avg Min Max Name
95.37% 1.01433s 1 1.01433s 1.01433s 1.01433s cudaMalloc
3.17% 33.672ms 3 11.224ms 4.8036ms 19.039ms cudaDeviceSynchronize
1.29% 13.706ms 3 4.5687ms 4.5423ms 4.5924ms cudaMemcpy
0.12% 1.2560ms 83 15.132us 427ns 541.81us cuDeviceGetAttribute
0.03% 329.28us 3 109.76us 91.086us 139.41us cudaLaunch
0.02% 209.54us 1 209.54us 209.54us 209.54us cuDeviceGetName
0.00% 23.520us 1 23.520us 23.520us 23.520us cuDeviceTotalMem
0.00% 13.685us 3 4.5610us 2.9930us 7.6980us cudaConfigureCall
0.00% 9.4090us 6 1.5680us 428ns 3.4210us cudaSetupArgument
0.00% 5.1320us 2 2.5660us 2.5660us 2.5660us cuDeviceGetCount
0.00% 2.5660us 2 1.2830us 1.2830us 1.2830us cuDeviceGet
and when run on something bigger (GTX 670 Kepler device with 7 SMX):
==9442== NVPROF is profiling process 9442, command: ./a.out
==9442== Profiling application: ./a.out
==9442== Profiling result:
Time(%) Time Calls Avg Min Max Name
65.68% 2.6976ms 3 899.19us 784.56us 1.0829ms [CUDA memcpy HtoD]
20.84% 856.05us 1 856.05us 856.05us 856.05us void cudaKernel<int=1>(unsigned char*, int)
7.90% 324.64us 1 324.64us 324.64us 324.64us void cudaKernel<int=0>(unsigned char*, int)
5.58% 229.12us 1 229.12us 229.12us 229.12us cudaKernel3(unsigned char*, int)
==9442== API calls:
Time(%) Time Calls Avg Min Max Name
55.88% 45.443ms 1 45.443ms 45.443ms 45.443ms cudaMalloc
38.16% 31.038ms 1 31.038ms 31.038ms 31.038ms cudaDeviceReset
3.55% 2.8842ms 3 961.40us 812.99us 1.1982ms cudaMemcpy
1.92% 1.5652ms 3 521.72us 294.16us 882.27us cudaDeviceSynchronize
0.32% 262.49us 83 3.1620us 150ns 110.94us cuDeviceGetAttribute
0.09% 74.253us 3 24.751us 15.575us 41.784us cudaLaunch
0.03% 22.568us 1 22.568us 22.568us 22.568us cuDeviceTotalMem
0.03% 20.815us 1 20.815us 20.815us 20.815us cuDeviceGetName
0.01% 7.3900us 6 1.2310us 200ns 5.3890us cudaSetupArgument
0.00% 3.6510us 2 1.8250us 674ns 2.9770us cuDeviceGetCount
0.00% 3.1440us 3 1.0480us 516ns 1.9410us cudaConfigureCall
0.00% 2.1600us 2 1.0800us 985ns 1.1750us cuDeviceGet
So there is a big speed up to be had just by fixing elementary mistakes and using sensible design patterns, on both smaller and larger devices. Believe it or not.
Upvotes: 2