Reputation: 41022
I am working through a CUDA tutorial that converts an RGBA picture to greyscale, but I couldn't figure out why changing the blockSize and gridSize makes a 33x improvement in execution time.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * numCols + threadIdx.x;
    float channelSum = .299f * rgbaImage[i].x + .587f * rgbaImage[i].y + .114f * rgbaImage[i].z;
    greyImage[i] = channelSum;
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(numCols, 1, 1);
    const dim3 gridSize(numRows, 1, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
When I set it as above:
const dim3 blockSize(numCols, 1, 1);
const dim3 gridSize(numRows, 1, 1);
I get "Your code executed in 0.030304 ms".
When I set:
const dim3 blockSize(1, 1, 1);
const dim3 gridSize(numRows, numCols, 1);
and update the kernel to use the new index:
int i = blockIdx.x * numCols + blockIdx.y;
I get "Your code executed in 0.995456 ms".
FYI:
numRows = 313, numCols = 557
Technical properties:
#uname -a && /usr/bin/nvidia-settings -v
Linux ip-10-16-23-92 3.2.0-39-virtual #62-Ubuntu SMP Thu Feb 28 00:48:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
nvidia-settings: version 304.54 (buildmeister@swio-display-x86-rhel47-11)
Upvotes: 1
Views: 687
Reputation: 16816
Neither of the grid/block configurations is recommended. The first one is not scalable, because the number of threads per block is limited on the GPU, so it will eventually fail for larger image sizes. The second one is a poor choice because there is only 1 thread per block, which leaves GPU occupancy very low; you can verify this with the GPU Occupancy Calculator included in the CUDA Toolkit. The recommended block size is a multiple of the warp size, which is 32 threads on NVIDIA GPUs (older devices service memory transactions per 16-thread half-warp).
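If you want to see the actual limits for your device, a minimal sketch using the runtime API's cudaGetDeviceProperties could look like the following (querying device 0 is an assumption; adjust for your setup):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:        %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}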
A general and scalable approach for a 2D grid and block size in your case would be something like this:
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x, (numRows + blockSize.y - 1) / blockSize.y, 1);
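With the question's numRows = 313 and numCols = 557, this launches a grid of ceil(557/16) x ceil(313/16) = 35 x 20 = 700 blocks of 256 threads each, which covers all 557 x 313 = 174,341 pixels with a small amount of padding at the right and bottom edges (handled by the bounds check below).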
You can change the block size from 16 x 16 to any size you like, provided you stay within the limits of the device. A maximum of 512 threads per block is allowed for devices of compute capability 1.0 to 1.3; for devices of compute capability 2.0 onwards, this limit is 1024 threads per block.
As the grid and block are now two-dimensional, the indexing inside the kernel is modified as follows:
int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
int j = blockIdx.y * blockDim.y + threadIdx.y;  // row
int idx = j * numCols + i;

// Don't forget to perform bounds checks
if (i >= numCols || j >= numRows) return;

float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
greyImage[idx] = channelSum;
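Putting the pieces together, a minimal sketch of the revised kernel and launcher (reusing the function names and checkCudaErrors helper from the question's code, and the 16 x 16 block size suggested above) would be:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // Threads landing outside the image do nothing
    if (i >= numCols || j >= numRows) return;

    int idx = j * numCols + i;
    float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
    greyImage[idx] = channelSum;
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}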
Upvotes: 6