Reputation: 41022
I am working through a CUDA tutorial that converts an RGBA picture to greyscale, but I couldn't figure out why changing the blockSize and gridSize makes a 33x improvement in execution time.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * numCols + threadIdx.x;
    float channelSum = .299f * rgbaImage[i].x + .587f * rgbaImage[i].y + .114f * rgbaImage[i].z;
    greyImage[i] = channelSum;
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(numCols, 1, 1);
    const dim3 gridSize(numRows, 1, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
When I set it as above:
const dim3 blockSize(numCols, 1, 1);
const dim3 gridSize(numRows, 1, 1);
I get "Your code executed in 0.030304 ms".
When I set:
const dim3 blockSize(1, 1, 1);
const dim3 gridSize(numRows, numCols, 1);
and update the kernel to use the new index:
int i = blockIdx.x * numCols + blockIdx.y;
I get "Your code executed in 0.995456 ms".
FYI:
numRows = 313, numCols = 557
Technical properties:
#uname -a && /usr/bin/nvidia-settings -v
Linux ip-10-16-23-92 3.2.0-39-virtual #62-Ubuntu SMP Thu Feb 28 00:48:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
nvidia-settings: version 304.54 (buildmeister@swio-display-x86-rhel47-11)
Upvotes: 1
Views: 687
Reputation: 16816
Neither of the grid/block configurations is recommended. The first one is not scalable, because the number of threads per block is limited on the GPU, so it will eventually fail for larger image sizes. The second one is a poor choice because there is only 1 thread per block, which leaves GPU occupancy very low; you can verify this with the GPU Occupancy Calculator included in the CUDA Toolkit. The recommended block size is a multiple of the warp size, which is 32 threads on NVIDIA GPUs (older devices service memory transactions per 16-thread half-warp).
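If you want to see the actual limits for your device, a minimal sketch using the runtime API's cudaGetDeviceProperties could look like the following (querying device 0 is an assumption; adjust for your setup):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:        %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}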
A general and scalable approach for a 2D grid and block size in your case would be something like this:
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x, (numRows + blockSize.y - 1) / blockSize.y, 1);
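With the question's numRows = 313 and numCols = 557, this launches a grid of ceil(557/16) x ceil(313/16) = 35 x 20 = 700 blocks of 256 threads each, which covers all 557 x 313 = 174,341 pixels with a small amount of padding at the right and bottom edges (handled by the bounds check below).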
You can change the block size from 16 x 16 to any size you like, provided you stay within the limits of the device. A maximum of 512 threads per block is allowed for devices of compute capability 1.0 to 1.3; for devices of compute capability 2.0 onwards, this limit is 1024 threads per block.
As the grid and block are now two-dimensional, the indexing inside the kernel is modified as follows:
int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
int j = blockIdx.y * blockDim.y + threadIdx.y;  // row
int idx = j * numCols + i;

// Don't forget to perform bounds checks
if (i >= numCols || j >= numRows) return;

float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
greyImage[idx] = channelSum;
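Putting the pieces together, a minimal sketch of the revised kernel and launcher (reusing the function names and checkCudaErrors helper from the question's code, and the 16 x 16 block size suggested above) would be:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // Threads landing outside the image do nothing
    if (i >= numCols || j >= numRows) return;

    int idx = j * numCols + i;
    float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
    greyImage[idx] = channelSum;
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}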
Upvotes: 6