James Smith
James Smith

Reputation: 11

Calculating Grid and Block dimensions of a Kernel

Suppose you want to write a kernel that operates on an image of size 400x900 pixels. You also want to assign one GPU thread to each pixel. Your thread blocks are square and you want to use the maximum number of threads per block possible on the device. The maximum number of threads per block is 1024. How would you select the grid dimensions and block dimensions of your kernel?

My understanding of how this works is that attributing one thread to each pixel, I'd need 360,000 (400x900) threads. The data hierarchy goes grid -> block -> threads. I think the formula would end up being 360,000 = (# of blocks)*(# of threads per block), with # of blocks having to be a perfect square number and multiple of 32.

I've tried the numbers from 2 to 4096 and none of them give me an even quotient when dividing from 360,000. Does that mean threads can be an decimal number?

Upvotes: 1

Views: 4952

Answers (2)

sgarizvi
sgarizvi

Reputation: 16796

When processing 2D images with CUDA, a natural intuition is to use 2D block and grid shape. If we want to set maximum possible block size, we have to make sure that the product of its dimensions does not exceed the block size limit. Keeping in mind the limit of block size (1024), following are a few examples of valid block sizes.

dim3 block(32,32); //32 x 32 = 1024
or
dim3 block(64,16); //64 x 16 = 1024
or
dim3 block(16,64); //16 x 64 = 1024 ... Duh

Next comes the calculation of 2D grid size. If we want to map a thread for every pixel, then the grid should be created such that the total number of threads in each dimension is at-least equal to the corresponding image dimension. Remember that grid size means the number of block in each dimension. It means that the total number of threads in a dimension is equal to the product of grid size and block size in that dimension. For a 2D grid, the number of threads in X dimension is equal to block.x * grid.x and in Y dimension equal to block.y * grid.y.

Assuming you have an image of size 400 x 900, then the total number of threads in the corresponding dimension should also be at-least the same.

Let's say you choose a block of size (32,32). Then the number of blocks for the x and y dimensions of the image should be 400/32 and 900/32 . But neither of the image dimensions are an integer multiple of the corresponding block dimensions, so due to integer division we will end up creating grid of size 12 x 28 which will result in total number of threads equal to 384 x 896. (because 32 x 12 = 384 and 32 x 28 = 896).

As we can see that the total number of threads in each dimension are less than the corresponding image dimensions. What we need to do is to round up the number of blocks so that if the image dimension is not a multiple of block dimension, we create an additional block which will cover up the remaining pixels. Following are 2 ways to do that.

Instead of integer division to calculate the number of blocks, we use floating point division and ceil the results.

int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = ceil( float(image_width)/block.x );
grid.y = ceil( float(image_height)/block.y );

Another smart way is to use the following formula

int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1 )/block.x;
grid.y = (image_height + block.y - 1 )/block.y;

When the grid is created in the above mentioned ways, you will end up creating a grid of size 13 x 29 which will result in total number of threads equal to 416 x 928.

Now in this case, we have total number of threads in each dimension greater than the corresponding image dimension. This will result in some of the threads accessing memory outside the image bounds causing undefined behavior. The solution for this problem is that we perform bound checks inside the kernel and do processing only with those threads which fall inside the image bounds. Of course to do that, we would need to pass image dimensions as arguments to the kernel. Following sample kernel shows this process.

__global__ void kernel(unsigned char* image, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index of row number

    if(xIndex < width && yIndex < height)
    {
       //Do processing only here
    }
}

TLDR

Create the grid and block like this:

dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1)/block.x;
grid.y = (image_height + block.y - 1)/block.y;

Call the kernel and pass image dimensions as arguments like this:

kernel<<<grid, block>>>(...., image_width, image_height);

Perform bound checks inside the kernel like this:

__global__ void kernel(unsigned char* image, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index of row number

    if(xIndex < width && yIndex < height)
    {
       //Do processing only here
    }
}

Upvotes: 5

smremde
smremde

Reputation: 1895

Usually, you make the dimensions the next multiple up of the size you need, and then do a bound check in the kernel.

A simple example is here: https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

Here the number of blocks is calculated so the total number of threads is equal to or up to +256 above the number of threads needed.

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

And in the kernel, the calculation is only performed if it is required:

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

Upvotes: 2

Related Questions