Noam Kremen

Reputation: 418

CUDA: How should I handle cases where the number of threads cannot be represented as a dimGrid*dimBlock?

Suppose my input consists of seven data points on which some calculation is performed, with the results written back to an output array of size 7. Declaring the block dimension to be 4 results in a grid size of 2, so 8 threads are launched; the thread with id 7 (computed as pt_id=blockIdx.x*blockDim.x+threadIdx.x) falls outside the valid range 0..6 and causes an invalid memory access (since I index some of my arrays by thread id). I could add code to my kernel that compares the thread id against a max_thread_id parameter and does nothing when thread_id>max_thread_id, but I'm wondering whether there's a prettier way to handle ragged input arrays.

Upvotes: 1

Views: 394

Answers (1)

stuhlo

Reputation: 1507

Having a task whose size is not a multiple of the block dimension is quite common. The solution I use most often is this. Suppose your input has size N and you want to launch with a block size of BLOCK_SIZE. Your launch configuration could then look like this:

kernel_function<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(...);

And at the kernel code, each thread determines whether it is supposed to do some work, something like this:

int id = blockIdx.x * blockDim.x + threadIdx.x;
if (id >= N) return;  // extra threads in the last block do nothing
/* do the stuff */

If the task size N depends on the input, you also have to pass this value to the kernel as a parameter. Further, it is quite common to define N and BLOCK_SIZE as macros or template parameters.

Finally, if your input array is small, as in your example, the GPU remains underutilized, and parallelism gains you nothing or may even decrease the performance of your algorithm.

Upvotes: 2
