Marco A.

Reputation: 43662

CUDA - no blocks, just threads for undefined dimensions

I have some matrices whose sizes are not known in advance and vary from 10 to 20,000 in both dimensions.

I designed a CUDA kernel with an (x; y) grid of blocks and (x; y) threads per block.

Since the matrix width/height aren't multiples of my block dimensions, it was a terrible pain to get things to work, and the code is becoming more and more complicated just to achieve coalesced memory reads.

Besides all of that, the kernel keeps growing, using more and more registers for correctness checks... so I don't think this is the approach I should adopt.

My question is: what if I eliminate blocks entirely and just create a grid of (x; y) threads? Will an SM have problems without many blocks?

Can I eliminate blocks and use a large number of threads, or is the block subdivision necessary?

Upvotes: 0

Views: 1027

Answers (2)

Jonas Bötel

Reputation: 4482

Here's the code I use to divide a task requiring num_threads into a block and grid configuration. Yes, you might end up launching too many blocks (but only a few extra), and you will probably end up with more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.

PS: I always use block_size == 128 because it has been a good trade-off between multiprocessor occupancy, register usage, shared memory requirements and coalesced access for all of my kernels.

Code to calculate a good grid size (host):

#include <assert.h>

//maximum number of blocks per grid dimension (pre-Kepler hardware limit)
#define GRID_SIZE 65535

//ceil division (rounding up), defined below
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator);

//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {

    //block
    block->x = block_size;
    block->y = 1;
    block->z = 1;

    //number of blocks needed to cover num_threads
    unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
    unsigned int total_threads = num_blocks * block_size;
    assert(total_threads >= num_threads);

    //calculate grid size (split num_blocks over x and y so neither dimension exceeds GRID_SIZE)
    unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
    unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
    unsigned int total_blocks = gx * gy;
    assert(total_blocks >= num_blocks);

    //grid
    grid->x = gx;
    grid->y = gy;
    grid->z = 1;
}

//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1) / denominator;
}

Code to calculate the unique thread id and check boundaries (device):

//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {

    //calculate unique id (2D grid of 1D blocks)
    const unsigned int thread_id = threadIdx.x;
    const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
    const unsigned int unique_id = thread_id + block_id * blockDim.x;

    //check range: surplus threads simply exit
    if (unique_id >= num_threads) return;

    //do the actual work
    ...
}

I don't think that's a lot of effort/registers/lines-of-code to check for correctness.
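Putting the two pieces together, a launch might look roughly like this (just a sketch: launchFoo, the data pointer and the element count are made up for illustration, and I'm assuming kernelFoo takes the data pointer as its second argument):

//illustrative host-side launch; launchFoo and data are hypothetical
void launchFoo(float* data, unsigned int width, unsigned int height) {
    unsigned int num_threads = width * height;  //one thread per matrix element

    dim3 grid, block;
    kernelUtilCalcGridSize(num_threads, 128, &grid, &block);

    kernelFoo<<<grid, block>>>(num_threads, data);
}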

Upvotes: 2

Scott

Reputation: 386

You can't really just make a "grid of threads", since threads have to be organized into blocks and a block can hold at most 512 threads. You could effectively do this by using 1 thread per block, which would give you an X by Y grid of 1x1 blocks. However, this will result in pretty terrible performance due to several factors:

  1. According to the CUDA Programming Guide, an SM can handle a maximum of 8 blocks at any time. With 1 thread per block, this limits you to 8 threads per SM, which isn't enough to fill even a single warp (32 threads). Even a GPU with, say, 48 multiprocessors would only run 384 threads at any given time.

  2. With only 8 threads resident on an SM, there are far too few warps to hide memory latency. The GPU will spend most of its time waiting for memory accesses to complete rather than doing any computation.

  3. You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.

  4. You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.

While having to add boundary checks so that every thread in a block stays in range is annoying, your performance will be vastly better than with the "grid of threads" idea; a minimal sketch of that conventional layout follows below.
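For reference, here is a minimal sketch of the conventional 2D layout for a matrix kernel. Everything here (kernelBar, launchBar, the 16x16 block shape and the per-element doubling) is hypothetical and just for illustration; the point is that out-of-range threads simply return, so the matrix dimensions don't need to be multiples of the block dimensions.

//hypothetical element-wise kernel on a 2D grid of 16x16 blocks
__global__ void kernelBar(float* data, unsigned int width, unsigned int height) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    //surplus threads in the partial blocks at the edges do nothing
    if (x >= width || y >= height) return;

    data[y * width + x] *= 2.0f;  //some arbitrary per-element work
}

//host side: round the grid up so the whole matrix is covered
void launchBar(float* data, unsigned int width, unsigned int height) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    kernelBar<<<grid, block>>>(data, width, height);
}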

Upvotes: 4
