Roberto Falcone

Reputation: 46

CUDA: Why does the kernel's execution time decrease if I allocate more threads in a block than the maximum number?

I'm a student learning CUDA and I'm trying to write a counting sort algorithm in CUDA. This is the kernel I execute:

__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // compute this thread's global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // check that the thread is not out of the array boundary
    if (i >= dim_array) return;

    // count how many elements sort before array[i];
    // ties are broken by original position so equal keys keep distinct slots
    int count = 0;
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i)
            count++;
    }
    counts[count] = array[i];
}

and this is how I compute the block and grid dimensions:

dim_array = atoi(argv[1]);
block_size = atoi(argv[2]);
    
dim3 block(block_size);
grid_size = (dim_array - 1) / block.x + 1;
dim3 grid(grid_size);

I noticed that when I allocate more threads per block than the maximum number allowed for a block, the execution time decreases.
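One thing worth checking (this is a sketch, not part of my original code; `d_array` and `d_counts` stand for the device pointers) is whether the kernel actually launched at all. If `block_size` exceeds the device limit, the launch fails with a configuration error and the kernel never runs, which would make the measured time shorter:

```cuda
// Sketch: verify the launch succeeded before trusting any timing.
// On a Tesla T4 the maximum is 1024 threads per block; a larger
// block_size makes the launch fail silently unless checked.
kernelCountingSort<<<grid, block>>>(d_array, dim_array, d_counts);
cudaError_t err = cudaGetLastError();   // reports launch-configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();                // wait so the timing covers the whole kernel
```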

I'm using Colab with a Tesla T4. [Image: Tesla T4 device specifications]

Upvotes: 0

Views: 120

Answers (1)

Homer512

Reputation: 13463

Beyond what talonmies wrote: in general, using the maximum number of threads per block tends to decrease performance, especially when synchronization primitives are used. Effective utilization shrinks because threads reach barriers, or the end of the kernel, at different times but then have to wait for every other thread in the block. With multiple blocks per multiprocessor, threads from a different block can likely still run in the meantime.
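You can inspect this trade-off with the CUDA occupancy API. A minimal sketch, reusing the kernel name from the question (the block sizes are illustrative):

```cuda
// Sketch: compare how many blocks of the kernel can be resident per SM
// at a small block size versus the maximum block size.
int blocksSmall = 0, blocksMax = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksSmall, kernelCountingSort, 256, 0);   // 256 threads, no dynamic smem
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksMax, kernelCountingSort, 1024, 0);    // maximum block size
printf("resident blocks/SM: %d at 256 threads, %d at 1024 threads\n",
       blocksSmall, blocksMax);
// Several smaller resident blocks let the SM keep doing useful work
// while one block's threads stall at a barrier or finish early.
```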

Upvotes: 1
