Reputation: 46
I'm a student learning CUDA, and I'm trying to write a counting sort algorithm in CUDA. This is the kernel I'm executing:
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // compute the global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // check that the thread is not out of the array boundary
    if (i >= dim_array) return;
    // count how many elements must precede array[i] in sorted order
    int count = 0;
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i) // break ties by index
            count++;
    }
    counts[count] = array[i];
}
and this is how I compute the block and grid dimensions:
dim_array = atoi(argv[1]);
block_size = atoi(argv[2]);
dim3 block(block_size);
grid_size = (dim_array - 1) / block.x + 1;
dim3 grid(grid_size);
I noticed that, when I allocate more threads in a block than the maximum number of threads allowed per block, the execution time decreases.
I'm using Colab with a Tesla T4, with the following specs:
Upvotes: 0
Views: 120
Reputation: 13463
Beyond what talonmies wrote: in general, using the maximum number of threads per block tends to decrease performance, especially when synchronization primitives are used. Effective utilization shrinks because threads reach barriers (or the end of the kernel) at different times but then have to wait for all the other threads in the block. With multiple blocks per multiprocessor, threads from a different block can probably still run in the meantime.
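To make the question's "faster" runs visible for what they are, you can query the device limit and check the launch result (a sketch using the CUDA runtime API; `launchChecked` is a hypothetical wrapper around the question's launch). A launch whose block size exceeds `cudaDevAttrMaxThreadsPerBlock` fails immediately with an invalid-configuration error, so the kernel never runs at all, and timing it measures nothing:

    #include <cstdio>
    #include <cuda_runtime.h>

    void launchChecked(int *d_array, int dim_array, int *d_counts, int block_size) {
        int max_threads = 0;
        cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0);
        if (block_size > max_threads)
            printf("block_size %d exceeds device limit %d\n", block_size, max_threads);

        dim3 block(block_size);
        dim3 grid((dim_array - 1) / block.x + 1);
        kernelCountingSort<<<grid, block>>>(d_array, dim_array, d_counts);

        cudaError_t err = cudaGetLastError();  // catches launch-configuration errors
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));
        cudaDeviceSynchronize();               // wait for (and surface) execution errors
    }

Checking `cudaGetLastError()` right after the `<<<...>>>` launch is the standard way to distinguish "the kernel ran fast" from "the kernel was never launched".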
Upvotes: 1