Reputation: 46
I'm a student learning CUDA, and I'm trying to write a counting sort algorithm in CUDA. This is the kernel I'm executing:
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // compute the global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // check that the thread is not out of the array boundary
    if (i >= dim_array) return;
    // count how many elements must precede array[i] in sorted order
    int count = 0;
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i) // break ties by index
            count++;
    }
    counts[count] = array[i];
}
and this is how I compute the block and grid dimensions:
dim_array = atoi(argv[1]);
block_size = atoi(argv[2]);
dim3 block(block_size);
grid_size = (dim_array - 1) / block.x + 1;
dim3 grid(grid_size);
I noticed that, when I allocate more threads in a block than the maximum number of threads allowed per block, the execution time decreases.
I'm using Colab with a Tesla T4, with the following specs:
Upvotes: 0
Views: 120
Reputation: 13463
Beyond what talonmies wrote: in general, using the maximum number of threads per block tends to decrease performance, especially when synchronization primitives are used. Effective utilization shrinks because threads reach barriers (or the end of the kernel) at different times but then have to wait for all the other threads in the block. With multiple blocks per multiprocessor, threads from a different block can probably still run in the meantime.
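To make the question's "faster" runs visible for what they are, you can query the device limit and check the launch result (a sketch using the CUDA runtime API; `launchChecked` is a hypothetical wrapper around the question's launch). A launch whose block size exceeds `cudaDevAttrMaxThreadsPerBlock` fails immediately with an invalid-configuration error, so the kernel never runs at all, and timing it measures nothing:

    #include <cstdio>
    #include <cuda_runtime.h>

    void launchChecked(int *d_array, int dim_array, int *d_counts, int block_size) {
        int max_threads = 0;
        cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0);
        if (block_size > max_threads)
            printf("block_size %d exceeds device limit %d\n", block_size, max_threads);

        dim3 block(block_size);
        dim3 grid((dim_array - 1) / block.x + 1);
        kernelCountingSort<<<grid, block>>>(d_array, dim_array, d_counts);

        cudaError_t err = cudaGetLastError();  // catches launch-configuration errors
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));
        cudaDeviceSynchronize();               // wait for (and surface) execution errors
    }

Checking `cudaGetLastError()` right after the `<<<...>>>` launch is the standard way to distinguish "the kernel ran fast" from "the kernel was never launched".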
Upvotes: 1