Reputation: 46
I'm a student learning CUDA and I'm trying to write a CUDA kernel for counting sort:
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // compute the global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // check that the thread is not out of the vector boundary
    if (i >= dim_array) return;
    int count = 0;
    // rank of array[i]: number of smaller elements, plus earlier equal ones
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i)
            count++;
    }
    counts[count] = array[i];
}
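For reference, here is a sequential CPU sketch of the same rank computation (my own translation of the kernel above, not from the question), which is handy for validating the GPU output on small inputs:

```c
#include <stdio.h>

/* Sequential reference of the kernel's logic: each element's output
 * position is the number of strictly smaller elements, plus the number
 * of equal elements that appear earlier in the array (the j < i tie
 * break), which keeps duplicates from colliding on the same slot. */
void counting_sort_ref(const int *array, int dim_array, int *out) {
    for (int i = 0; i < dim_array; ++i) {
        int count = 0;
        for (int j = 0; j < dim_array; ++j) {
            if (array[j] < array[i])
                count++;
            else if (array[j] == array[i] && j < i)
                count++;
        }
        out[count] = array[i];
    }
}
```

Note this is O(n^2) work in total (each of the n elements scans all n elements), the same total work the kernel does in parallel.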
I tried to analyze my algorithm's performance with increasing block size; here is the execution-time graph with the corresponding block sizes:
With a block size of 64 I get 100% occupancy, yet I achieve the best performance, i.e. the minimum execution time, with a block size of 32. I'm asking if it's possible to have better performance with less occupancy.
I'm using colab with a Tesla T4, with the following specs:
Upvotes: 0
Views: 162
Reputation: 151944
> I'm asking if it's possible to have better performance with less occupancy.
Yes, it's possible, and well regarded papers have been written on that topic.
Explaining whether that is what is happening in your particular case is not possible from an incomplete snippet of code with no information about the GPU or execution environment.
Upvotes: 2