Roberto Falcone

Reputation: 46

CUDA: Better performance with lower occupancy

I'm a student learning CUDA and I'm trying to write a CUDA algorithm for counting sort:

__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // define index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int count = 0;
    // check that the thread is not out of the vector boundary
    if (i >= dim_array) return;

    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i)
            count++;
    }
    counts[count] = array[i];
}

I tried to analyze my algorithm's performance with increasing block size. This is the graph of execution time against the corresponding block size:

[Figure: execution time versus block size]
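For reference, a measurement like this could be reproduced with something along the lines of the sketch below. It is only a rough sketch, not necessarily how the graph was produced: `sortAndTime` and `sweepBlockSizes` are hypothetical helper names, the input is assumed to be non-empty, the repeat count is chosen by the caller, and error checking on the CUDA calls is left out for brevity. The kernel is the one from the question with its closing brace added.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question, with the missing closing brace added.
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= dim_array) return;
    int count = 0;
    for (int j = 0; j < dim_array; ++j) {
        // Count elements that must come before array[i]; ties go to the lower index.
        if (array[j] < array[i] || (array[j] == array[i] && j < i))
            count++;
    }
    counts[count] = array[i];
}

// Hypothetical helper: sorts `input` (assumed non-empty) on the GPU with the
// given block size and stores the average kernel time in ms in *avgMs.
std::vector<int> sortAndTime(const std::vector<int> &input, int blockSize,
                             int repeats, float *avgMs) {
    int n = static_cast<int>(input.size());
    int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover all elements
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Warm-up launch so one-time startup costs are not included in the timing.
    kernelCountingSort<<<gridSize, blockSize>>>(d_in, n, d_out);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        kernelCountingSort<<<gridSize, blockSize>>>(d_in, n, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(avgMs, start, stop);  // total time for all repeats
    *avgMs /= repeats;

    std::vector<int> output(n);
    cudaMemcpy(output.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}

// Hypothetical helper: prints one timing line per block size from 32 to 1024.
void sweepBlockSizes(const std::vector<int> &input, int repeats) {
    for (int blockSize = 32; blockSize <= 1024; blockSize *= 2) {
        float ms = 0.0f;
        sortAndTime(input, blockSize, repeats, &ms);
        printf("block size %4d: %.3f ms per launch\n", blockSize, ms);
    }
}
```

Calling `sweepBlockSizes` from `main` with a host vector of test data and, say, 10 repeats would give one line per block size. Averaging over several launches after a warm-up tends to make the comparison between block sizes less noisy than timing a single launch.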

With a block size of 64 I get 100% occupancy; however, I achieve the best performance, that is, the minimum execution time, with a block size of 32. I'm asking if it's possible to have better performance with less occupancy.
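To make the occupancy figures concrete, here is a small sketch of the arithmetic. It assumes the compute capability 7.5 limits that apply to the T4 (1024 resident threads and 16 resident blocks per SM, so 32 warps per SM); `limitOnlyOccupancy` and `runtimeOccupancy` are hypothetical helper names. The first helper ignores register and shared memory limits, while the second asks the CUDA runtime, which does take the kernel's resource usage into account.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Kernel from the question, with the missing closing brace added.
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= dim_array) return;
    int count = 0;
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i] || (array[j] == array[i] && j < i))
            count++;
    }
    counts[count] = array[i];
}

// Hypothetical helper: occupancy when only the per-SM thread and block limits
// matter. Register and shared memory limits are deliberately ignored here.
double limitOnlyOccupancy(int blockSize, int maxThreadsPerSM, int maxBlocksPerSM) {
    const int kWarpSize = 32;
    int warpsPerBlock = (blockSize + kWarpSize - 1) / kWarpSize;
    int maxWarpsPerSM = maxThreadsPerSM / kWarpSize;
    // Resident blocks are capped by both the block limit and the warp limit.
    int residentBlocks = std::min(maxBlocksPerSM, maxWarpsPerSM / warpsPerBlock);
    return static_cast<double>(residentBlocks * warpsPerBlock) / maxWarpsPerSM;
}

// Hypothetical helper: ratio of resident threads to the SM maximum on device 0,
// using the runtime's estimate of resident blocks for this kernel.
double runtimeOccupancy(int blockSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelCountingSort,
                                                  blockSize, 0);
    return static_cast<double>(blocksPerSM * blockSize) /
           prop.maxThreadsPerMultiProcessor;
}
```

Under those assumed limits, `limitOnlyOccupancy(32, 1024, 16)` gives 0.5, because 16 blocks of one warp each fill only 16 of the 32 warp slots, and `limitOnlyOccupancy(64, 1024, 16)` gives 1.0, which matches the 100% figure for a block size of 64.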

I'm using Colab with a Tesla T4, with the following specs: [Image: Tesla T4 specifications]

Upvotes: 0

Views: 162

Answers (1)

Robert Crovella

Reputation: 151944

"I'm asking if it's possible to have better performance with less occupancy."

Yes, it's possible, and well-regarded papers have been written on that topic.

Explaining whether that makes sense in your particular case, using an incomplete snippet of code, and no information about GPU or execution environment, is not possible.

Upvotes: 2
