gesp

Reputation: 23

Grid-stride loop in CUDA matrix operations: why do we need it?

__global__ void substract(float *A, float *B, float *res, int *n)
{
    int size = *n;
    // globally unique index of this thread across the whole grid
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < size)
    {
        res[tid] = A[tid] - B[tid];
        // jump ahead by the total number of threads in the grid
        tid += blockDim.x * gridDim.x;
    }
}


int function(...) {
    int threadsPerBlock = 256;
    int blocks = (n+threadsPerBlock-1)/threadsPerBlock;
    int blocksPerGrid = 32<blocks ? 32 : blocks;
    ...
    substract<<<blocksPerGrid, threadsPerBlock>>>(A, B, res, n);
    ...
}

So I wrote this code: it takes an array A representing an n x n matrix and a second array B representing a vector of size n, and subtracts one from the other. Let's say the size is 1000x1000. I wrote it mostly by following the examples in various CUDA guides, but I can't understand why we need this part: tid += blockDim.x * gridDim.x;

Since the incremented value will never fit as an array index (it will always be greater than or equal to 1024, while my array only has indices 0-999), it seems useless to me. But without it my program crashes: the screen turns black, and after a few seconds it comes back and I get a pop-up saying the drivers have recovered. So I tried to understand why I can't just go through the whole array with tid = threadIdx.x + blockIdx.x*blockDim.x;. I printed all the tids before the while loop, and they cover 0 to 1023 in random order, so I guess tid += blockDim.x * gridDim.x; can never produce anything inside my array boundaries.

Upvotes: 0

Views: 1454

Answers (1)

Robert Crovella

Reputation: 151879

If your array size is equal to or smaller than your grid size, then the grid-stride loop doesn't provide much benefit. It is particularly useful when the array size is bigger than the grid size, or when you want to write a kernel that can flexibly handle arbitrary array sizes without having to adjust your grid size.

However, the grid-stride addition code may still be needed even if your array size is less than or equal to the grid size. The reason for this will become evident if you think carefully about your while-loop: this addition operation is needed to cause the while-loop to terminate on all threads.

Suppose your array size is 1024 and it is equal to your grid size of 1024 threads (whether all in one block or not; doesn't matter).

Initially your threads will have tid values of 0-1023. None of these values causes the while loop to terminate. If the while loop never terminates, your kernel will hang and run forever, until or unless a kernel timeout kicks in (which is what you are seeing).

But with the addition statement, after the first while-loop iteration, each thread has a tid value of 1024 or greater, which will cause the while loop to terminate for all threads (assuming size is 1024 or less).

Upvotes: 1
