gesp

Reputation: 23

Grid-stride loop in CUDA matrix operations: why do we need it?

__global__ void substract(float *A, float *B, float *res, int *n)
{
    int size = *n;
    // globally unique index of this thread across the whole grid
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < size)
    {
        res[tid] = A[tid] - B[tid];
        // jump ahead by the total number of threads in the grid
        tid += blockDim.x * gridDim.x;
    }
}


int function(...) {
    int threadsPerBlock = 256;
    int blocks = (n+threadsPerBlock-1)/threadsPerBlock;
    int blocksPerGrid = 32<blocks ? 32 : blocks;
    ...
    substract<<<blocksPerGrid, threadsPerBlock>>>(A, B, res, n);
    ...
}

So I wrote this code: it takes an array A representing an n x n matrix and a second array B representing a vector of size n, and subtracts one from the other. Let's say the size is 1000x1000. I wrote it mostly by following the examples in various CUDA guides, but I can't understand why we need this part: tid += blockDim.x * gridDim.x;

Since the incremented value will never fit as an array index (it will always be greater than or equal to 1024, while my array only has indices 0-999), it seems useless to me. But without it my program crashes: the screen turns black, and after a few seconds it comes back and I get a pop-up saying the drivers have recovered. So I tried to understand why I can't just go through the whole array with tid = threadIdx.x + blockIdx.x*blockDim.x;. I printed all the tids before the while loop, and they cover 0 to 1023 in random order, so I guess tid += blockDim.x * gridDim.x; can never produce anything inside my array boundaries.

Upvotes: 0

Views: 1454

Answers (1)

Robert Crovella

Reputation: 151879

If your array size is equal to or smaller than your grid size, then the grid-stride loop doesn't provide much benefit. It is particularly useful when the array size is bigger than the grid size, or when you want to write a kernel that can flexibly handle arbitrary array sizes without having to adjust your grid size.

However, the grid-stride addition code may still be needed even if your array size is less than or equal to the grid size. The reason for this will become evident if you think carefully about your while-loop: this addition operation is needed to cause the while-loop to terminate on all threads.

Suppose your array size is 1024 and it is equal to your grid size of 1024 threads (whether all in one block or not; doesn't matter).

Initially your threads will have tid values of 0-1023. None of these values causes the while loop to terminate. If the while loop never terminates, your kernel will hang and run forever, until or unless a kernel timeout kicks in (which is what you are seeing).

But with the addition statement, after the first while-loop iteration, each thread has a tid value of 1024 or greater, which will cause the while loop to terminate for all threads (assuming size is 1024 or less).

Upvotes: 1
