What is the total thread count (executed over time, not in parallel) for CUDA?

Question

I need to execute a function about 10^11 times. The function is self-contained and takes one integer as input; let's call it f(n). The range of n is 0 < n < 10^11. Whether the endpoints are included doesn't matter; I just need to understand how to index something of this magnitude on CUDA.

I want to run this function using CUDA, but I am having trouble conceptually. I know how to simulate my n, mentioned above, using the block and thread indexes, as shown in slide 40 of the nVidia Tutorial. But what happens when n > TotalNumberOfThreadsPer_CUDA_Call?

Essentially, do the thread count and block count reset for every call I make to run functions on CUDA? If so, is there a simple way to simulate n, as described earlier, for arbitrarily large n?

Thanks.

user703016 · Accepted Answer

A common pattern when you want to process more elements than there are threads is to simply loop over your data in grid-sized chunks:

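__global__ void kernel(int* data, size_t size) {
    // Widen to size_t before multiplying so the index cannot wrap at 32 bits.
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         idx < size;
         idx += stride) {
        // do something with data[idx] ...
    }
}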
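
For the question's case, where f(n) depends only on the index and there is no input array, the grid-stride pattern might look like the sketch below. It is a minimal illustration, not a tuned implementation: `f` is a hypothetical stand-in (it flags multiples of 3), `count_kernel` and `run_count` are names made up here, the launch configuration of 256 blocks of 256 threads is arbitrary, and the limit is scaled down from 10^11 so that it finishes quickly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the asker's f(n): returns 1 when n is a multiple of 3.
__device__ unsigned long long f(unsigned long long n) {
    return (n % 3ULL == 0ULL) ? 1ULL : 0ULL;
}

// Each thread visits n = first, first + stride, first + 2 * stride, ... while n < limit.
__global__ void count_kernel(unsigned long long limit, unsigned long long* result) {
    const unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long local = 0;
    // Start at 1 because the range of interest is 0 < n < limit.
    for (unsigned long long n = (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x + 1ULL;
         n < limit;
         n += stride) {
        local += f(n);
    }
    atomicAdd(result, local);  // one atomic per thread, not one per element
}

// Host wrapper (hypothetical name): sums f(n) over 0 < n < limit on the GPU.
// Returns false if any CUDA call fails.
bool run_count(unsigned long long limit, unsigned long long* out) {
    unsigned long long* d_result = nullptr;
    if (cudaMalloc(&d_result, sizeof(*d_result)) != cudaSuccess) return false;
    bool ok = cudaMemset(d_result, 0, sizeof(*d_result)) == cudaSuccess;
    if (ok) {
        count_kernel<<<256, 256>>>(limit, d_result);
        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, d_result, sizeof(*out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_result);
    return ok;
}

int main() {
    unsigned long long total = 0;
    if (!run_count(1000000ULL, &total)) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("sum of f(n) for 0 < n < 1000000: %llu\n", total);
    return 0;
}
```

The loop index is `unsigned long long`, which is 64 bits wide in CUDA, so the same kernel should work unchanged when the limit is raised to 10^11; only the run time grows.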

Another option is to launch several consecutive kernels with a start offset:

__global__ void kernel(int* data, size_t size, size_t offset) {
    // Widen to size_t before multiplying so the index cannot wrap at 32 bits.
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x + offset;

    if (idx < size) {
        // do something with data[idx] ...
    }
}

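// Host code
dim3 gridSize = ...;
dim3 blockSize = ...;
// Each launch covers gridSize.x * blockSize.x consecutive indices (1D launch).
const size_t threadsPerLaunch = (size_t)gridSize.x * blockSize.x;
for (size_t offset = 0; offset < totalWorkSize; offset += threadsPerLaunch) {
    kernel<<<gridSize, blockSize>>>(data, totalWorkSize, offset);
}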
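
To make the multi-launch variant concrete, here is a small self-contained sketch under stated assumptions: `touch_kernel`, `process_in_chunks` and `covers_every_index_once` are illustrative names, the grid and block sizes are arbitrary, and the work size is kept small enough to allocate comfortably. It increments each element once and then checks on the host that the launches together covered every index exactly once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: each thread handles the single index offset + its global thread id.
__global__ void touch_kernel(int* data, size_t size, size_t offset) {
    const size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (idx < size) {
        data[idx] += 1;
    }
}

// Launches the kernel repeatedly until all indices in [0, total) are covered.
// Returns the number of launches issued. Assumes blocks and threads are nonzero.
size_t process_in_chunks(int* d_data, size_t total, unsigned int blocks, unsigned int threads) {
    const size_t perLaunch = (size_t)blocks * threads;
    size_t launches = 0;
    for (size_t offset = 0; offset < total; offset += perLaunch) {
        touch_kernel<<<blocks, threads>>>(d_data, total, offset);
        ++launches;
    }
    return launches;
}

// Runs the chunked launches on a zeroed buffer and reports whether
// every element ended up incremented exactly once.
bool covers_every_index_once(size_t total, unsigned int blocks, unsigned int threads,
                             size_t* launches) {
    int* d_data = nullptr;
    if (cudaMalloc(&d_data, total * sizeof(int)) != cudaSuccess) return false;
    bool ok = cudaMemset(d_data, 0, total * sizeof(int)) == cudaSuccess;
    std::vector<int> h_data(total);
    if (ok) {
        *launches = process_in_chunks(d_data, total, blocks, threads);
        ok = cudaMemcpy(h_data.data(), d_data, total * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    for (size_t i = 0; ok && i < total; ++i) ok = (h_data[i] == 1);
    return ok;
}

int main() {
    size_t launches = 0;
    const bool ok = covers_every_index_once(1000000, 64, 256, &launches);
    printf("launches: %zu, every index touched once: %s\n", launches, ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

Kernels issued to the same stream run in order, so no explicit synchronization is needed between the launches; the final `cudaMemcpy` waits for all of them.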

In both cases, you can process an "arbitrarily large" number of elements. You're still limited by the range of size_t: 10^11 exceeds 2^32 - 1 (about 4.29 × 10^9), so for 10^11 elements you will need to compile your code for 64 bits.
