Hailiang Zhang

Reputation: 18860

How to avoid using more threads than the maximum allowed on the GPU?

As described in a previous post (how to find the number of maximum available threads in CUDA?), I found that the maximum number of threads on my GPU card is 21504. However, when I assign more than that number of threads to the kernel launch, everything runs smoothly.

#include <stdio.h>
#include <cuda_runtime.h>

// Empty kernel, used only to test the launch configuration.
__global__ void dummy()
{
}

int main()
{
    //int N=21504;
    int N=21504*40;
    // Launch a single block with N threads.
    dummy<<<1,N>>>();
    return 0;
}

I don't know what happened, but I believe we should avoid this, and I'm not sure how to do it.

Upvotes: 0

Views: 1471

Answers (1)

Roger Dahl

Reputation: 15724

Your example did not run correctly. It only appeared to run correctly because you did not check the CUDA error status after the kernel launch.
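
Here is a minimal sketch of the same launch with the error status checked afterwards (only standard CUDA runtime calls; nothing else is assumed):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy()
{
}

int main()
{
    int N = 21504*40;
    dummy<<<1,N>>>();

    // cudaGetLastError() reports problems with the launch configuration itself,
    // e.g. requesting more threads per block than the device allows.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(err));

    // cudaDeviceSynchronize() waits for the kernel and reports errors that
    // occur while it executes.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(err));

    return 0;
}

With these checks in place, the oversized launch should report a launch error (typically an invalid configuration) instead of appearing to succeed.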

The comment I made on your other question also applies here:

The maximum number of threads per multiprocessor is the upper limit on how many threads can be "in flight" at the same time; other limiting factors will normally reduce that number further. This value does not restrict how many threads a kernel launch may request, and it is not very useful for determining the number of threads needed for optimal performance.
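
As an aside, if the 21504 figure came from multiplying the per-multiprocessor thread limit by the number of multiprocessors (an assumption on my part about how it was obtained), you can query those values from the runtime; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    // Upper bound on how many threads can be resident ("in flight") at once.
    // It is not a limit on how many threads a kernel launch may request.
    printf("Resident thread limit: %d\n",
           prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    return 0;
}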

Your card is a compute capability 2.0 device. See the Features and Technical Specifications section in the CUDA Programming Guide for details on the limitations of your device. In particular, your device is limited to a grid size of 65535 in each of the X, Y and Z dimensions. You attempted to launch a single block of 21504 * 40 threads (a grid of X = 1, Y = 1, Z = 1), which exceeds the per-block limit described below.

Your device is also limited to 1024 threads per block. So, in theory, you can launch up to 65535 * 65535 * 65535 blocks, each with up to 1024 threads, at the same time.

There is no performance penalty to launching kernels with many more threads than the maximum number of resident threads your device supports.
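
As a sketch of one common way to stay within those limits, pick a block size below the per-block maximum and derive the block count from the total number of threads (the block size of 256 is just an example choice):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy()
{
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    int N = 21504*40;                   // total number of threads wanted

    int threadsPerBlock = 256;          // comfortably below prop.maxThreadsPerBlock
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    if (blocks > prop.maxGridSize[0])
    {
        printf("Grid too large for this device\n");
        return 1;
    }

    dummy<<<blocks, threadsPerBlock>>>();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        printf("Launch error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

Here 21504 * 40 = 860160 threads become 3360 blocks of 256 threads, well within both limits.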

Upvotes: 3
