malxmusician212

Reputation: 113

GPU Kernel Blocksize/Gridsize without Threads

I'm currently programming some numerical methods on a GPU via PyCUDA/CUDA and am writing my own kernels. At some point, I need to estimate the error for at least 1000 coupled ODEs. I don't want to copy a couple of vectors with over 1000 entries back to the host, so I created a kernel (at the bottom of the post) that is a basic max function. The %(T)s and %(N)s are string substitutions I make at runtime, which should be irrelevant to this question (T represents a complex datatype and N represents the number of coupled ODEs).

My question is: there is no need for parallel computation here, so I do not use threads. When I call this function from Python, what should I specify for the block size and grid size?

    __global__ void get_error(double *max_error, %(T)s error_vec[1][%(N)s])
    {
        max_error[0] = error_vec[0][0].real();
        for (int ii = 0; ii < %(N)s; ii = ii + 1)
        {
            if (max_error[0] < error_vec[0][ii].real())
            {
                max_error[0] = error_vec[0][ii].real();
            }
        }
        return;
    }

Upvotes: 0

Views: 81

Answers (1)

Robert Crovella

Reputation: 151944

In a kernel launch, the total number of threads that will be spun up on the GPU is equal to the product of the grid size and block size specified for the launch.

Both of these values must be positive integers; therefore, the only combination that launches a single thread is a grid size of 1 and a block size of 1.

CUDA kernels are not required to make any specific reference to the built-in variables (e.g. blockIdx, threadIdx, etc.) but normally do so in order to differentiate behavior among threads. In the case where you have only one thread being launched, there's no particular reason to use these variables, and it's not necessary to do so.

A CUDA kernel launch of only a single thread is not a performant method for getting work done, but there may be specific cases where it is convenient to do so and does not have a significant performance impact on the application as a whole.
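As a concrete sketch of such a launch (variable names here are hypothetical; `d_max_error` and `d_error_vec` are assumed to be device allocations matching your kernel's parameters), the host-side CUDA C equivalent would be:

```cuda
// Launch the questioner's kernel with a grid of one block containing one
// thread: grid size 1, block size 1.
get_error<<<1, 1>>>(d_max_error, d_error_vec);
cudaDeviceSynchronize();   // wait for the single thread to finish
```

In PyCUDA, the same launch configuration is expressed through the `block` and `grid` keyword arguments of the compiled function, e.g. `func(max_error_gpu, error_vec_gpu, block=(1, 1, 1), grid=(1, 1))`.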

It's not obvious to me why your proposed kernel couldn't be recast as a thread-parallel kernel (it appears to be performing a max-finding reduction), but that seems to be separate from the point of your question.
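To illustrate what such a recast might look like, here is a hedged sketch of a block-wide max reduction over the real parts. It assumes %(N)s is a power of two no larger than 1024 (you would pad otherwise) and that the kernel is launched with a single block of %(N)s threads, e.g. `get_error_parallel<<<1, %(N)s>>>(...)`; the kernel name is illustrative, not from the original post:

```cuda
__global__ void get_error_parallel(double *max_error, %(T)s error_vec[1][%(N)s])
{
    __shared__ double sdata[%(N)s];
    int tid = threadIdx.x;

    // Each thread stages the real part of one entry in shared memory.
    sdata[tid] = error_vec[0][tid].real();
    __syncthreads();

    // Tree reduction: halve the number of active threads each iteration,
    // each surviving thread keeping the larger of its pair.
    for (int s = %(N)s / 2; s > 0; s >>= 1)
    {
        if (tid < s && sdata[tid + s] > sdata[tid])
        {
            sdata[tid] = sdata[tid + s];
        }
        __syncthreads();
    }

    // After the loop, thread 0 holds the maximum.
    if (tid == 0)
    {
        max_error[0] = sdata[0];
    }
}
```

This does O(N) work across N threads in O(log N) steps instead of one thread walking the whole array serially.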

Upvotes: 1
