user2398029

Reputation: 6937

Launch out of resources

I wrote the following simple CUDA kernel:

__global__ void pr_kernel(float* O, const float* I, const float* W, int N)
{
  int x = threadIdx.x;
  float sum = 0.0f;
  int i;
  if (x < N) {
    for (i = 0; i < N; i++) {
      if (i == x) continue;
      sum += W[x*N+i] * I[x];
    }
    O[x] = (0.15 / N) + 0.85 * sum;
  }
}

The variables are allocated in Python as follows:

N      = np.int32(4)
W      = np.float32(np.asarray(
         [0, 1, 0, 1,
          1, 0, 1, 1,
          0, 1, 0, 1,
          1, 1, 1, 0]))
I      = np.float32(np.asarray(
         [0.25, 0.25, 0.25, 0.25]))
O      = np.float32(np.zeros(N))

I'm transferring the variables using gpuarray.to_gpu, and I'm calling the kernel on a Tesla C2070 with the following line:

pr_kernel(O_d, I_d, W_d, N_d, block=blocksize, grid=gridsize)

Where:

blocksize = (128, 1, 1)
gridsize = (1, 1)
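
The host-side setup looks roughly like this (a sketch only: kernel_source is the string holding the pr_kernel code above, and the exact line that builds N_d is approximate):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule(kernel_source)         # kernel_source holds the kernel shown above
pr_kernel = mod.get_function("pr_kernel")

# Device copies of the host arrays defined above
O_d = gpuarray.to_gpu(O)
I_d = gpuarray.to_gpu(I)
W_d = gpuarray.to_gpu(W)
N_d = gpuarray.to_gpu(np.asarray(N))      # N is sent to the device as well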

I get the error message:

pycuda.driver.LaunchError: cuLaunchKernel failed: launch out of resources.

This happens even if I reduce blocksize to something like (8, 1, 1). I can run other CUDA programs on the GPU with a blocksize of (512, 1, 1), so I'm confident this is not due to a GPU configuration issue.

What am I doing wrong? Thanks for any help.

Upvotes: 1

Views: 1311

Answers (2)

kon psych

Reputation: 666

I ran into a similar problem when the type used in the kernel definition did not match the type of the argument passed from the host. Presumably the mismatched argument requires more resources than the kernel expects, which produces this error.
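
For example (a hypothetical sketch reusing the names from the question, not the original code): if the kernel declares int N but the call passes a 64-bit value, the argument no longer matches the kernel signature:

N_wrong = np.int64(4)                        # 8 bytes, but the kernel declares "int N" (4 bytes)
pr_kernel(O_d, I_d, W_d, N_wrong,
          block=(128, 1, 1), grid=(1, 1))    # can fail with "launch out of resources"

pr_kernel(O_d, I_d, W_d, np.int32(N_wrong),  # cast to the declared 32-bit type
          block=(128, 1, 1), grid=(1, 1))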

Upvotes: 0

user2398029

Reputation: 6937

The problem was that I was transferring the integer N to the GPU with gpuarray.to_gpu, when I should have been passing N directly to the pr_kernel function.
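
That is, the working call keeps the device arrays but passes the np.int32 scalar N itself (same blocksize and gridsize as in the question):

# O_d, I_d, W_d are still gpuarray instances on the device;
# N is the np.int32 scalar defined on the host and is passed as a plain kernel argument.
pr_kernel(O_d, I_d, W_d, N, block=blocksize, grid=gridsize)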

Upvotes: 1
