Controlling the number of threads that enter a CUDA function called from an OpenACC compute region

Question

I'm calling a CUDA function from an OpenACC compute region, and I want to specify the number of threads that enter the CUDA function, but I can't figure out how to control that.

%main.cpp
..

#pragma acc routine vector
extern "C" void CUDA_KERNEL_FUNCTION(double *B, int ldb,const double *A, int lda);
..
#pragma acc parallel loop independent collapse(3) gang vector(128)
  for(int i0 = 0; i0 < size0 - 31; i0+= 32)
     for(int i1 = 0; i1 < size1 - 31; i1+= 32)
        for(int i2 = 0; i2 < size2; i2+= 1)
          CUDA_KERNEL_FUNCTION(B, ldb, A, lda);

..
..


%cuda_code.cu

extern "C" __device__ void CUDA_KERNEL_FUNCTION(double *B, int ldb,const double *A, int lda)
{
    // Total number of threads in the block executing this function
    int Num_Threads_gpu = blockDim.x * blockDim.y * blockDim.z;

    // Num_Threads_gpu is always 32
}

Compilation succeeds, but no matter what vector length I use, the number of threads that enter the CUDA function is always 32. Is there any way to specify that?

I'm using "cuda/7.0.28" and "pgi/15.10".

Thanks

jefflarkin · Accepted Answer

Try changing vector(128) to vector_length(128). I think PGI 15.10 supports both syntaxes, but just in case...

If that doesn't work, can you please post the compiler output with -Minfo=accel so that we can see what the compiler is doing?
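
For reference, here is a minimal sketch of where the vector_length clause goes. It is not the code from the question: the scale function and its arguments are hypothetical, and the loop is a plain element-wise operation rather than a call into a CUDA routine. vector_length(128) is placed on the parallel construct, while the bare vector clause marks the loop as vector-partitioned. On NVIDIA targets this typically corresponds to 128 threads per block, although the compiler may treat the value as a request rather than a guarantee.

```cpp
#include <vector>

// Hypothetical helper (not from the question): writes factor * in[i] to out[i].
// vector_length(128) on the parallel construct requests 128 vector lanes per gang.
// A compiler without OpenACC support ignores the pragma and runs the loop serially,
// producing the same result.
std::vector<double> scale(const std::vector<double>& in, double factor)
{
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in.size());

    // Raw pointers so the data clauses can describe the array sections.
    const double *a = in.data();
    double *b = out.data();

    #pragma acc parallel loop gang vector vector_length(128) copyin(a[0:n]) copyout(b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] = factor * a[i];

    return out;
}
```

Calling scale({1.0, 2.0, 3.0}, 2.0) returns a vector holding 2.0, 4.0 and 6.0 whether or not OpenACC is enabled. Only the mapping of iterations to threads changes.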
