CUDA kernel template instantiation causing compilation error

Question

I am trying to define a template CUDA kernel for logical operations on an image. The code looks like this:

#define AND 1
#define OR 2
#define XOR 3
#define SHL  4
#define SHR 5 

template 
__device__ inline T operation_lb(T a, T b)
{
    switch(opcode)
    {
    case AND:
        return a & b;
    case OR:
        return a | b;
    case XOR:
        return a ^ b;
    case SHL:
        return a << b;
    case SHR:
        return a >> b;
    default:
        return 0;
    }
}

//Logical Operation With A Constant
template 
__global__ void kernel_logical_constant(T* src, const T val, T* dst, int width, int height, int pitch)
{
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    if(xIndex >= width || yIndex >= height) return;

    unsigned int tid = yIndex * pitch + (channels * xIndex);

    #pragma unroll
    for(int i=0; i(src[tid + i],val);
}

The problem is that when I instantiate the kernel for bit shifting, the following compilation error arises

Error 1 error : Ptx assembly aborted due to errors

The kernel instants are like this:

template __global__ void kernel_logical_constant(unsigned char*,unsigned char,unsigned char*,int,int,int);

There are 19 more instants like this for unsigned char, unsigned short, 1 and 3 channels and all logical operations. But only the bit shifting instances, i.e. SHL and SHR cause error. When I remove these instances, the code compiles and works perfectly. The code also works if I replace the bit shifting with any other operation inside the operation_lb device function. I was wondering if this had anything to do with the amount of ptx code generated due to so many different instances of the kernel.

I am using CUDA 5.5, Visual Studio 2010, Windows 8 x64. Compiling for compute_1x, sm_1x.

Any help would be appreciated.

Robert Crovella · Accepted Answer

The original question specified that the poster was using compute_20, sm_20. With that, I was not able to reproduce the error using the code here. However, in the comments it was pointed out that actually sm_10 was being used. When I switch to compiling for sm_10 I am able to reproduce the error.

It appears to be a bug in the compiler. I say this simply because I do not believe that the compiler should generate code that the assembler cannot handle. However beyond that I have no knowledge of the underlying root cause. I have filed a bug report with NVIDIA.

In my limited testing, it seems to only happen with unsigned char not int.

As a possible workaround, for cc2.0 and newer devices, specify -arch=sm_20 when compiling.

CUDA kernel template instantiation causing compilation error

Answers (1)

Related Questions