Reputation: 16816
I am trying to define a template CUDA kernel for logical operations on an image. The code looks like this:
#define AND 1
#define OR 2
#define XOR 3
#define SHL 4
#define SHR 5
template<typename T, int opcode>
__device__ inline T operation_lb(T a, T b)
{
switch(opcode)
{
case AND:
return a & b;
case OR:
return a | b;
case XOR:
return a ^ b;
case SHL:
return a << b;
case SHR:
return a >> b;
default:
return 0;
}
}
//Logical Operation With A Constant
template<typename T, int channels, int opcode>
__global__ void kernel_logical_constant(T* src, const T val, T* dst, int width, int height, int pitch)
{
const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
if(xIndex >= width || yIndex >= height) return;
unsigned int tid = yIndex * pitch + (channels * xIndex);
#pragma unroll
for(int i=0; i<channels; i++)
dst[tid + i] = operation_lb<T,opcode>(src[tid + i],val);
}
The problem is that when I instantiate the kernel for bit shifting, the following compilation error arises
Error 1 error : Ptx assembly aborted due to errors
The kernel instants are like this:
template __global__ void kernel_logical_constant<unsigned char,1,SHL>(unsigned char*,unsigned char,unsigned char*,int,int,int);
There are 19 more instants like this for unsigned char
, unsigned short
, 1 and 3 channels and all logical operations. But only the bit shifting instances, i.e. SHL
and SHR
cause error. When I remove these instances, the code compiles and works perfectly.
The code also works if I replace the bit shifting with any other operation inside the operation_lb
device function.
I was wondering if this had anything to do with the amount of ptx code generated due to so many different instances of the kernel.
I am using CUDA 5.5, Visual Studio 2010, Windows 8 x64. Compiling for compute_1x, sm_1x
.
Any help would be appreciated.
Upvotes: 1
Views: 1038
Reputation: 152174
The original question specified that the poster was using compute_20, sm_20
. With that, I was not able to reproduce the error using the code here. However, in the comments it was pointed out that actually sm_10
was being used. When I switch to compiling for sm_10
I am able to reproduce the error.
It appears to be a bug in the compiler. I say this simply because I do not believe that the compiler should generate code that the assembler cannot handle. However beyond that I have no knowledge of the underlying root cause. I have filed a bug report with NVIDIA.
In my limited testing, it seems to only happen with unsigned char
not int
.
As a possible workaround, for cc2.0 and newer devices, specify -arch=sm_20
when compiling.
Upvotes: 2