Reputation: 366
I am working on a CUDA program that uses templates. The kernels would be instantiated with the datatype cuComplex or cuDoubleComplex. Depending on the datatype with which the kernel is instantiated, I need to declare a constant which would reside in the constant memory space of the CUDA device. For implementing this, I've done:
// declared this globally
template <typename T>
__device__ __constant__ T some_constant;
// set the constant through a function call in main().
// kernel templated
template <typename T>
__global__ void kernel()
{
    T b_k = cuGet<T>(0.0, 0.0);
    T x_k_1 = cuGet<T>(2.0, 2.0);
    // cuGet returns a complex no. of type
    // cuComplex or cuDoubleComplex depending on T.
    b_k = cuAdd(b_k, cuMul(some_constant, x_k_1));
    // cuAdd, cuMul, cuGet are all overloaded functions.
    // They can take cuComplex or cuDoubleComplex params.
    // Here, some_constant has to be cuComplex or cuDoubleComplex, depending
    // on the datatype of the other arg x_k_1 to cuMul.
    // Therefore, I went about implementing a templated constant.
}
On compiling, this gives an error: "some_constant" is not a function or static data member.
One option to solve this issue could be to define a type conversion from cuDoubleComplex to cuComplex, declare the constant as cuDoubleComplex instead of templating it, and typecast the constant wherever it's used in the kernel, along the lines of the sketch below.
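A minimal sketch of that cast-based idea (cuCastToFloat is just an illustrative name, not part of the CUDA headers; cuComplex.h's cuComplexDoubleToFloat does essentially the same conversion):
#include <cuComplex.h>
// Constant always stored at double precision in constant memory.
__device__ __constant__ cuDoubleComplex some_constant_d;
// Narrowing helper, called at each use site when the kernel is instantiated with cuComplex.
__device__ __inline__ cuComplex cuCastToFloat(cuDoubleComplex z)
{
    return make_cuComplex((float)cuCreal(z), (float)cuCimag(z));
}
// In the cuComplex instantiation the use would then become e.g.:
//   b_k = cuAdd(b_k, cuMul(cuCastToFloat(some_constant_d), x_k_1));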
Is there any other way apart from this?
Thanks in advance.
Upvotes: 4
Views: 1323
Reputation: 2916
If the main request is to avoid the type conversion, you can work around it with a template class whose inline device functions read the constant (the rest is inspired by a proposal from @RobertCrovella in a comment). Finally, a macro makes the call site look like a plain variable (admittedly not a perfectly clean design, but it keeps the same syntax). Here is an example of how it could work:
#include <cstdio>
#include <cuda_runtime.h>

// Generic holder: each supported type specializes value() to reinterpret
// the raw constant-memory buffer as that type.
template <typename T>
struct holder
{
    static __device__ __inline__ T value();
    static __device__ __inline__ void init(T val);
};

// Raw backing storage in constant memory, sized for the largest supported type.
__constant__ char data[16];

template <>
__device__ __inline__ int holder<int>::value() { return *((int*)data); }
template <>
__device__ __inline__ long holder<long>::value() { return *((long*)data); }

// Keeps the original syntax; expects the kernel's template parameter to be named T.
#define some_constant holder<T>::value()

template <typename T>
__global__ void kernel(T* res)
{
    *res = some_constant;
}

int main()
{
    int* dres;
    cudaMalloc(&dres, sizeof(int));

    // Write the value into the raw constant-memory buffer.
    int val = 42;
    cudaMemcpyToSymbol(data, &val, sizeof(int));

    kernel<int><<<1,1>>>(dres);

    int hres;
    cudaMemcpy(&hres, dres, sizeof(int), cudaMemcpyDeviceToHost);
    printf("RES = %d\n", hres);
    return 0;
}
The holder<T>::value() call gets inlined, the pointer cast is optimized away, and the value is read from constant memory with the appropriate type and no conversion (here is the generated PTX):
// .globl _Z6kernelIiEvPT_
.const .align 4 .b8 data[16];
.visible .entry _Z6kernelIiEvPT_(
.param .u32 _Z6kernelIiEvPT__param_0
)
{
.reg .b32 %r<4>;
ld.param.u32 %r1, [_Z6kernelIiEvPT__param_0];
cvta.to.global.u32 %r2, %r1;
ld.const.u32 %r3, [data];
st.global.u32 [%r2], %r3;
ret;
}
The main downside is the macro, which expects the kernel's template type parameter to be named T.
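If that constraint bothers you, the macro can be dropped and holder<T>::value() called directly, whatever the parameter happens to be named (a minimal sketch reusing the holder and data definitions above; kernel_no_macro is just an illustrative name):
template <typename U>                    // parameter name no longer matters
__global__ void kernel_no_macro(U* res)
{
    *res = holder<U>::value();           // explicit call instead of the some_constant macro
}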
Upvotes: 4