Reputation: 1849
I have several CUDA kernels which are basically doing the same thing with some variations. What I would like to do is to reduce the amount of code needed. My first thought was to use macros, so my resulting kernels would look like this (simplified):
__global__ void kernelA( ... )
{
    INIT(); // macro to initialize variables
    // do specific stuff for kernelA
    b = a + c;
    END(); // macro to write back the result
}

__global__ void kernelB( ... )
{
    INIT(); // macro to initialize variables
    // do specific stuff for kernelB
    b = a - c;
    END(); // macro to write back the result
}
...
Since macros are nasty, ugly, and evil, I am looking for a better and cleaner way. Any suggestions?
(A switch statement would not do the job: in reality, the parts that are the same and the parts that are kernel-specific are closely interwoven. Several switch statements would be needed, which would make the code pretty unreadable. Furthermore, plain function calls would not initialize the needed variables.)
(This question might be answerable for general C++ as well; just replace 'CUDA kernel' with 'function' and remove '__global__'.)
Upvotes: 1
Views: 238
Reputation: 24596
Updated: I was told in the comments that classes and inheritance don't mix well with CUDA. Therefore only the first part of the answer applies to CUDA, while the rest answers the more general C++ part of your question.
For CUDA, you will have to use plain functions, "C-style":
struct KernelVars {
    int a;
    int b;
    int c;
};

__device__ void init(KernelVars& vars) {
    INIT(); // whatever the actual code is
}

__device__ void end(KernelVars& vars) {
    END(); // whatever the actual code is
}

__global__ void KernelA(...) {
    KernelVars vars;
    init(vars);
    vars.b = vars.a + vars.c;
    end(vars);
}
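As a minimal sketch of what init and end could expand to, assuming INIT() loads the operands by global thread index and END() writes the result back (the buffer parameters are illustrative, not part of your original macros):

__device__ void init(KernelVars& vars, const int* a, const int* c)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x; // global thread index
    vars.a = a[id]; // load the operands
    vars.c = c[id];
}

__device__ void end(const KernelVars& vars, int* out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    out[id] = vars.b; // write back the result
}

Since these are __device__ functions, the compiler will normally inline them, so there is no call overhead compared to the macro version.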
The rest is the answer for general C++, where you would use OOP techniques like constructors and destructors (they are perfectly suited for such init/end pairs), or the template method pattern, which can be used in other languages as well:
Using ctor/dtor and templates, "C++ Style":
class KernelBase {
protected:
    int a, b, c;
public:
    KernelBase() {
        INIT(); // replace by the contents of that macro
    }
    ~KernelBase() {
        END(); // replace by the contents of that macro
    }
    virtual void run() = 0;
};

struct KernelAdd : KernelBase {
    void run() { b = a + c; }
};

struct KernelSub : KernelBase {
    void run() { b = a - c; }
};

template<class K>
void kernel(...)
{
    K k;
    k.run();
}
void kernelA( ... ) { kernel<KernelAdd>(); }
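Note that END() in the destructor runs when k goes out of scope, i.e. after run() has finished, so the init/run/end ordering falls out of object lifetime automatically, even if run() returns early.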
Using the template method pattern, general "OOP style":
class KernelBase {
    virtual void do_run() = 0;
protected:
    int a, b, c;
public:
    void run() { // the template method
        INIT();
        do_run();
        END();
    }
};

struct KernelAdd : KernelBase {
    void do_run() { b = a + c; }
};

struct KernelSub : KernelBase {
    void do_run() { b = a - c; }
};

void kernelA(...)
{
    KernelAdd k;
    k.run();
}
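For completeness, here is a self-contained sketch of the pattern with the placeholder macros replaced by concrete bodies (the values and the write-back via std::cout are purely illustrative):

#include <iostream>

class KernelBase {
    virtual void do_run() = 0; // the kernel-specific step
protected:
    int a, b, c;
public:
    virtual ~KernelBase() = default;
    void run() { // the template method
        a = 2; c = 3;           // stands in for INIT()
        do_run();               // kernel-specific work
        std::cout << b << '\n'; // stands in for END()
    }
};

struct KernelAdd : KernelBase {
    void do_run() override { b = a + c; }
};

int main() {
    KernelAdd k;
    k.run(); // prints 5
}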
Upvotes: 5
Reputation: 5482
You can use device functions as an alternative to "INIT()" and "END()":
__device__ int init()
{
    return threadIdx.x + blockIdx.x * blockDim.x; // compute the global thread index
}
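The shared pieces then become ordinary calls inside each kernel. A sketch, where the end() helper and the parameter lists are illustrative assumptions:

__device__ void end(float* c, int id, float result)
{
    c[id] = result; // shared write-back
}

__global__ void kernelA(const float* a, const float* b, float* c)
{
    int id = init();           // shared initialization
    end(c, id, a[id] + b[id]); // kernel-specific work, shared write-back
}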
Another alternative is to use function templates:
#define ADD 1
#define SUB 2

template <int op> __global__ void calculate(float* a, float* b, float* c)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x; // init code ...
    switch (op)
    {
    case ADD:
        c[id] = a[id] + b[id];
        break;
    case SUB:
        c[id] = a[id] - b[id];
        break;
    }
    // end code ...
}
and invoke them using:
calculate<ADD><<<...>>>(a, b, c);
The CUDA compiler does the work: because the template parameter is a compile-time constant, it builds a separate version of the function for each operation and removes the dead code paths, so there is no runtime switching cost.
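A host-side launch could then look like this (buffer sizes and launch configuration are illustrative):

int n = 1 << 20;
float *a, *b, *c;
cudaMalloc(&a, n * sizeof(float));
cudaMalloc(&b, n * sizeof(float));
cudaMalloc(&c, n * sizeof(float));

int block = 256;
int grid = (n + block - 1) / block;
calculate<ADD><<<grid, block>>>(a, b, c); // instantiates the ADD variant
calculate<SUB><<<grid, block>>>(a, b, c); // instantiates the SUB variant
cudaDeviceSynchronize();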
Upvotes: 1