Reputation: 3769
I have a virtual function which returns a different lambda depending on the derived class:
#include <functional>

class Base
{
public:
    virtual std::function<float()> foo(void) = 0;
};

class Derived : public Base
{
public:
    std::function<float()> foo(void) override {
        return [] __device__ (void) {
            return 1.0f;
        };
    }
};
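(I compile with nvcc's --expt-extended-lambda flag, which is required for the __device__ annotation on the lambda.)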
Then I want to pass this lambda to a CUDA kernel and call it from the device. In other words, I want to do this:
template<typename Func>
__global__ void kernel(Func f) {
    f();
}

int main(int argc, char** argv)
{
    Base* obj = new Derived;
    kernel<<<1, 1>>>(obj->foo());
    cudaDeviceSynchronize();
    return 0;
}
The above gives an error like this: calling a __host__ function("std::function<float ()> ::operator ()") from a __global__ function("kernel< ::std::function<float ()> > ") is not allowed
As you can see, I declare my lambda as __device__, but the foo() method stores it in a std::function in order to return it. As a result, what is passed to kernel() is a host address and of course it does not work. I guess that is my problem, right? So my questions are:
Is it somehow possible to create a __device__ std::function and return that from the foo() method?
If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.
So far, from the quick research I did, it seems CUDA does not support the syntax required to make a function return a device lambda. I just hope I am wrong. :) Any ideas?
Thanks in advance
Upvotes: 5
Views: 1469
Reputation: 33
While I don't think you can achieve what you want with virtual functions that return device lambdas, you can achieve something similar by passing a class with a static device member function as the template parameter of your kernel. An example is provided below. Note that the classes in this example could also be structs, if you prefer.
#include <iostream>
#include <cstdlib>

// Operation: element-wise logarithm
class OpLog {
public:
    __device__ static void foo(int tid, float* x) {
        x[tid] = logf(x[tid]);
    }
};

// Operation: element-wise exponential
class OpExp {
public:
    __device__ static void foo(int tid, float* x) {
        x[tid] = expf(x[tid]);
    }
};

// Generic kernel: the operation is selected through the template parameter
template <class Op>
__global__ void my_kernel(float* x) {
    int tid = threadIdx.x;
    Op::foo(tid, x);
}

// Driver
int main() {
    using namespace std;

    // length of vector
    int len = 10;

    // generate data
    float* h_x = new float[len];
    for (int i = 0; i < len; i++) {
        h_x[i] = rand() / float(RAND_MAX);
    }

    // inspect data
    cout << "h_x = [";
    for (int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // copy onto GPU
    float* d_x;
    cudaMalloc(&d_x, len * sizeof(float));
    cudaMemcpy(d_x, h_x, len * sizeof(float), cudaMemcpyHostToDevice);

    // take the element-wise logarithm
    my_kernel<OpLog><<<1, len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len * sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for (int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // take the element-wise exponential
    my_kernel<OpExp><<<1, len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len * sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for (int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // clean up
    cudaFree(d_x);
    delete[] h_x;
    return 0;
}
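Note that because the operation is a template parameter, the selection still happens at compile time; to choose an operation at run time you need a host-side branch or lookup that launches the appropriate instantiation.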
Upvotes: 1
Reputation: 131976
Before actually answering, I have to wonder whether your question isn't an XY problem. That is, I am by default skeptical that people have a good excuse for executing code through lambdas/function pointers on the device.
But I won't evade your question like that...
Is it somehow possible to create a __device__ std::function and return that from the foo() method?
Short answer: No, try something else.
Longer answer: If you want to implement a large chunk of the standard library on the device side, then maybe you could have a device-side std::function-like class. But I'm not sure that's even possible (quite possibly not), and in any case it's beyond the capabilities of everyone except very seasoned library developers. So, do something else.
If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.
First, remember that lambdas are essentially anonymous classes, and thus, if they don't capture anything, they're reducible to function pointers, since the anonymous classes have no data, just an operator().
So if the lambdas have the same signature and no capture, you can cast them into a (non-member) function pointer and pass those to the kernel; this definitely works, see this simple example on nVIDIA's forums.
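Here is a minimal sketch of that approach, using plain __device__ functions rather than lambdas (a capture-less __device__ lambda boils down to the same thing). Host code cannot take the address of a __device__ function directly, so the pointers live in __device__ variables and are fetched back with cudaMemcpyFromSymbol:

#include <cstdio>

typedef float (*FloatFn)(void);

__device__ float return_one(void) { return 1.0f; }
__device__ float return_two(void) { return 2.0f; }

// Device-side copies of the function pointers; the host reads
// them back via cudaMemcpyFromSymbol.
__device__ FloatFn d_one = return_one;
__device__ FloatFn d_two = return_two;

__global__ void kernel(FloatFn f) {
    printf("%f\n", f());  // call through the device function pointer
}

int main(void) {
    FloatFn h_f;
    // Select one of the two functions at run time.
    cudaMemcpyFromSymbol(&h_f, d_two, sizeof(FloatFn));
    kernel<<<1, 1>>>(h_f);
    cudaDeviceSynchronize();
    return 0;
}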
Another possibility is using run-time mapping from type ids or other such keys into instances of these types, or rather, to constructors; that is, using a factory. But I don't want to get into the details of that, so as not to make this answer longer than it already is; and it's probably not a good idea anyway.
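Still, for completeness, a rough sketch of the factory idea (all names here are hypothetical): a host-side map from run-time keys to launchers, each of which instantiates the kernel with a different compile-time functor.

#include <cstdio>
#include <functional>
#include <map>
#include <string>

struct OpOne { __device__ static float value(void) { return 1.0f; } };
struct OpTwo { __device__ static float value(void) { return 2.0f; } };

template <class Op>
__global__ void kernel(void) {
    printf("%f\n", Op::value());
}

int main(void) {
    // Each entry launches a different instantiation of the kernel.
    std::map<std::string, std::function<void(void)>> factory = {
        { "one", [] { kernel<OpOne><<<1, 1>>>(); } },
        { "two", [] { kernel<OpTwo><<<1, 1>>>(); } },
    };
    factory.at("two")();  // key chosen at run time
    cudaDeviceSynchronize();
    return 0;
}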
Upvotes: 2