AstrOne

Reputation: 3769

CUDA: How to return a device lambda from a host function

I have a virtual function which returns a different lambda depending on the derived class:

#include <functional>

class Base
{
public:
    virtual std::function<float()> foo(void) = 0;
};

class Derived : public Base
{
public:
    std::function<float()> foo(void) {
        return [] __device__ (void) {
            return 1.0f;
        };
    }
};

Then I want to pass this lambda to a CUDA kernel and call it from the device. In other words, I want to do this:

template<typename Func>
__global__ void kernel(Func f) {
    f();
}

int main(int argc, char** argv)
{
    Base* obj = new Derived;
    kernel<<<1, 1>>>(obj->foo());
    cudaDeviceSynchronize();
    return 0;
}

The above gives an error like this: calling a __host__ function("std::function<float ()> ::operator ()") from a __global__ function("kernel< ::std::function<float ()> > ") is not allowed

As you can see, I declare my lambda as __device__, but the foo() method stores it in a std::function in order to return it. As a result, what is passed to kernel() is a host address and of course it does not work. I guess that is my problem, right? So my questions are:

1. Is it somehow possible to create a __device__ std::function and return that from the foo() method?

2. If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

So far, from the quick research I did, CUDA does not have/support the necessary syntax required to make a function return a device lambda. I just hope I am wrong. :) Any ideas?

Thanks in advance

Upvotes: 5

Views: 1469

Answers (2)

tdoublep

Reputation: 33

While I don't think you can achieve what you want using virtual functions that return device lambdas, you can achieve something similar by passing a class with a static __device__ member function as the template parameter to your kernel. An example is provided below. Note that the classes in this example could also be structs if you prefer.

#include <cstdlib>   // rand, RAND_MAX
#include <iostream>

// Operation: Element-wise logarithm
class OpLog {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = logf(x[tid]);
    }
};

// Operation: Element-wise exponential
class OpExp {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = expf(x[tid]);
    }
};

// Generic kernel
template < class Op >
__global__ void my_kernel(float * x) {
    int tid = threadIdx.x;
    Op::foo(tid,x);
}

// Driver
int main() {

    using namespace std;

    // length of vector
    int len = 10;

    // generate data
    float * h_x = new float[len];
    for(int i = 0; i < len; i++) {
        h_x[i] = rand()/float(RAND_MAX);
    }

    // inspect data
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // copy onto GPU
    float * d_x;
    cudaMalloc(&d_x, len*sizeof(float));
    cudaMemcpy(d_x, h_x, len*sizeof(float), cudaMemcpyHostToDevice);

    // Take the element-wise logarithm
    my_kernel<OpLog><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // Take the element-wise exponential
    my_kernel<OpExp><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // clean up
    cudaFree(d_x);
    delete[] h_x;

    return 0;
}

Upvotes: 1

einpoklum

Reputation: 131976

Before actually answering, I have to wonder whether your question isn't an XY problem. That is, I am by default skeptical that people have a good excuse for executing code through lambdas/function pointers on the device.

But I won't evade your question like that...

Is it somehow possible to create a __device__ std::function and return that from the foo() method?

Short answer: No, try something else.

Longer answer: If you want to implement a large chunk of the standard library on the device side, then maybe you could have a device-side std::function-like class. But I'm not sure that's even possible (quite possibly not), and anyway - it's beyond the capabilities of everyone except very seasoned library developers. So, do something else.

If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

First, remember that lambdas are essentially anonymous classes - and thus, if they don't capture anything, they're reducible to function pointers since the anonymous classes have no data, just an operator().

So if the lambdas have the same signature and no capture, you can convert them to a (non-member) function pointer and pass that to the kernel; this definitely works - see this simple example on nVIDIA's forums.
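For illustration, here is a minimal sketch of the idea (my own example, not the forum one; the function names and selection logic are arbitrary), using named __device__ functions, which is what a non-capturing lambda boils down to anyway. Since host code cannot take the address of a __device__ function directly, the pointers are stored in __device__ variables and read back with cudaMemcpyFromSymbol before being passed to the kernel:

#include <cstdio>

// Common signature for all the selectable functions
using FloatFn = float (*)();

__device__ float return_one() { return 1.0f; }
__device__ float return_two() { return 2.0f; }

// Host code cannot take the address of a __device__ function directly,
// so we store the pointers in __device__ variables and read them back.
__device__ FloatFn d_return_one = return_one;
__device__ FloatFn d_return_two = return_two;

__global__ void kernel(FloatFn f) {
    printf("%f\n", f());
}

int main() {
    FloatFn h_f;
    // Select at run time which function the kernel will call
    cudaMemcpyFromSymbol(&h_f, d_return_one, sizeof(FloatFn));
    kernel<<<1, 1>>>(h_f);
    cudaDeviceSynchronize();
    return 0;
}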

Another possibility is using a run-time mapping from type ids, or other such keys, to instances of these types - or rather, to constructors. That is, using a factory. But I don't want to get into the details of that, so as not to make this answer longer than it already is; and it's probably not a good idea anyway.
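As a rough sketch only - the key names and launcher signature below are hypothetical, and the functor-plus-templated-kernel pattern mirrors the other answer - one variation maps string keys to launch wrappers rather than to constructed objects:

#include <functional>
#include <map>
#include <string>

// Functors with a static __device__ member, as in the other answer
struct OpLog { __device__ static void foo(int tid, float* x) { x[tid] = logf(x[tid]); } };
struct OpExp { __device__ static void foo(int tid, float* x) { x[tid] = expf(x[tid]); } };

template <class Op>
__global__ void my_kernel(float* x) { Op::foo(threadIdx.x, x); }

// Hypothetical dispatch table: each entry wraps one fully-instantiated
// kernel launch, so the operation can be chosen by key at run time.
std::map<std::string, std::function<void(float*, int)>> launchers = {
    {"log", [](float* d_x, int len) { my_kernel<OpLog><<<1, len>>>(d_x); }},
    {"exp", [](float* d_x, int len) { my_kernel<OpExp><<<1, len>>>(d_x); }},
};

int main() {
    int len = 10;
    float* d_x;
    cudaMalloc(&d_x, len * sizeof(float));
    // ... fill d_x with data ...
    launchers.at("exp")(d_x, len);  // the key could come from user input
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}

This is dispatch by key rather than a true object factory, but it gives the same run-time selection without hard-coding every call at the call site.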

Upvotes: 2
