Spiros

Reputation: 2356

Namespaces as template parameters in CUDA

In C++, passing a namespace as some sort of parameter (by means of templates or actual function parameters) to a class or function is not possible. The same applies to CUDA (at least, to my knowledge). Some reasons are explained in this question: Why can't namespaces be template parameters?

Here is an example of a use case:

namespace experiment1
{
    int repetitions() { return 2; }
    void setup() { ... }
    void f() { ... }
    void teardown() { ... }
}

namespace experiment2
{
    int repetitions() { return 4; }
    void setup() { ... }
    void f() { ... }
    void teardown() { ... }
}

// Beware, this is invalid C++ and invalid CUDA
template<namespace NS>
void do_test()
{
    // Do something with NS::repetitions(), NS::setup(), ...
} 

One of the reasons why this is not valid in C++ is that there is nothing in this approach that you could not do with classes. You could indeed turn every namespace into a class and the functions into member functions, then pass the class as template parameter to the do_test function, or an instance of it as parameter to the same function (possibly using static functions in the former case or virtual functions in the latter case).

I agree with this. However, in the specific case of CUDA, there is something you could do with namespaces, but not with classes. Imagine f is a kernel, i.e. a __global__ function, and that setup or another function is used to specify, e.g. the size of the shared memory to be allocated for the kernel. Kernels cannot be members of classes (see the answer to this question: Can CUDA kernels be virtual functions?). However, you can enclose it with the other functions related to the same experiment in the same namespace.

Consider the case shown in the code above: do_test is the function which sets up the timers, prepares some input, checks the output, measures the time and does some other operations. Every experiment is a set of a few functions, with the same name and same interface, one of which is a kernel. You want do_test to be generic enough to handle all these experiments. And you want the code for each experiment to be self-contained in some form of encapsulation like namespace, struct, class,...

Can this problem be solved?


As requested by talonmies (many thanks for your comment, by the way), I will make the question more concrete.

I have several very simple kernels which perform similar operations. They load values from one large array, apply a stencil operation on them and write the result onto an output array (different from the input array). By stencil operation, I mean an operation performed by thread idx on input value idx and its neighboring values (say from idx-3 to idx+3). The simplest of these kernels performs just a copy from input to output: every thread reads input[idx] and writes output[idx]. Another example is a difference stencil which performs output[idx] = input[idx+1] - input[idx-1]. (I'm leaving apart some details, but you get the idea.)

I want to benchmark these kernels so as to derive a performance model. For each kernel I also need a host function which is able to check the result. In each case I also have another kernel which performs the same operation in a slightly different way, through an optimization, but is equivalent from the point of view of the result. Finally, I have a host function which prints the name of the kernel. Here is a summary in code:

namespace copy
{
    std::string name() { return "copy"; }
    __global__ void kernel(const float* input, float* output, int size);
    __global__ void kernelOptimized(const float* input, float* output, int size);
    bool check(const float* input, const float* output);
}

namespace difference
{
    std::string name() { return "difference"; }
    __global__ void kernel(const float* input, float* output, int size);
    __global__ void kernelOptimized(const float* input, float* output, int size);
    bool check(const float* input, const float* output);
}

I have a function do_test that I parametrized to be generic:

typedef std::string NameFunction();
typedef bool CheckFunction(const float* input, const float* output);
typedef void KernelFunction(const float* input, float* output, int size);

void do_test(NameFunction name, KernelFunction kernel1, KernelFunction kernel2, CheckFunction check)
{
    // Set up input and output array
    // Set up CUDA events
    // Warm up kernels
    // Run kernels
    // Check results
    // Measure time
    // Do standard output
}

int main()
{
    do_test(copy::name, copy::kernel, copy::kernelOptimized, copy::check);
    do_test(difference::name, difference::kernel, difference::kernelOptimized, difference::check);
}

Now, of course this way is already quite good. However, if I introduce one more function that every experiment must provide, I will need to modify all these lines where I call do_test. I would prefer passing this namespace or some sort of object containing these functions.

Upvotes: 2

Views: 1778

Answers (1)

m.s.

Reputation: 16344

You could modify your kernels to be "just" __device__ functions that are then called through a kernel_wrapper:

#include <iostream>
#include <stdio.h>


typedef void (*kernel_ptr)(const float* input, float* output, int size);

template <kernel_ptr kernel>
__global__
void kernel_wrapper(const float* input, float* output, int size)
{
    kernel(input, output, size);
}

struct copy
{
    std::string name() { return "copy"; }
    __device__ static void kernel(const float* input, float* output, int size){ printf("copy: %d\n",threadIdx.x); }
    __device__ static void kernelOptimized(const float* input, float* output, int size){ printf("copy optimized: %d\n",threadIdx.x); }
};

struct difference
{
    std::string name() { return "difference"; }

    __device__ static void kernel(const float* input, float* output, int size){ printf("difference: %d\n",threadIdx.x); }
    __device__ static void kernelOptimized(const float* input, float* output, int size){ printf("difference optimized: %d\n",threadIdx.x); }
};

template <typename Experiment>
void do_test()
{
    dim3 dimBlock( 4, 1 );
    dim3 dimGrid( 1, 1 );
    Experiment e;

    std::cout << "running experiment " << e.name() << std::endl;
    std::cout << "launching the normal kernel" << std::endl;
    kernel_wrapper<Experiment::kernel><<<dimGrid, dimBlock>>>(0,0,0);
    cudaDeviceSynchronize();
    std::cout << "launching the optimized kernel" << std::endl;
    kernel_wrapper<Experiment::kernelOptimized><<<dimGrid, dimBlock>>>(0,0,0);
    cudaDeviceSynchronize();
}


int main()
{
    do_test<copy>();
    do_test<difference>();
    return 0;
}

output:

running experiment copy
launching the normal kernel
copy: 0
copy: 1
copy: 2
copy: 3
launching the optimized kernel
copy optimized: 0
copy optimized: 1
copy optimized: 2
copy optimized: 3
running experiment difference
launching the normal kernel
difference: 0
difference: 1
difference: 2
difference: 3
launching the optimized kernel
difference optimized: 0
difference optimized: 1
difference optimized: 2
difference optimized: 3

Alternatively, you could use a combination of CRTP and template specialization:

#include <iostream>
#include <stdio.h>


template <typename Experiment>
__global__ void f();

template <typename Derived>
struct experiment
{
    void run()
    {
        int blocksize = static_cast<Derived*>(this)->blocksize();
        int reps = static_cast<Derived*>(this)->repetitions();
        for (int i = 0; i<reps; ++i)
        {
            dim3 dimBlock( blocksize, 1 );
            dim3 dimGrid( 1, 1 );
            f<Derived><<<dimGrid, dimBlock>>>();
        }
        cudaDeviceSynchronize();
    }
};

struct experiment1 : experiment<experiment1>
{
    int repetitions() { return 2; }
    int blocksize() { return 4; }
    experiment1() { std::cout << "setting up experiment 1" << std::endl; }
    ~experiment1() {  std::cout << "shutting down experiment 1" << std::endl;  }
};

template <>
__global__
void f<experiment1>()
{
    printf("experiment1: %d\n",threadIdx.x);
}


struct experiment2 : experiment<experiment2>
{
    int repetitions() { return 4; }
    int blocksize() { return 2; }
    experiment2() { std::cout << "setting up experiment 2" << std::endl; }
    ~experiment2() {  std::cout << "shutting down experiment 2" << std::endl;  }
};

template <>
__global__
void f<experiment2>()
{
    printf("experiment2: %d\n",threadIdx.x);
}

template<typename Experiment>
void do_test()
{
    Experiment e;
    e.run();
}

int main()
{
    do_test<experiment1>();
    do_test<experiment2>();
    return 0;
}

output:

setting up experiment 1
experiment1: 0
experiment1: 1
experiment1: 2
experiment1: 3
experiment1: 0
experiment1: 1
experiment1: 2
experiment1: 3
shutting down experiment 1
setting up experiment 2
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
experiment2: 0
experiment2: 1
shutting down experiment 2

Upvotes: 3
