Monell

Reputation: 1

Can we get the CUDA kernel function name in cudaLaunchKernel?

I am trying to insert some code into cudaLaunchKernel and need to store the name of the kernel being launched, but I cannot find a direct API that returns the kernel function name. I have considered CUPTI, but it delivers the information through a callback function, so I cannot change the behavior of the kernel launch (or I would need heavy inter-process communication, which is ugly).

Is there any way to get the function name inside cudaLaunchKernel (maybe from the function pointer)?

An example is as follows.

cudaLaunchKernel(...) {
    kernel_name = getKernelNameBySomeMethod(); // this is what I want
    send_to_other_processes(kernel_name);
    return ::cudaLaunchKernel(...);
}

// for other process
receive_kernel_name_from_other_process;
store_it;

Edit: An identifier is also OK. I may send the ID to another process to be stored, so I need a way to distinguish different CUDA kernels.

Upvotes: 0

Views: 1859

Answers (1)

talonmies

Reputation: 72339

As far as I know, there is no API to do this, either public or private. The compiler emits a lot of static host-side boilerplate to perform the runtime API magic we take for granted; it isn't done by the runtime library itself.

However, the nature of that boilerplate means you can build your own lookup table pretty easily. Some hacking over a lunch break got me this partial proof of concept, which does what I think you want:

#include <cstdio>
#include <map>
#include <string>
#include <iostream>

__global__ void kernel_1(float *in, float *out, int N)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    if (tidx == 0) printf("Running kernel_1\n");
    if (tidx < N) out[tidx] = in[tidx];
}


__global__ void kernel_2(float *in, float *out, int N)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    if (tidx == 0) printf("Running kernel_2\n");
    if (tidx < N) out[tidx] = 2.f * in[tidx];
}

__global__ void kernel_3(float *in, float *out, int N)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    if (tidx == 0) printf("Running kernel_3\n");
    if (tidx < N) out[tidx] = 3.f * in[tidx];
}


void notakernel(float *in, float *out, int N)
{
   printf("Something bad happened\n");
}

std::map <void*, std::string> ktable = {
    { (void*)kernel_1, "kernel_1" },
    { (void*)kernel_2, "kernel_2" },
    { (void*)kernel_3, "kernel_3" } };


cudaError_t MyLaunchKernel (void* func, dim3 gridDim, dim3 blockDim, void** args, size_t sharedMem, cudaStream_t stream)
{
    auto it = ktable.find(func);
    if (it != ktable.end()) {
        std::cout << "Received request to call " << it->second << std::endl;
    } else {
        std::cout << "Received request to call unknown function!" << std::endl;
    }

    return cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
}

int main()
{

    int N = 100;
    float* a; cudaMalloc<float>(&a, N * sizeof(float));  
    float* b; cudaMalloc<float>(&b, N * sizeof(float));  
    void* args[] = { (void*)&a, (void*)&b, (void*)&N };

    MyLaunchKernel((void*)kernel_1, dim3(1), dim3(1), args, 0, NULL);
    cudaDeviceSynchronize();

    MyLaunchKernel((void*)kernel_2, dim3(1), dim3(1), args, 0, NULL);
    cudaDeviceSynchronize();

    MyLaunchKernel((void*)kernel_3, dim3(1), dim3(1), args, 0, NULL);
    cudaDeviceSynchronize();

    MyLaunchKernel((void*)notakernel, dim3(1), dim3(1), args, 0, NULL);
    cudaDeviceSynchronize();

    return 0;
}

which appears to work:

$ nvcc -std=c++11 -arch=sm_52  -o lookup lookup.cu
$ cuda-memcheck ./lookup
========= CUDA-MEMCHECK
Received request to call kernel_1
Running kernel_1
Received request to call kernel_2
Running kernel_2
Received request to call kernel_3
Running kernel_3
Received request to call unknown function!
========= Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3b9803]
=========     Host Frame:./lookup [0x4ca95]
=========     Host Frame:./lookup [0x746c]
=========     Host Frame:./lookup [0x769f]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./lookup [0x722a]
=========
========= ERROR SUMMARY: 1 error

Obviously a complete implementation for your use case would need to be a bit more complex: you would need a reverse lookup so that another caller can go from name/ID back to pointer, and if you have multiple source files compiled separately, you would need a way to concatenate their lists into the working table at runtime. But it is important to remember that the function pointers you are passing are actually host pointers, not device pointers (thanks to the runtime API magic), so the cost and complexity of the runtime setup is trivial: you can use pre-baked C++ standard library containers, algorithms, and function adapters to do most of the heavy lifting.
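To make the last two points concrete, here is a minimal host-only sketch of a two-way table (pointer to ID/name and ID back to pointer) that each source file can append to during static initialization. Everything in it is an assumption for illustration: KernelRegistry, KernelRegistrar and registry() are hypothetical names, and plain host functions stand in for the __global__ kernels so the sketch builds without nvcc. That substitution is only reasonable because the pointers passed to cudaLaunchKernel are host pointers.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Two-way table between host function pointers and (ID, name) pairs.
// All names here are hypothetical; none of this is part of the CUDA API.
class KernelRegistry {
public:
    // Registers func under name and returns its ID (IDs start at 1).
    // Registering the same pointer again returns the ID it already has.
    std::uint32_t add(const void* func, const std::string& name) {
        auto it = by_ptr_.find(func);
        if (it != by_ptr_.end()) return it->second;
        const std::uint32_t id = next_id_++;
        by_ptr_.emplace(func, id);
        by_id_.emplace(id, std::make_pair(func, name));
        return id;
    }

    // Forward lookup: pointer -> ID, or 0 if the pointer was never registered.
    std::uint32_t id_of(const void* func) const {
        auto it = by_ptr_.find(func);
        return it == by_ptr_.end() ? 0u : it->second;
    }

    // ID -> name, or an empty string for an unknown ID.
    std::string name_of(std::uint32_t id) const {
        auto it = by_id_.find(id);
        return it == by_id_.end() ? std::string() : it->second.second;
    }

    // Reverse lookup: ID -> pointer, or nullptr for an unknown ID.
    const void* ptr_of(std::uint32_t id) const {
        auto it = by_id_.find(id);
        return it == by_id_.end() ? nullptr : it->second.first;
    }

private:
    std::uint32_t next_id_ = 1;
    std::map<const void*, std::uint32_t> by_ptr_;
    std::map<std::uint32_t, std::pair<const void*, std::string>> by_id_;
};

// A function-local static is constructed on first use, so registrars in any
// source file can call this safely during static initialization.
// (If this lived in a shared header, it would need to be declared inline.)
KernelRegistry& registry() {
    static KernelRegistry instance;
    return instance;
}

// Defining one of these at namespace scope next to a kernel appends that
// kernel to the shared table; this is the "list concatenation" step.
struct KernelRegistrar {
    KernelRegistrar(const void* func, const char* name) {
        registry().add(func, name);
    }
};

// Host stand-ins for __global__ kernels (assumption: in a .cu file these
// would be the real kernels and the same casts would apply).
void kernel_1(float* in, float* out, int n) { for (int i = 0; i < n; ++i) out[i] = in[i]; }
void kernel_2(float* in, float* out, int n) { for (int i = 0; i < n; ++i) out[i] = 2.f * in[i]; }
void not_registered(float*, float*, int) {}

// Within one source file, registration happens in definition order.
static KernelRegistrar reg_kernel_1(reinterpret_cast<const void*>(&kernel_1), "kernel_1");
static KernelRegistrar reg_kernel_2(reinterpret_cast<const void*>(&kernel_2), "kernel_2");
```

Inside a launch wrapper, registry().id_of(func) would give the value to send to the other process, and registry().ptr_of(id) would recover the pointer from an ID received back. One caveat with this design: the order in which separate source files register is not specified by the language, so the numeric IDs are only stable within a single build of a single program. If the receiving process needs something that survives rebuilds, sending the name is probably the safer choice.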

Upvotes: 1
