mEm

Reputation: 373

OpenCL (in-kernel) callable SVD kernel code?

I'm studying how to offload some fairly heavy calculations to GPUs. Although my machine has an NVIDIA RTX GPU, I would like to avoid CUDA so that the result is portable to other GPUs as well (at least in its core).

Thus the choice of OpenCL.

Now, my biggest concern is that, within the core that is suitable for offloading, I make intensive use of LAPACK's SVD implementation. However, OpenCL kernel code cannot:

  1. Be linked against external libraries. There is a "workaround" using clEnqueueNativeKernel(), but it does not seem to apply here, since the call would have to happen inside the kernel itself (not to mention it is not very portable, as it requires the device to support the CL_EXEC_NATIVE_KERNEL capability; see the sketch after this list);
  2. Accept function pointers as kernel arguments.
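
For reference, this is roughly what the clEnqueueNativeKernel() workaround looks like. It is only a sketch: host_svd_wrapper and its argument struct are made-up names, not a real library API. The key point is that the native function is enqueued from the host like any other command, so it cannot be invoked from inside a running kernel:

#include <CL/cl.h>

/* Hypothetical argument block; the runtime patches the 'a' slot with a
   host pointer to the cl_mem buffer before calling the function. */
typedef struct { double *a; int m, n; } svd_args_t;

static void CL_CALLBACK host_svd_wrapper(void *args)
{
    svd_args_t *p = (svd_args_t *)args;
    /* ... call LAPACK's dgesvd/dgesdd on p->a here, on the HOST ... */
}

cl_int enqueue_native_svd(cl_command_queue q, cl_mem buf, int m, int n)
{
    svd_args_t args = { NULL, m, n };
    const void *locs[1] = { &args.a }; /* where to patch buf's host pointer */
    /* Enqueued from host code; there is no in-kernel equivalent. */
    return clEnqueueNativeKernel(q, host_svd_wrapper,
                                 &args, sizeof(args),
                                 1, &buf, locs,
                                 0, NULL, NULL);
}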

So, does anyone know of an existing open-source OpenCL SVD implementation that can be called from within a parent OpenCL kernel?

I googled and found several libraries/implementations of SVD for GPU offload, but I couldn't see how to "embed" them into an OpenCL kernel (they all seem to be implementations launched from host code). If I'm wrong, please correct me. Any help is more than welcome.

Upvotes: 0

Views: 163

Answers (1)

huseyin tugrul buyukisik

Reputation: 11910

Implement an event-callback API between host and kernel using only atomic functions, along these lines:

void callExternalLib(__global int * ptr)
{
     atomic_inc(ptr); // signal the host: flag goes from 0 to 1

     // if clWaitForEvents is not supported in kernel, spin on the flag;
     // atomic_add(ptr, 0) is an atomic read
     while(atomic_add(ptr, 0) == 1)
     {
           // busy-wait until the host writes 0 back
     }
     dynamicParallelismLaunchRestOfTheAlgorithm();
}

__kernel void test(__global int * communication, __global int * data)
{
   callExternalLib(communication);
}

// at the same time on host, in a dedicated event thread
// (if OpenCL events do not work between GPU and host):
while(ptr.load() == 0)
{
   std::this_thread::yield();
}
if(ptr.load() == CALL_SVD)
{
    clMagmaCopyToGraphicsCard(); // not required if the buffer handle can be shared
    clMagmaComputeOnGPU();
    clMagmaCopyToHost();         // not required if the buffer handle can be shared
    copyToYourOpenCLBuffer();    // not required if the buffer handle can be shared
    ptr.store(0);                // inform the kernel's threads that the clMAGMA call has finished
}
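
Here is a minimal host-side sketch of how the shared flag could be set up, assuming an OpenCL 2.0 device with fine-grained SVM and atomics, and using std::atomic<int> as the C11-compatible host-side atomic. CALL_SVD and the kernel name test follow the snippet above; the SVD call itself is left as a placeholder:

#include <CL/cl.h>
#include <atomic>
#include <thread>

constexpr int CALL_SVD = 1; // value the kernel writes to request the SVD

void run(cl_context ctx, cl_command_queue q, cl_kernel test_kernel)
{
    // Allocate the communication flag in fine-grained SVM so that both
    // host std::atomic operations and device atomic_* functions see it.
    auto *flag = static_cast<std::atomic<int>*>(
        clSVMAlloc(ctx,
                   CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER |
                   CL_MEM_SVM_ATOMICS,
                   sizeof(std::atomic<int>), 0));
    flag->store(0);

    clSetKernelArgSVMPointer(test_kernel, 0, flag);
    // ... also set the kernel's second argument (data) before launching ...
    size_t gws = 1;
    clEnqueueNDRangeKernel(q, test_kernel, 1, nullptr, &gws, nullptr,
                           0, nullptr, nullptr);
    clFlush(q); // make sure the kernel actually starts before we poll

    // Dedicated event thread: wait for the kernel's request, service it,
    // then write 0 back so the kernel's spin loop can exit.
    std::thread svc([flag] {
        while (flag->load() == 0)
            std::this_thread::yield();
        if (flag->load() == CALL_SVD) {
            // ... run the host-triggered SVD (e.g. clMAGMA) here ...
            flag->store(0); // signal the kernel that the call has finished
        }
    });

    clFinish(q);
    svc.join();
    clSVMFree(ctx, flag);
}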

From https://man.opencl.org/atomic_store.html:

With fine-grained system SVM, sharing happens at the granularity of individual loads and stores anywhere in host memory. Memory consistency is always guaranteed at synchronization points, but to obtain finer control over consistency, the OpenCL atomics functions may be used to ensure that the updates to individual data values made by one unit of execution are visible to other execution units. In particular, when a host thread needs fine control over the consistency of memory that is shared with one or more OpenCL devices, it must use atomic and fence operations that are compatible with the C11 atomic operations.

I don't know whether your graphics card / driver supports this; OpenCL 2.0 is not fully supported on all GPUs.
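
If in doubt, support can be queried at runtime; a small sketch (it returns false on pre-2.0 devices, where the query itself fails):

#include <CL/cl.h>

// Returns true only if the device reports fine-grained SVM buffers
// together with SVM atomics.
bool supports_svm_atomics(cl_device_id dev)
{
    cl_device_svm_capabilities caps = 0;
    if (clGetDeviceInfo(dev, CL_DEVICE_SVM_CAPABILITIES,
                        sizeof(caps), &caps, nullptr) != CL_SUCCESS)
        return false;
    return (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) &&
           (caps & CL_DEVICE_SVM_ATOMICS);
}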

To make host-side libraries run directly on the GPU, you'll need to convert some parts by hand:

  • allocations
  • math functions' implementations like sqrt, cos, sin, exp
  • intrinsic functions (GPUs can't run AVX, maybe except Intel's Xeon Phi?)
  • alignments of structs and arrays
  • dependencies on other libraries
  • maybe even calling conventions? (some GPUs don't have a real call stack)
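
As a tiny, purely hypothetical illustration of that kind of rewriting, a host helper that relies on libm becomes an OpenCL C function with an explicit address-space qualifier and the built-in math library (and any heap allocation it needed would have to become a fixed-size private or local buffer):

/* Host version (C/C++): uses libm's sqrt from <math.h>. */
double host_norm(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i] * v[i];
    return sqrt(s);
}

/* OpenCL C version: address-space qualifier on the pointer, built-in
   sqrt instead of libm, and no malloc available inside the kernel. */
double device_norm(__global const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i] * v[i];
    return sqrt(s);
}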

The latency of an atomically-triggered GPU-library call should be negligible when the work is heavy, but the approach is not suitable when every clock cycle counts on the GPU side. So it wouldn't be a good fit for small matrices.

Upvotes: 0
