Reputation: 373
I'm studying how to offload some quite heavy calculations on GPUs.
Although on my machine I have a NVIDIA RTX GPU
, I would like to avoid using
CUDA in order to develop something portable on other GPUs as well (at least in its core).
Thus the choice of OpenCL.
Now, my current biggest concern is that, within the core that is suitable for offload I intensively make use of LAPACK SVD implementation. However, in OpenCL, kernel code cannot either:
clEnqueueNativeKernel()
, but this does not seem to apply in this case (call within a kernel itself) (not to mention this is not very portable, since it is needed the device to support CL_EXEC_NATIVE_KERNEL
capability);So, does anyone know of the existence of a OpenCL kernel SVD open-source implemetation, which can then be called within a parent OpenCL kernel?
I googled, and found several libraries/implementations of SVD for GPU offload, but I couldn't see how to "embed" them into an OpenCL kernel (they all seem implementations to be launched from host code). If I'm wrong, please correct me. Any help is more than welcome.
Upvotes: 0
Views: 163
Reputation: 11910
Implement an event-callback API between host and kernel using only atomic functions such that:
void callExternalLib(__global int * ptr)
{
atomic_inc(ptr,1);
// if clWaitForEvents not supported in kernel
while(atomic_inc(ptr,0) == 1)
{
// somehow wait until signal 0 is received
}
dynamicParallelismLaunchRestOfTheAlgorithm();
}
__kernel void test(__global int * communication, __global int * data)
{
callExternalLib(communication);
}
// at the same time on host with a dedicated event-thread:
// if opencl-events do not work between gpu and host
while(ptr.load()==0)
{
std::this_thread::yield();
}
if(ptr.load()==CALL_SVD)
{
clMagmaCopyToGraphicsCard(); // not required if buffer handle can be shared
clMagmaComputeOnGPU();
clMagmaCopyToHost(); // not required if buffer handle can be shared
copyToYourOpenCLBuffer(); // not required if buffer handle can be shared
ptr--; // inform kernel's threads that clmagma function has been called
}
From https://man.opencl.org/atomic_store.html:
With fine-grained system SVM, sharing happens at the granularity of individual loads and stores anywhere in host memory. Memory consistency is always guaranteed at synchronization points, but to obtain finer control over consistency, the OpenCL atomics functions may be used to ensure that the updates to individual data values made by one unit of execution are visible to other execution units. In particular, when a host thread needs fine control over the consistency of memory that is shared with one or more OpenCL devices, it must use atomic and fence operations that are compatible with the C11 atomic operations.
I don't know if your graphics card / driver supports this. OpenCL 2.0 may not be fully supported by all GPUs.
To make host-side libraries run directly on GPU, you'll need to convert some parts by hand:
Latency of just an atomically-triggered GPU-library call should be negligible if the work is heavy but it's not suitable when every clock-cycle is required on GPU-side. So it wouldn't be good for working with small matrices.
Upvotes: 0