Reputation: 203
Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction in the host code, and then resume it with another instruction in the host code? I have no idea how to write sample code for this case to flesh out the question.
Put precisely: is there an instruction in CUDA that can pause a CUDA kernel running in a CUDA stream and later resume it?
Upvotes: 2
Views: 873
Reputation: 11910
You can use dynamic parallelism, with host-visible parameters acting as signals between the host and the device. Launch a parent kernel with only one CUDA thread and let it launch child kernels repeatedly until the work is done or an exit signal is received. If a child kernel does not fully occupy the GPU, it will lose performance.
// Launched as parent<<<1,1>>>; requires -rdc=true and compute capability 3.5+.
// Note: device-side cudaDeviceSynchronize() was deprecated in CUDA 11.6 and
// removed in CUDA 12, so this exact pattern only works on older toolkits.
__global__ void parent(volatile int * atomicSignalPause,
                       volatile int * atomicSignalExit,
                       Parameters * prm)
{
    int progress = 0;
    // run until the host raises the exit signal or the work is finished
    while (*atomicSignalExit == 0 && progress < 100)
    {
        // while not paused, keep launching child kernels
        while (*atomicSignalPause == 0 && progress < 100)
        {
            child<<<X, Y>>>(prm, progress++);
            cudaDeviceSynchronize(); // wait for the child kernel to finish
        }
        // paused: spin here until the host clears the pause flag
    }
}
There is no command to pause a stream. For multiple GPUs, you should use unified memory allocation for the communication between them.
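On the host side, a minimal sketch of the pause/resume control could look like this (the flag names and the Parameters contents are placeholders of mine; it also assumes a system where the GPU reports concurrentManagedAccess, otherwise zero-copy pinned memory via cudaHostAllocMapped is the safer choice):
// Hypothetical host-side control, assuming the parent kernel above.
int *pauseFlag, *exitFlag;
cudaMallocManaged(&pauseFlag, sizeof(int));
cudaMallocManaged(&exitFlag, sizeof(int));
*pauseFlag = 0; // 0 = running, 1 = paused
*exitFlag = 0;  // 0 = keep going, 1 = exit

Parameters *prm;
cudaMallocManaged(&prm, sizeof(Parameters));

cudaStream_t stream;
cudaStreamCreate(&stream);
parent<<<1, 1, 0, stream>>>(pauseFlag, exitFlag, prm);

*pauseFlag = 1; // pause: the parent stops launching child kernels
// ... do something else on the host ...
*pauseFlag = 0; // resume
*exitFlag = 1;  // tell the parent to stop for good
cudaStreamSynchronize(stream);
Since the parent kernel spins on the flags, a pause takes effect only after the currently running child kernel finishes.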
To overcome the GPU-utilization issue, you can implement a task queue for the child kernels. The producer pushes work N times (roughly enough to keep the GPU efficient in power/compute); then, for every completed child kernel, it increments a dedicated counter in the parent kernel and pushes a new piece of work, until all work is complete (while trying to keep the number of concurrent kernels at N).
Maybe something like this:
// producer kernel (pseudocode)
// N: number of work items that keep the GPU fully utilized
while (hasWork)
{
    // concurrency is a global atomic counter
    while (checkConcurrencyAtomic(concurrency) < N)
    {
        incrementConcurrencyAtomic(concurrency);
        // a "consumer" parent kernel will take items from the queue
        // and decrement concurrency when a work item is done
        bool success = myQueue.tryPush(work, concurrency);
        if (success)
        {
            // update status of the whole work or signal the host
        }
        else
        {
            decrementConcurrencyAtomic(concurrency); // undo on failed push
        }
    }
    // synchronize once per ~N work items
    cudaDeviceSynchronize();
    // ... then check for pause signals and other tasks
}
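The pseudocode leaves myQueue.tryPush abstract. Below is a minimal sketch of what it could look like as a fixed-capacity ring buffer using device atomics; all names (WorkQueue, Work, QUEUE_CAPACITY) are my assumptions, the concurrency argument is dropped for brevity, and a production queue would need more care around wrap-around and multi-producer access:
#define QUEUE_CAPACITY 64      // assumed capacity; tune for your workload
struct Work { int id; };       // placeholder payload

// Hypothetical fixed-capacity ring buffer in global memory. Assumes a single
// producer thread (the parent kernel runs with 1 thread); consumers would pop
// by advancing head with atomicAdd and must read tail atomically as well.
struct WorkQueue
{
    Work items[QUEUE_CAPACITY];
    unsigned int tail;  // next slot to write (producer only)
    unsigned int head;  // next slot to read (advanced by consumers)

    __device__ bool tryPush(const Work &w)
    {
        // fill level = items written minus items consumed so far
        unsigned int used = tail - atomicAdd(&head, 0u); // atomic read of head
        if (used >= QUEUE_CAPACITY)
            return false;                  // queue full; caller retries later
        items[tail % QUEUE_CAPACITY] = w;  // write the item
        __threadfence();                   // publish the item before moving tail
        atomicAdd(&tail, 1u);              // advance tail so consumers see it
        return true;
    }
};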
If the total work takes more than a few seconds, these atomic value updates shouldn't be a performance problem. But if you have far too many child kernels to launch, you can run multiple producer/consumer (parent) CUDA threads.
Upvotes: -1