Reputation: 163
If I have a multithreaded application with my own thread that controls the CUDA device and schedules kernels to different streams, I can achieve very high GPU usage even on devices prior to the Kepler-2 (GK110) family, such as Fermi and Kepler-1 (GK104).
So I don't see a good reason to aspire to more expensive cards.
Following is my test example, which I profiled on a Fermi card:
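    // checkCudaErrors comes from helper_cuda.h (CUDA samples);
    // Kernel_Work is defined elsewhere.
    void ut_concurent_kernels()
    {
        int i, j;
        cudaEvent_t kernelEvent;
        cudaStream_t work_stream[14];

        // One stream per concurrent kernel.
        for (i = 0; i < 14; i++)
        {
            checkCudaErrors(cudaStreamCreate(&work_stream[i]));
        }
        checkCudaErrors(cudaEventCreateWithFlags(&kernelEvent, cudaEventDisableTiming));

        for (j = 0; j < 2; j++)
        {
            // Launch one kernel into each stream.
            for (i = 0; i < 14; i++)
            {
                if (i == 13)
                {
                    // Record the event in the last stream, before its kernel.
                    checkCudaErrors(cudaEventRecord(kernelEvent, work_stream[i]));
                }
                Kernel_Work<<<1, 256, 0, work_stream[i]>>>(100000);
            }
            // i == 14 after the loop, so the original work_stream[i-1] is stream 13.
            checkCudaErrors(cudaStreamWaitEvent(work_stream[13], kernelEvent, 0));
        }
        checkCudaErrors(cudaDeviceSynchronize());

        for (i = 0; i < 14; i++)
        {
            checkCudaErrors(cudaStreamDestroy(work_stream[i]));
        }
        checkCudaErrors(cudaEventDestroy(kernelEvent));
    }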
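Kernel_Work itself is not shown in the example. A minimal stand-in could look like the sketch below. It assumes the argument is a number of GPU clock cycles to busy-wait, which is only a guess; the real kernel may do something else. The helper `run_kernel_work_once` is a hypothetical name added for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for Kernel_Work (not the original kernel).
// Assumption: the argument is a number of GPU clock cycles to spin for.
__global__ void Kernel_Work(int cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles)
    {
        // Busy-wait so the kernel stays resident long enough to overlap with others.
    }
}

// Launches the stand-in once on the default stream; returns 0 on success.
int run_kernel_work_once()
{
    Kernel_Work<<<1, 256>>>(100000);
    cudaError_t err = cudaGetLastError();      // catches launch errors
    if (err == cudaSuccess)
    {
        err = cudaDeviceSynchronize();         // catches execution errors
    }
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Kernel_Work failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    return run_kernel_work_once();
}
```

A small, long-running kernel like this (one block of 256 threads) leaves most of the GPU idle, which makes overlap between streams easy to see in the profiler.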
Upvotes: 0
Views: 1317
Reputation: 151849
I believe the most programmer-visible manifestation of Hyper-Q is a Linux feature called CUDA Multi-Process Service (MPS). The principal use case for CUDA MPS (on top of Hyper-Q) at this time is to let MPI cluster compute jobs share a GPU among several MPI ranks executing on the same node, as discussed here. In this case, the MPI ranks normally run in independent CPU processes (as opposed to threads), and CUDA MPS/Hyper-Q provides a convenient sharing mechanism.
As already noted, Hyper-Q is a hardware feature that is available in cc 3.5 devices, but not in cc 1.x through 3.0 devices. In the context of a single application process, whether single- or multi-threaded, I know of no advantages or compelling use cases that CUDA MPS (on top of Hyper-Q) provides over managing concurrent execution yourself.
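To find out which side of that line a given card falls on, you can query the compute capability at runtime. The following is a minimal sketch, not an official Hyper-Q check: `has_hyper_q` is a hypothetical helper, and treating every device at cc 3.5 or above as having Hyper-Q is my assumption based on the statement above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper. Assumption: cc 3.5 and later devices have Hyper-Q,
// cc 1.x through 3.0 devices do not (as stated in the answer).
static bool has_hyper_q(int major, int minor)
{
    return major > 3 || (major == 3 && minor >= 5);
}

int main()
{
    int device = 0;  // first CUDA device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("%s: cc %d.%d\n", prop.name, prop.major, prop.minor);
    // concurrentKernels reports whether the device can run kernels concurrently at all.
    printf("concurrent kernels supported: %s\n", prop.concurrentKernels ? "yes" : "no");
    printf("Hyper-Q expected (assumption): %s\n",
           has_hyper_q(prop.major, prop.minor) ? "yes" : "no");
    return 0;
}
```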
With respect to managing concurrent execution yourself, Hyper-Q will (transparently) provide benefits to the programmer, as described by @Greg Smith in his comments below. There are no specific requirements to take advantage of it in a cc 3.5 device. Conceptually, the user manages concurrent execution using the same overall methodology, but may experience improved concurrency, or find it easier to achieve, in a cc 3.5 device due to the hardware advances in Hyper-Q. As Greg suggests and OP has demonstrated, it is still possible to achieve concurrency in pre-cc 3.5 devices, but the opportunities to do so may be more limited, or may require less-than-obvious programming techniques. For a more detailed view of these issues, refer to slides 15-22 here.
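One example of such a less-than-obvious technique on pre-cc 3.5 devices is issue order. When each stream runs more than one kernel, launching "breadth-first" (the first kernel into every stream, then the second kernel into every stream) tends to avoid false dependencies in the single hardware work queue. A minimal sketch follows; the kernels `stage_a` and `stage_b`, the helper `run_breadth_first`, and the stream count are all hypothetical names and values chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_STREAMS 4   // illustrative value
#define N 256           // elements per stream, one block of N threads

// Hypothetical two-stage workload: stage_b depends on stage_a within a stream.
__global__ void stage_a(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void stage_b(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs both stages in every stream, issued breadth-first.
// Returns the first element of stream 0's buffer, or -1.0f on error.
float run_breadth_first()
{
    cudaStream_t streams[NUM_STREAMS];
    float *buf[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; s++)
    {
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return -1.0f;
        if (cudaMalloc(&buf[s], N * sizeof(float)) != cudaSuccess) return -1.0f;
        cudaMemsetAsync(buf[s], 0, N * sizeof(float), streams[s]);  // all zeros
    }

    // Breadth-first: stage_a for all streams first, then stage_b for all streams.
    // Depth-first would instead issue stage_a and stage_b back to back per stream.
    for (int s = 0; s < NUM_STREAMS; s++)
        stage_a<<<1, N, 0, streams[s]>>>(buf[s], N);
    for (int s = 0; s < NUM_STREAMS; s++)
        stage_b<<<1, N, 0, streams[s]>>>(buf[s], N);

    float first = -1.0f;
    if (cudaDeviceSynchronize() == cudaSuccess)
        cudaMemcpy(&first, buf[0], sizeof(float), cudaMemcpyDeviceToHost);

    for (int s = 0; s < NUM_STREAMS; s++)
    {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return first;  // (0 + 1) * 2 for a successful run
}

int main()
{
    printf("first element: %f\n", run_breadth_first());
    return 0;
}
```

Ordering within each stream is still preserved, so the result is the same either way; only the opportunity for overlap between streams changes. On a cc 3.5 device the issue order should matter much less.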
Upvotes: 2