Vishwadeep Singh

Reputation: 1043

How to determine whether the AMD OpenCL CPU device and GPU devices are running in parallel?

I have four jobs (grouped into two tasks) that need to be executed on the AMD OpenCL CPU device and the GPU device in parallel. As far as I know, calling clEnqueueNDRangeKernel for the AMD OpenCL CPU device returns promptly (non-blocking) if a NULL event is passed.

TASK 1: Hence, I first call clEnqueueNDRangeKernel on the AMD OpenCL CPU device for job 1, after which the host regains control promptly.

ret = clEnqueueNDRangeKernel(command_queue_amd, icm_kernel_amd, 1, NULL, &glob, &local, 0, NULL, NULL);

TASK 2: Then the host calls clEnqueueNDRangeKernel on the GPU device with GPU kernel 1 for job 2, then GPU kernel 2 for job 3, and then GPU kernel 3 for job 4, which enqueues them serially.

ret = clEnqueueNDRangeKernel(command_queue_gpu, icm_kernel_gpu[0], 1, NULL, &glob, &local, 0, NULL, NULL);
ret = clEnqueueNDRangeKernel(command_queue_gpu, icm_kernel_gpu[1], 1, NULL, &glob, &local, 0, NULL, NULL);
ret = clEnqueueNDRangeKernel(command_queue_gpu, icm_kernel_gpu[2], 1, NULL, &glob, &local, 0, NULL, NULL);

These calls are not returning promptly to the host.

Then I read the buffer for the GPU and then for the CPU.

ret = clEnqueueReadBuffer(command_queue_gpu, Buffer_gpu, CL_TRUE, 0, count * sizeof(double), arr_gpu, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue_amd, Buffer_amd, CL_TRUE, 0, count * sizeof(double), arr_cpu, 0, NULL, NULL);

My question is: are both tasks running in parallel? Is there a profiler or some logic to detect such behaviour? Any comments/logic/pointers will be appreciated.

Upvotes: 1

Views: 1069

Answers (2)

DarkZeros

Reputation: 8410

Let me write a proper answer:

Parallel execution of the kernels depends on the device/queue model used. From the general "spec" point of view:

  • A queue runs its jobs in order (no overlap or parallel execution possible), unless the queue is created with the out-of-order execution property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE), in which case parallel and out-of-order execution is possible, and everything is controlled with events.

But from the HW point of view (NVIDIA, AMD, etc.):

  • A device can only run one kernel at a time. Therefore, since a queue is bound to a single device, a queue can't process kernels in parallel.

In a multi-device setting, this constraint is relaxed, and the kernels can run in parallel on different devices. But in order to run fully in parallel, there are some rules to meet:

  • The chain for each device should be completely separate: kernel, queue, memory, etc. (the context can be shared).
  • If memory has to be shared, fine-grained control of it is recommended. Kernels writing to the same memory can end up blocking one another.
  • If a kernel uses the output of another kernel as its input, the execution will not be parallel.
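Applied to the question above, the separate-chain rule could look like the following pseudocode-style sketch (OpenCL 1.x API; all variable names are hypothetical and error checking is omitted):

```
// same context, but everything else split per device
cl_command_queue queue_cpu = clCreateCommandQueue(ctx, device_cpu, 0, &err);
cl_command_queue queue_gpu = clCreateCommandQueue(ctx, device_gpu, 0, &err);

cl_mem buf_cpu = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
cl_mem buf_gpu = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

// enqueue on both queues; each call returns promptly
clEnqueueNDRangeKernel(queue_cpu, kernel_cpu, 1, NULL, &glob, &local, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue_gpu, kernel_gpu, 1, NULL, &glob, &local, 0, NULL, NULL);

// flush both queues so each device actually starts working
clFlush(queue_cpu);
clFlush(queue_gpu);

// blocking reads only after both chains are in flight
clEnqueueReadBuffer(queue_gpu, buf_gpu, CL_TRUE, 0, size, arr_gpu, 0, NULL, NULL);
clEnqueueReadBuffer(queue_cpu, buf_cpu, CL_TRUE, 0, size, arr_cpu, 0, NULL, NULL);
```

The clFlush calls are the part most relevant to the question: enqueueing does not guarantee submission to the device, and a blocking call on one queue only flushes that queue, so issuing a blocking read before the other queue has been flushed can serialize the two chains on some runtimes.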

In order to measure whether parallel execution is in fact happening, I recommend using events.

You can do it the hard way (manually), or you can use CodeXL, nSight or the Intel SDK. They will collect these metrics for you by hooking the OpenCL calls, and give you the insight you need (in a very convenient format, with figures and statistics).

Upvotes: 4

Roman Arzumanyan

Reputation: 1814

Though a comment about the command queue was already made, there is something to add.

You can use the AMD CodeXL tool to collect an application timeline and see whether the tasks run in parallel. Another very simple solution: look at the CPU load level in your OS task manager and simultaneously do the same for the GPU in Catalyst Control Center. If both load levels increase at the same time, the tasks are running in parallel.

Upvotes: 1
