BlueWanderer

Reputation: 2691

AMD OpenCL asynchronous execution efficiency

For example, I have three tasks A, B, and C. Among them, B and C depend on A, and there are sufficient CUs to run B and C at the same time. I enqueue A and C on queue0, and B on queue1. There is a huge delay after A finishes and before B starts, which makes the whole job take longer than using only one queue.

Is this normal? Or could I have done something wrong?

I will write sample code if required; the original code is heavily encapsulated. But essentially I just create an event when enqueuing A and pass it to the enqueue of B, and both queues are just normal in-order queues. Nothing seems to be special.
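Roughly, the enqueueing looks like this (kernel and queue handles are placeholders for the real ones in my code):

```c
#include <CL/cl.h>

/* Sketch of the setup: A and C on queue0, B on queue1 waiting on A's event.
   Both queues are in-order queues created on the same context and device. */
void enqueue_abc(cl_command_queue queue0, cl_command_queue queue1,
                 cl_kernel kernelA, cl_kernel kernelB, cl_kernel kernelC,
                 size_t global_size)
{
    cl_event evA = NULL;

    /* A on queue0, producing an event */
    clEnqueueNDRangeKernel(queue0, kernelA, 1, NULL, &global_size, NULL,
                           0, NULL, &evA);

    /* B on queue1, with A's event in its wait list */
    clEnqueueNDRangeKernel(queue1, kernelB, 1, NULL, &global_size, NULL,
                           1, &evA, NULL);

    /* C on queue0; the in-order queue already runs it after A */
    clEnqueueNDRangeKernel(queue0, kernelC, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    clFlush(queue0);  /* submit both queues to the device */
    clFlush(queue1);

    clReleaseEvent(evA);
}
```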

Upvotes: 2

Views: 373

Answers (1)

huseyin tugrul buyukisik

Reputation: 11910

I couldn't find official info about these latencies, but to call something "normal" we would need a statistically derived latency baseline for all platforms. Here is mine:

HD 7870 and R7 240 show the same behaviour. Windows 10, dual-channel RAM, OpenCL 1.2 (64-bit build), CodeXL profiling, all in-order queues, an older pre-Crimson driver.

  • eventless single queue with non-blocking commands: several microseconds up to 200 microseconds, fluctuating, but the average should be low, around 50 microseconds, depending on the driver; for some kernels it reaches 500 microseconds, possibly because of many parameters and similar preparation work.
  • event source = single queue A, event target = queue B: 100-150 microseconds up to half a millisecond (seemed constant)
  • event source = list of N-1 queues, event target = queue N: not the sum of all the queues' latencies; there is some kind of latency hiding, so it stays under 2 milliseconds (rarely peaking at 3-5 milliseconds)
  • event source = queue, waiting with clWaitForEvents from the host: about a millisecond
  • event source = queue, polling with clGetEventInfo from the host in a while loop: nearly half a millisecond, sometimes even less (see the sketch after this list)
  • clFinish on a single queue: this has the most latency per queue, at least 1 ms
  • user events: these were generating errors in CodeXL so I couldn't measure them, but that was an older driver and an older CodeXL version
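For the two host-side waiting cases above, this is roughly what I timed (event creation not shown; the event is assumed to come from an earlier enqueue):

```c
#include <CL/cl.h>

/* Blocking wait: one call, about a millisecond of latency in my tests */
void wait_blocking(cl_event ev)
{
    clWaitForEvents(1, &ev);
}

/* Busy-wait polling: roughly half that latency, at the cost of a
   spinning host thread */
void wait_polling(cl_event ev)
{
    cl_int status = CL_QUEUED;
    do {
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
    } while (status > CL_COMPLETE);  /* CL_COMPLETE is 0; errors are negative */
}
```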

There were background processes (Avira, Google Chrome, ...) that are advanced enough to use the GPU for their own purposes and may interfere with kernel execution.

My solution was to pipeline work across many independent queues to hide their event latencies, and it worked like a charm. The R7 240 ran fine with 16 queues. It has only 2 ACE units, so newer cards with 4-8 of them should handle even more queues.
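A stripped-down version of that pipelining idea (queue count, kernel, and slicing are just placeholders, error checking omitted):

```c
#include <CL/cl.h>

#define NUM_QUEUES 16

/* N independent in-order queues, each handling its own slice of the work,
   so the per-queue event/sync latencies overlap instead of adding up. */
void run_pipelined(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                   size_t total_items)
{
    cl_command_queue queues[NUM_QUEUES];
    size_t slice = total_items / NUM_QUEUES;

    for (int i = 0; i < NUM_QUEUES; ++i)
        queues[i] = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Each queue gets an independent slice; no cross-queue events needed */
    for (int i = 0; i < NUM_QUEUES; ++i) {
        size_t offset = (size_t)i * slice;
        clEnqueueNDRangeKernel(queues[i], kernel, 1, &offset, &slice, NULL,
                               0, NULL, NULL);
        clFlush(queues[i]);  /* submit early so the queues overlap on the device */
    }

    for (int i = 0; i < NUM_QUEUES; ++i) {
        clFinish(queues[i]);
        clReleaseCommandQueue(queues[i]);
    }
}
```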

What I didn't try and still wonder about: the performance of N queues waiting on the completion of M other queues through event lists. Maybe a tree-like waiting structure would be better for many queues if they lag too much.

Upvotes: 1
