OpenCL 'non-blocking' reads have higher cost than expected

Question

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:

#define __CL_ENABLE_EXCEPTIONS

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

#include <CL/cl.hpp>

static const int size = 100000;
int host_buf[size];

int main() {
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
    std::vector<cl::Device> devices;
    ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
    printf("Using OpenCL devices:\n");
    for (auto &dev : devices) {
        std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
        printf("        %s\n", dev_name.c_str());
    }

    cl::CommandQueue queue(ctx);

    cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);

    std::vector<int> values(size);

    // Warmup
    queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
    queue.finish();

    // Run from 1 to 100000 sized chunks
    for (int k = 1; k <= size; k *= 10) {
        auto cstart = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < k; j++)
            queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
        queue.finish();
        auto cend = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
        printf("%8d: %8.02f us\n", k, time);
    }
    return 0;
}

As always, there is some random variation but the typical output for me is like this:

       1:    10.03 us
      10:   107.93 us
     100:   794.54 us
    1000:  8301.35 us
   10000: 83741.06 us
  100000: 981607.26 us

Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.

Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?

Elad Maimoni · Accepted Answer

You are using a single in-order command queue. Hence, all enqueued reads are performed sequentially by the hardware / driver.

The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while the GPU is working. In your code, you then call queue.finish() (clFinish), which blocks until all reads are done.

So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
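The timings in the question are consistent with this: dividing each total by the number of reads gives a roughly constant cost per read, which is what a serialized queue would produce. A small sketch of that arithmetic follows. The `Sample` struct and the helper names are made up for illustration; the numbers are copied from the question's output.

```cpp
#include <cstdio>
#include <vector>

// One row of the question's output: number of reads and total time in microseconds.
struct Sample {
    int reads;
    double total_us;
};

// Average cost of a single read, in microseconds.
double per_read_us(const Sample &s) {
    return s.total_us / s.reads;
}

// The measurements reported in the question (GTX 960, driver 455.45.01).
std::vector<Sample> question_samples() {
    return {
        {1, 10.03},      {10, 107.93},      {100, 794.54},
        {1000, 8301.35}, {10000, 83741.06}, {100000, 981607.26},
    };
}

// Print the per-read cost for every row.
void print_per_read_costs() {
    for (const Sample &s : question_samples())
        std::printf("%8d reads: %6.2f us per read\n", s.reads, per_read_us(s));
}
```

Every row works out to roughly 8 to 11 us per read, so the total grows linearly with the number of reads instead of the latency being amortized.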

As long as you create an in-order command queue (the default), other GPUs will behave the same.

If your hardware / driver supports out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. But the performance is of course hardware and driver dependent.

Using multiple queues or out-of-order queues is a bit more advanced. Make sure you use events properly, to avoid race conditions and undefined behavior.

To reduce the latency associated with GPU-Host DMA transfers, it is recommended that you use a pinned host buffer rather than std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
