gatecat

Reputation: 1186

OpenCL 'non-blocking' reads have higher cost than expected

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random-access buffer reads and measures the elapsed time:

#define __CL_ENABLE_EXCEPTIONS

#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>

#include <stdio.h>

static const int size = 100000;
int host_buf[size];

int main() {
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
    std::vector<cl::Device> devices;
    ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
    printf("Using OpenCL devices: \n");
    for (auto &dev : devices) {
        std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
        printf("        %s\n", dev_name.c_str());
    }

    cl::CommandQueue queue(ctx);

    cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);

    std::vector<int> values(size);

    // Warmup
    queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
    queue.finish();

    // Run from 1 to 100000 sized chunks
    for (int k = 1; k <= size; k *= 10) {
        auto cstart = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < k; j++)
            queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
        queue.finish();
        auto cend = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
        printf("%8d: %8.02f us\n", k, time);
    }
    return 0;
}

As always, there is some random variation but the typical output for me is like this:

       1:    10.03 us
      10:   107.93 us
     100:   794.54 us
    1000:  8301.35 us
   10000: 83741.06 us
  100000: 981607.26 us

Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue; it is as if there isn't really a 'queue' at all and each read incurs the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.

Upvotes: 0

Views: 134

Answers (1)

Elad Maimoni

Reputation: 4575

You are using a single in-order command queue. Hence, all enqueued reads are performed sequentially by the hardware/driver.

The 'non-blocking' aspect simply means that the call itself is asynchronous and does not block your host code while the GPU is working. In your code, queue.finish() (clFinish under the hood) then blocks until all enqueued reads are done.

So yes, this is the expected behavior: you pay the full latency penalty for each DMA transfer.

As long as you use an in-order command queue (the default), other GPUs will behave the same way.

If your hardware/driver supports out-of-order queues, you could use one to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. But the performance is, of course, hardware- and driver-dependent.

Using multiple queues / out-of-order queues is a bit more advanced. You should make sure to use events properly, to avoid race conditions and undefined behavior.

To reduce the latency associated with GPU-host DMA transfers, it is recommended that you use a pinned host buffer rather than a std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.

Upvotes: 3
