Reputation: 1186
Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>
static const int size = 100000;
int host_buf[size];
int main() {
cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
std::vector<cl::Device> devices;
ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
printf("Using OpenCL devices: \n");
for (auto &dev : devices) {
std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
printf(" %s\n", dev_name.c_str());
}
cl::CommandQueue queue(ctx);
cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
std::vector<int> values(size);
// Warmup
queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
queue.finish();
// Run from 1 to 100000 sized chunks
for (int k = 1; k <= size; k *= 10) {
auto cstart = std::chrono::high_resolution_clock::now();
for (int j = 0; j < k; j++)
queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
queue.finish();
auto cend = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
printf("%8d: %8.02f us\n", k, time);
}
return 0;
}
As always, there is some random variation but the typical output for me is like this:
1: 10.03 us
10: 107.93 us
100: 794.54 us
1000: 8301.35 us
10000: 83741.06 us
100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Upvotes: 0
Views: 134
Reputation: 4575
You are using a single in-order command queue. Hence, all enqueued reads are performed sequentially by the hardware / driver.
The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while GPU is working.
In your code, you use clFinish
which blocks until all reads are done.
So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
As long as you create an in-order command queue (the default), other GPUs will behave the same.
If your hardware / driver support out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively you could use multiple in-order queues. But the performance is of-course hardware & driver dependent.
Using multiple queues / out-of-order queues is a bit more advanced. You should make sure you to properly utilize events to avoid race conditions or cause undefined behavior.
To reduce latency associated with GPU-Host DMA transfers, it is recommended you use a pinned host buffer rather then std::vector
. Pinned host buffers are usually created via clCreateBuffer
with the CL_MEM_ALLOC_HOST_PTR
flag.
Upvotes: 3