mneumann

Reputation: 786

Setting argument for kernel extremely slow (OpenCL)

In my OpenCL implementation of Dijkstra's algorithm, the slowest part by far is writing the flattened 1D graph matrix to the kernel argument, which resides in global memory.

My graph is a two-dimensional array; for OpenCL it gets flattened to a 1D array like so:

for (int q = 0; q < numberOfVertices; q++)
{
    for (int t = 0; t < numberOfVertices; t++)
    {
        reducedGraph[q * numberOfVertices + t] = graph[q][t];
    }
}

Put into a buffer:

cl::Buffer graphBuffer = cl::Buffer(context, CL_MEM_READ_WRITE, numberOfVertices * numberOfVertices * sizeof(int));

Setting the argument then takes an extremely long time. For my test with 5,760,000 vertices, writing the data to the argument takes more than 3 seconds while the algorithm itself takes less than a millisecond:

kernel_dijkstra.setArg(5, graphBuffer);

The kernel uses the graph as a global argument:

void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices)

Is there any way to speed this up? Thank you!

Edit: my kernel code:

// Kernel source, calculates minimum distance in segment and relaxes graph.
std::string kernel_code = R"CLC(
       void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices) {
           for (int b = 0; b < numberOfVertices; b++) {
               int gid = get_global_id(0);
               int min = huge_int, min_index = -1;
               for (int v = gid * sizeOfChunks; v < sizeOfChunks * gid + sizeOfChunks; v++) {
                   if (verticesSet[v] == false && dist[v] < min && dist[v] != 0) {
                       min = dist[v];
                       min_index = v;
                    }
               }
               result[gid] = min_index;
               if (gid != 0) continue;
               min = huge_int;
               min_index = -1;
               int current_min;
               for (int a = 0; a < numberOfVertices; a++) {
                   current_min = dist[result[a]];
                   if (current_min < min && current_min != -1 && current_min != 0) { min = current_min; min_index = result[a]; }
               }
               verticesSet[min_index] = true;
               // relax graph with found global min.
               int a = 0;
               int min_dist = dist[min_index];
               int current_dist;
               int compare_dist;
               for (int i = min_index * numberOfVertices; i < min_index * numberOfVertices + numberOfVertices; i++) {
                   current_dist = dist[a];
                   compare_dist = graph[min_index * numberOfVertices + a];
                   if (current_dist > min_dist + compare_dist && !verticesSet[a] && compare_dist != 0) {
                       dist[a] = min_dist + compare_dist;
                   }
                   a++;
               }
           }
       }
)CLC";

How I enqueue it:

numberOfComputeUnits = default_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
queue.enqueueNDRangeKernel(kernel_dijkstra, 0, cl::NDRange(numberOfVertices), numberOfComputeUnits);

Upvotes: 0

Views: 254

Answers (1)

ProjectPhysX

Reputation: 5746

The error here is that your memory allocation is way too large: the buffer size grows quadratically with the number of vertices, so 5.76 million vertices would need a buffer of roughly 133 TB (5,760,000² entries × 4 bytes). Neither the C++ compiler nor OpenCL reports this as an error, and your kernel will apparently even start and run just fine, but in reality it does not compute anything because there is not enough memory, and you will get random, undefined results.
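
As a quick sanity check, you can compute the required buffer size on the host and compare it against the device's maximum allocation size before creating the buffer. A minimal sketch, reusing default_device and numberOfVertices from the question (the check itself is not part of your code):

// An adjacency-matrix buffer needs numberOfVertices^2 entries of 4 bytes each.
// With 5,760,000 vertices that is roughly 133 TB, far beyond any device limit.
cl_ulong maxAlloc = default_device.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();
size_t requiredBytes = (size_t)numberOfVertices * numberOfVertices * sizeof(int);
if (requiredBytes > maxAlloc) {
    std::cerr << "graph buffer needs " << requiredBytes << " bytes, but the device only allows "
              << maxAlloc << " bytes per allocation\n";
}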

Generally, .setArg(...) should not take longer than a few milliseconds. It is also beneficial to do the initialization (buffer allocation, .setArg(...), etc.) only once at the beginning, and then repeatedly run the kernel or exchange data in the buffers without reallocating them.
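
A minimal sketch of that structure, assuming a host-side std::vector<int> dist of some placeholder size N, plus the context, queue and kernel_dijkstra objects from the question (names and sizes are illustrative, not your actual setup):

// One-time initialization: allocate the buffer and bind kernel arguments once.
cl::Buffer distBuffer(context, CL_MEM_READ_WRITE, N * sizeof(int));
kernel_dijkstra.setArg(0, distBuffer);

// Per-run work: only transfer data and launch the kernel, no reallocation.
queue.enqueueWriteBuffer(distBuffer, CL_TRUE, 0, N * sizeof(int), dist.data());
queue.enqueueNDRangeKernel(kernel_dijkstra, cl::NullRange, cl::NDRange(N), cl::NullRange);
queue.enqueueReadBuffer(distBuffer, CL_TRUE, 0, N * sizeof(int), dist.data());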

Upvotes: 2
