MaiaVictor

Reputation: 53017

Should calls to kernels be encoded to fit the "GPU layout", or the "algorithm layout"?

I'm just learning about CUDA/OpenCL, but I have a conceptual question. Suppose I am encoding an algorithm to do a Breadth-First Search on a graph. Suppose my target device is a GPU with only 2 work groups of 2 processing elements each (i.e., 4 cores). Intuitively, I guess the search could be done in parallel by keeping an array of "nodes to visit". For each pass, each node in that array is visited in parallel, and its unvisited neighbours are added to the next "nodes to visit" array. For example, a graph could spawn the following search:

start: must visit A
parallel pass 1: must visit B, C, D
parallel pass 2: must visit E, F, G, H, I, J, K, L, M
parallel pass 3: must visit N, O, P
parallel pass 4: must visit Q
parallel pass 5: done
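The kernel for a single pass, I imagine, would look something like this (just a rough sketch: I'm assuming a CSR adjacency layout (row_offsets/columns) and an atomic counter for the size of the next frontier, and all the buffer names are made up):

__kernel void bfs_pass(__global const int *row_offsets,
                       __global const int *columns,
                       __global const int *frontier,       /* nodes to visit this pass */
                       __global int *next_frontier,        /* nodes to visit next pass */
                       __global volatile int *next_count,  /* atomic counter */
                       __global volatile int *visited)     /* 0/1 flag per node */
{
    int i = get_global_id(0);           /* one work item per frontier node */
    int node = frontier[i];
    for (int e = row_offsets[node]; e < row_offsets[node + 1]; ++e) {
        int nbr = columns[e];
        /* claim the neighbour exactly once, then append it to the next frontier */
        if (atomic_cmpxchg(&visited[nbr], 0, 1) == 0)
            next_frontier[atomic_inc(next_count)] = nbr;
    }
}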

A way to do it, I suppose, would be to call an ND-range kernel 5 times (example in OpenCL):

size_t n;  /* global and local work size: one work group of n work items */
n = 1; clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, &n, 0, NULL, NULL);  /* visit A */
n = 3; clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, &n, 0, NULL, NULL);  /* visit B, C, D */
n = 9; clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, &n, 0, NULL, NULL);  /* visit E..M */
n = 3; clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, &n, 0, NULL, NULL);  /* visit N, O, P */
n = 1; clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, &n, 0, NULL, NULL);  /* visit Q */

(Of course, this is hard-coded for this case; to avoid that, I guess I could keep a counter of nodes to visit.) This sounds wrong, though: it doesn't fit the layout of the GPU at all. For example, on the third call you are using 1 work group of 9 work items, when the GPU has 2 work groups of 2 work items!
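To be concrete, the counter-based version I have in mind would be a host loop like this (again a rough sketch; next_count_buf is a hypothetical buffer holding the atomic counter from the kernel above), and it has the same mismatch, since n can be any value:

size_t n = 1;  /* the first frontier contains only the root node A */
while (n > 0) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    cl_int count = 0;
    clEnqueueReadBuffer(queue, next_count_buf, CL_TRUE, 0, sizeof(count),
                        &count, 0, NULL, NULL);  /* blocking read of the counter */
    n = (size_t)count;
    /* swap the frontier buffers and zero the counter before the next pass */
}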

Another way I can see would be the following (example in OpenCL):

size_t gws = 4, lws = 2;  /* 2 work groups of 2 work items each */
while (/* work not complete... */)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, NULL);

This would call "clEnqueueNDRangeKernel" continuously, in a way that fits the GPU layout perfectly, until it receives a signal that the work is done. But this time, the IDs received by the kernel wouldn't fit the layout of the algorithm!
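Inside the kernel, I suppose I would then have to remap the fixed thread IDs onto the frontier myself, something like this (same made-up buffers as above, plus a frontier_size the host would update each pass):

__kernel void bfs_pass_fixed(__global const int *row_offsets,
                             __global const int *columns,
                             __global const int *frontier,
                             __global const int *frontier_size,
                             __global int *next_frontier,
                             __global volatile int *next_count,
                             __global volatile int *visited)
{
    /* all 4 work items stride over the frontier, whatever its size */
    for (int i = get_global_id(0); i < *frontier_size; i += get_global_size(0)) {
        int node = frontier[i];
        for (int e = row_offsets[node]; e < row_offsets[node + 1]; ++e) {
            int nbr = columns[e];
            if (atomic_cmpxchg(&visited[nbr], 0, 1) == 0)
                next_frontier[atomic_inc(next_count)] = nbr;
        }
    }
}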

What is the right way to do this? Am I missing something?

Upvotes: 0

Views: 132

Answers (2)

CaptainObvious

Reputation: 2565

From what I can infer from your question, I'd say that the short answer is both.

I say that because the way you want to solve a problem is always linked to how you posed that problem. The best illustration is the number of sorting algorithms that exist. From the pure sorting point of view ("I need my data sorted"), the problem has been solved for a long time. However, that didn't stop researchers from investigating new algorithms, because they added constraints and/or new input to the problem: "I need my data sorted as fast as possible" (the reason why algorithms are categorized with big O notation), "I need my data sorted as fast as possible knowing that the data structure is...", "knowing that there is an X% chance that...", or "I don't care about speed, but I care about memory", etc.

Now your problem seems to be: "I want a breadth-first search algorithm that runs efficiently on GPUs" (the "efficiently" part is my guess; why learn OpenCL/CUDA otherwise?). This simple sentence hides a lot of constraints. For instance, you have to take into account that:

  • It takes a lot of time to send the data through the PCIe bus (for discrete GPUs).
  • Global memory access latency is high.
  • Threads work in lock-step (the group size varies by vendor).
  • Latency is hidden with throughput, and so on.

Note also that this is not necessarily the same problem as "I want a parallel breadth-first search algorithm" (which could run on a CPU, with, again, different constraints). A quick Google search with the keywords "breadth first search parallel GPU" returns, among others, several articles that seem promising (I only went through the abstracts).

Upvotes: 1

Roman Arzumanyan

Reputation: 1814

Your question is one of the least obvious and most interesting ones.

IMO, you should implement the "pure", easy-to-analyze algorithm if you want your application to run on different hardware. Then, if someone else comes across your implementation, it will at least be easy to tweak.

Otherwise, if you put performance first, hard-code every single piece of the software to achieve optimal performance on a single target platform. Whoever works with your code later will have to learn the hardware's peculiarities anyway.

Upvotes: 1
