Batko
Batko

Reputation: 33

Open CL Running parallel tasks on data parallel kernel

I'm currently reading up on the OpenCL framework because of reasons regarding my thesis work. And what I've come across so far is that you can either run kernels in data parallel or in task parallel. Now I've got a question and I can't manage to find the answer.

Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data parallel process and just run it. Fairly simple.

However, now say that you have 10+ different vectors that need to be summed up also. Is it possible to run these 10+ different vectors in task parallel, while still using a kernel that processes them as "data parallel"?

So you basically parallelize tasks, which in a sense are run in parallel? Because what I've come to understand is that you can EITHER run the tasks parallel, or just run one task itself in parallel.

Upvotes: 1

Views: 332

Answers (2)

Lee
Lee

Reputation: 930

The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.

All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.

Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.

So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.

You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, each one iterates through a single vector copying it (that might well be the fastest way on a CPU for cache prefetching reasons). You could have each work-item copy one element using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, each for one kernel, and have the runtime decide if it can run them together.

Upvotes: 1

Dithermaster
Dithermaster

Reputation: 6333

If your 10+ different vectors are close to the same size, it becomes a data parallel problem.

The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

Upvotes: 0

Related Questions