OpenCL - Performance

Question

I'm working with OpenCL on a matrix whose values I increment, and I need the running time to be as low as possible. What is the best way to improve performance with OpenCL? I've read a little about data parallelism and task parallelism, but I don't understand them well.

The matrix is 64x56. Using task parallelism, I have created 64 kernel functions, one for each column, but I think I could do much better.

lawful_neutral · Accepted Answer

If you are executing the kernel on a GPU, it is usually better to have one thread (work-item) handle one element. However, it depends on what exactly you are doing with the elements of the matrix, e.g. how many operations you perform on each of them. If you only add a number to each element, there is very little work per element, so the finer-grained version might not be noticeably faster.

In general, there are three options:

One thread works on the whole matrix. There is no parallelism this way, which is bad for a GPU.
One thread works on one column or one row. 64 or 56 threads are used, and the global work size is 64 or 56.
One thread works on a single element. 64 x 56 = 3584 threads are used, and the global work size is {64, 56}.

Have you tried using a single kernel that handles one element, and calling clEnqueueNDRangeKernel for it with the global work size equal to {64, 56}? How does that affect the execution time?
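To make the difference between the per-column and per-element options concrete, here is a minimal sketch of both kernels. It is not the asker's code: the kernel names, the `delta` parameter, the float element type, and the row-major layout with 64 columns are all assumptions. The two kernels are valid OpenCL C, but the block also defines a few host-side stand-ins and a small CPU loop (`emulate_ndrange`, a hypothetical helper) that mimics the launch so the index mapping can be checked without an OpenCL runtime. It says nothing about real GPU timing.

```c
#include <assert.h>
#include <stddef.h>

#define COLS 64   /* matrix width (assumed: 64 columns, as in the question) */
#define ROWS 56   /* matrix height (assumed) */

/* Host-side stand-ins (assumptions, not part of OpenCL). They only let the
 * kernel bodies compile as plain C. In a real .cl file, drop these four
 * lines: `kernel`, `global` and get_global_id() come from OpenCL C. */
#define kernel
#define global
static size_t current_id[2];
static size_t get_global_id(unsigned dim) { return current_id[dim]; }

/* One work-item per column: each work-item loops over the 56 rows. */
kernel void add_per_column(global float *m, float delta)
{
    size_t col = get_global_id(0);            /* 0..63 */
    for (size_t row = 0; row < ROWS; ++row)
        m[row * COLS + col] += delta;
}

/* One work-item per element: no loop inside the kernel. */
kernel void add_per_element(global float *m, float delta)
{
    size_t col = get_global_id(0);            /* 0..63 */
    size_t row = get_global_id(1);            /* 0..55 */
    m[row * COLS + col] += delta;
}

typedef void (*kernel_fn)(float *, float);

/* CPU emulation of a 2D NDRange launch: runs the kernel once per
 * (x, y) index, sequentially, and returns how many work-items ran.
 *
 * With a real OpenCL runtime the per-element launch would instead be:
 *   size_t global_size[2] = { COLS, ROWS };
 *   clEnqueueNDRangeKernel(queue, k, 2, NULL, global_size, NULL,
 *                          0, NULL, NULL);
 * where `queue` and `k` are an existing cl_command_queue and cl_kernel
 * (hypothetical names). */
size_t emulate_ndrange(kernel_fn k, size_t gx, size_t gy,
                       float *m, float delta)
{
    size_t count = 0;
    assert(m != NULL && gx > 0 && gy > 0);
    for (size_t y = 0; y < gy; ++y) {
        for (size_t x = 0; x < gx; ++x) {
            current_id[0] = x;
            current_id[1] = y;
            k(m, delta);
            ++count;
        }
    }
    return count;
}
```

Calling `emulate_ndrange(add_per_column, COLS, 1, m, 1.0f)` runs 64 work-items, while `emulate_ndrange(add_per_element, COLS, ROWS, m, 1.0f)` runs 3584; both leave every element increased by the same amount. Either way a single kernel replaces the 64 separate kernel functions. Which of the two is faster on a given device has to be measured.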
