James Sweet

Reputation: 129

OpenCL clEnqueueTask Parallelism

I am writing some code that does AES decryption. The decryption itself works, but I want to add Cipher Block Chaining (CBC), which requires an XOR operation after each block is decrypted.

To make the code easier to write and understand, I used two kernels: one that decrypts a single block and one that does the XOR for the CBC part. For each 16-byte block of data I submit both kernels to the queue via clEnqueueTask, with an event expressing the dependency between the decryption and the XOR.

This turns out to be very slow. It works in the sense that the kernels run in the correct order, but the execution does not seem to be parallelized.

Does anyone know why, or how to improve the performance without losing the granularity?

Upvotes: 1

Views: 1359

Answers (2)

tbalazs

Reputation: 599

A kernel executed via clEnqueueTask is essentially single-threaded: the global work-size is 1, and the task occupies a whole compute unit for that single thread. This can have a large performance impact, because a typical GPU can execute 8-16 tasks/work-groups in parallel (CL_DEVICE_MAX_COMPUTE_UNITS), and each work-group can contain 256-1024 work-items (CL_DEVICE_MAX_WORK_GROUP_SIZE). So in your case you can achieve at most 8-16x parallelism instead of the theoretical maximum of roughly 15000x (for example, 16 compute units x 1024 work-items = 16384), because you cannot utilize the whole hardware.

Upvotes: 0

prunge

Reputation: 23248

clEnqueueTask is typically used for single-threaded tasks.

If your kernels can execute in parallel, use one clEnqueueNDRangeKernel call instead of lots of clEnqueueTask calls with different parameters.
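For CBC decryption in particular, each plaintext block depends only on ciphertext (block i and block i-1, or the IV for block 0), so the decrypt and XOR steps can be fused into one work-item per block. The following is a minimal sketch in plain C, not the asker's code: the loop stands in for the NDRange, `gid` stands in for what `get_global_id(0)` would return in a kernel, and `toy_decrypt_block` is a hypothetical placeholder for a real AES block decryption. All function names here are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16  /* AES block size in bytes */

/* HYPOTHETICAL stand-ins for AES. These are NOT real ciphers; they are
   only an invertible per-block transform so the sketch can run. */
static void toy_encrypt_block(const uint8_t *in, const uint8_t *key, uint8_t *out) {
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((uint8_t)(in[i] ^ key[i]) + 1u);
}

static void toy_decrypt_block(const uint8_t *in, const uint8_t *key, uint8_t *out) {
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((uint8_t)(in[i] - 1u) ^ key[i]);
}

/* Body of one work-item: decrypt block `gid`, then XOR with the previous
   ciphertext block (or the IV for block 0). It reads only ciphertext, so
   no work-item waits on another. `plain` must not alias `cipher`. */
static void cbc_decrypt_work_item(size_t gid, const uint8_t *cipher,
                                  const uint8_t *iv, const uint8_t *key,
                                  uint8_t *plain) {
    const uint8_t *prev = (gid == 0) ? iv : cipher + (gid - 1) * BLOCK;
    uint8_t tmp[BLOCK];
    toy_decrypt_block(cipher + gid * BLOCK, key, tmp);
    for (int i = 0; i < BLOCK; i++)
        plain[gid * BLOCK + i] = (uint8_t)(tmp[i] ^ prev[i]);
}

/* Emulates an NDRange of `nblocks` work-items. The blocks are processed
   in reverse order on purpose, to show that the order does not matter. */
static void cbc_decrypt_all(size_t nblocks, const uint8_t *cipher,
                            const uint8_t *iv, const uint8_t *key,
                            uint8_t *plain) {
    for (size_t gid = nblocks; gid-- > 0; )
        cbc_decrypt_work_item(gid, cipher, iv, key, plain);
}

/* CBC encryption is inherently serial; it is used here only to produce
   test data for the round trip. */
static void cbc_encrypt_serial(size_t nblocks, const uint8_t *plain,
                               const uint8_t *iv, const uint8_t *key,
                               uint8_t *cipher) {
    const uint8_t *prev = iv;
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t x[BLOCK];
        for (int i = 0; i < BLOCK; i++)
            x[i] = (uint8_t)(plain[b * BLOCK + i] ^ prev[i]);
        toy_encrypt_block(x, key, cipher + b * BLOCK);
        prev = cipher + b * BLOCK;
    }
}

/* Returns 1 if the out-of-order decryption recovers the plaintext. */
int cbc_demo_roundtrip(void) {
    enum { NBLOCKS = 4 };
    uint8_t key[BLOCK], iv[BLOCK];
    uint8_t plain[NBLOCKS * BLOCK], cipher[NBLOCKS * BLOCK], out[NBLOCKS * BLOCK];
    for (int i = 0; i < BLOCK; i++) {
        key[i] = (uint8_t)(i * 7 + 3);
        iv[i] = (uint8_t)(255 - i);
    }
    for (int i = 0; i < NBLOCKS * BLOCK; i++)
        plain[i] = (uint8_t)(i * 13);
    cbc_encrypt_serial(NBLOCKS, plain, iv, key, cipher);
    cbc_decrypt_all(NBLOCKS, cipher, iv, key, out);
    return memcmp(plain, out, sizeof plain) == 0;
}
```

On a real device, the body of `cbc_decrypt_work_item` would presumably become the kernel, and the host would issue a single clEnqueueNDRangeKernel call with a global work-size equal to the number of 16-byte blocks, instead of two clEnqueueTask calls plus an event per block. Writing the plaintext to a separate output buffer matters here, since each work-item reads the neighbouring ciphertext block.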

Something else that might prevent good parallel performance is lots of global memory access. If you are doing lots of reads/writes of global memory in your kernel compared to the amount of computation, that might be slowing you down depending on your hardware.

Upvotes: 1
