Reputation: 688
I have initially work units with the size of 11*11*6779
. For the sake of simplicity I dont want to translate it into 1D global work size. When when i changed it into 21*21*6779
the performance is 5-6x slower than before. the code as far as i know has nothing to do with the number of threads being ran.
The amount of data transfered is only 4x bigger, which I dont think is a reason why the programm runs slower, because i tested the memory allocation process.
Note that my device has a max work items of 256*256*256
, meaning I would be use half of all available work items, and this is not a dedicated device (also used for display..).
I wonder if setting the work item sizes into 21*21*6779
uses too many of my work items, or the dimensions are simply inconvenient for openCL to adjust ?
Upvotes: 0
Views: 67
Reputation: 6343
If your max work items is 256x256x256 then why are you using 21x21x6779 (where 6779 is greater than 256)? Note that if the work group size is not specified, the runtime will try to pick one that can divide up your global work size. If your dimensions not easily divisible by the runtime, it might pick bad work group sizes. That could explain why the performance changes based on global work size. I recommend you specify the work group size, and make the global work size a multiple of that (if necessary, pass in the real size as parameters and in each work item check if it is in range; this is a typical pattern you will see a lot in OpenCL).
Upvotes: 1