ixeption

Reputation: 2060

Reordering of work dimensions may cause a huge performance boost, but why?

I am using OpenCL for stereo image processing on the GPU. After porting a C++ implementation to OpenCL, I was playing around with optimizations. A very simple experiment was to swap the two work dimensions.

Consider a simple kernel which is executed for every pixel of a two-dimensional work space (e.g. 640x480). In my case it was a Census transform.

Swapping from:

int globalU = get_global_id(0);
int globalV = get_global_id(1);

to:

int globalU = get_global_id(1);
int globalV = get_global_id(0);

while adjusting the NDRange in the same way, gave a performance boost of about 500%. Other experiments in 3D space brought the execution time down from 72 ms to 2 ms, purely by reordering the dimensions.
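To make it concrete, here is a stripped-down sketch of the two variants (simplified placeholder code, not my actual Census kernel; it assumes the usual row-by-row buffer layout, index = globalV * width + globalU):

__kernel void variant_a(__global const uchar* src,
                        __global uchar* dst,
                        const int width)
{
    int globalU = get_global_id(0); // dimension 0 -> column
    int globalV = get_global_id(1); // dimension 1 -> row
    // consecutive work-items along dimension 0 touch consecutive addresses
    dst[globalV * width + globalU] = src[globalV * width + globalU];
}

__kernel void variant_b(__global const uchar* src,
                        __global uchar* dst,
                        const int width)
{
    int globalU = get_global_id(1); // dimension 1 -> column
    int globalV = get_global_id(0); // dimension 0 -> row
    // consecutive work-items along dimension 0 now touch addresses a full row apart
    dst[globalV * width + globalU] = src[globalV * width + globalU];
}

With the NDRange adjusted to match (global size (width, height) for the first variant, (height, width) for the second), both compute the same result; the only difference is which image direction the fast-varying dimension 0 walks along.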

Can anybody explain to me how this happens? Is it just an effect of memory pipelines and cache usage?

EDIT: The image has a standard memory layout. That's why I wondered about the effects. I expected the best speed when the iteration follows the order in which the image is stored in memory, which is not the case.

After some reading of the AMD APP SDK documentation, I found some interesting details about the memory channels. That could be a reason.

Upvotes: 2

Views: 127

Answers (1)

Ivan Mushketyk

Reputation: 8295

When you access an element in memory, it is first loaded into the CPU's cache. The CPU does not load a single element (say, 1 byte); instead it loads a whole cache line (for example, 64 adjacent bytes). This is because you are likely to access the subsequent elements soon, and then the CPU does not need to access RAM again.

This makes a huge difference, since to access cache memory an electrical signal does not even have to leave the CPU chip, while if the CPU needs to load data from RAM, the signal has to travel to a separate chip, and usually more than one signal is required, since the CPU generally has to specify a row and a column in RAM to access part of it (read "What Every Programmer Should Know About Memory" for more details). In practice a cache access may take only 0.5 ns, while a RAM access costs around 100 ns, roughly a factor of 200.

So computer algorithms should take this into account. If you traverse all elements of a matrix, you should traverse them so that elements located near each other in memory are accessed at roughly the same time. So if your matrix has the following layout in memory:

m_0_0, m_0_1, m_0_2, ... m_1_0, m_1_1 (first row, then second row, etc.)

you should access the elements in the order m_0_0, m_0_1, m_0_2 (row by row).

If you use a different access order (column by column in this case), the CPU loads part of the first row into the cache when you access the first element of the first row, then part of the second row when you access the first element of the second row, and so on. By the time you have traversed the first column and come back for the second element of the first row, the cached values for the first row are no longer present, since the cache has a limited (and very small) size. Such an algorithm therefore eliminates the benefit of the cache entirely.
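You can observe this with a plain C sketch (the matrix size and layout here are my own illustration, not taken from the question):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void)
{
    /* One contiguous block, laid out row by row: element (i, j) is m[i * N + j] */
    int *m = calloc((size_t)N * N, sizeof *m);
    if (!m) return 1;

    long sum = 0;
    clock_t t;

    /* Cache-friendly: walk the memory in storage order */
    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("storage order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Cache-hostile: each access jumps N elements ahead, so nearly
       every access misses the cache and goes to RAM */
    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("strided order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("%ld\n", sum); /* keep the compiler from removing the loops */
    free(m);
    return 0;
}

On a typical machine the second loop runs many times slower, even though both loops perform exactly the same number of additions.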

Upvotes: 1
