ixeption

Reputation: 2060

Reordering of work dimensions may cause a huge performance boost, but why?

I am using OpenCL for stereo image processing on the GPU. After porting a C++ implementation to OpenCL, I was playing around with optimizations. A very simple experiment was to swap the two work dimensions.

Consider a simple kernel which is executed for every pixel of a two-dimensional work space (e.g. 640x480). In my case it was a Census transform.

Swapping from:

int globalU = get_global_id(0);
int globalV = get_global_id(1);

to:

int globalU = get_global_id(1);
int globalV = get_global_id(0);

while adjusting the NDRange in the same way, gave a performance boost of about 500%. Other experiments in 3D space brought the execution time down from 72 ms to 2 ms, purely by reordering the dimensions.
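To make it concrete, here is a stripped-down sketch of the two variants (simplified placeholder code, not my actual Census kernel; it assumes the usual row-by-row buffer layout, index = globalV * width + globalU):

__kernel void variant_a(__global const uchar* src,
                        __global uchar* dst,
                        const int width)
{
    int globalU = get_global_id(0); // dimension 0 -> column
    int globalV = get_global_id(1); // dimension 1 -> row
    // consecutive work-items along dimension 0 touch consecutive addresses
    dst[globalV * width + globalU] = src[globalV * width + globalU];
}

__kernel void variant_b(__global const uchar* src,
                        __global uchar* dst,
                        const int width)
{
    int globalU = get_global_id(1); // dimension 1 -> column
    int globalV = get_global_id(0); // dimension 0 -> row
    // consecutive work-items along dimension 0 now touch addresses a full row apart
    dst[globalV * width + globalU] = src[globalV * width + globalU];
}

With the NDRange adjusted to match (global size (width, height) for the first variant, (height, width) for the second), both compute the same result; the only difference is which image direction the fast-varying dimension 0 walks along.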

Can anybody explain to me how this happens? Is it just an effect of memory pipelines and cache usage?

EDIT: The image has a standard memory layout. That's why I wondered about the effects. I expected the best speed when the iteration follows the order in which the image is stored in memory, which is not the case.

After some reading of the AMD APP SDK documentation, I found some interesting details about the memory channels. That could be a reason.

Upvotes: 2

Views: 127

Answers (1)

Ivan Mushketyk

Reputation: 8295

When you access an element in memory, it is first loaded into the CPU's cache. The CPU does not load a single element (say, 1 byte); instead it loads a whole cache line (for example, 64 adjacent bytes). This is because you are likely to access the subsequent elements soon, and then the CPU does not need to access RAM again.

This makes a huge difference, since to access cache memory an electrical signal does not even have to leave the CPU chip, while if the CPU needs to load data from RAM, the signal has to travel to a separate chip, and usually more than one signal is required, since the CPU generally has to specify a row and a column in RAM to access part of it (read "What Every Programmer Should Know About Memory" for more details). In practice a cache access may take only 0.5 ns, while a RAM access costs around 100 ns, roughly a factor of 200.

So computer algorithms should take this into account. If you traverse all elements of a matrix, you should traverse them so that elements located near each other in memory are accessed at roughly the same time. So if your matrix has the following layout in memory:

m_0_0, m_0_1, m_0_2, ... m_1_0, m_1_1 (first row, then second row, etc.)

you should access the elements in the order m_0_0, m_0_1, m_0_2 (row by row).

If you use a different access order (column by column in this case), the CPU loads part of the first row into the cache when you access the first element of the first row, then part of the second row when you access the first element of the second row, and so on. By the time you have traversed the first column and come back for the second element of the first row, the cached values for the first row are no longer present, since the cache has a limited (and very small) size. Such an algorithm therefore eliminates the benefit of the cache entirely.
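You can observe this with a plain C sketch (the matrix size and layout here are my own illustration, not taken from the question):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void)
{
    /* One contiguous block, laid out row by row: element (i, j) is m[i * N + j] */
    int *m = calloc((size_t)N * N, sizeof *m);
    if (!m) return 1;

    long sum = 0;
    clock_t t;

    /* Cache-friendly: walk the memory in storage order */
    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("storage order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Cache-hostile: each access jumps N elements ahead, so nearly
       every access misses the cache and goes to RAM */
    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("strided order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    printf("%ld\n", sum); /* keep the compiler from removing the loops */
    free(m);
    return 0;
}

On a typical machine the second loop runs many times slower, even though both loops perform exactly the same number of additions.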

Upvotes: 1
