mmain

Reputation: 353

float4 - multiply-add - performance tips OpenCL

I am working on an image-processing application that handles gray-scale images only. GPU occupancy is limited by the growing number of vector registers and the amount of local memory used per work-group.

The read_imagef() function returns a float4, but my application uses only the first three components of the float4, so every computation carries an extra float operation, which increases execution time.

Nevertheless, the kernel also performs many multiply-add operations on float4 values.

How can I optimize this kernel so that it uses fewer vector registers, and are there any tips or tricks to speed up the MAD operations? (I have already tried the hardware-supported function, and performance went down.)

Upvotes: 1

Views: 1817

Answers (3)

kangshiyin

Reputation: 9781

If you work with gray-scale images only, you could implement your own 'read_imagef()' that reads only one channel of the image, so that everything you deal with is a float.


As your data may be interleaved in memory as RGBRGB..., loading only the R channel probably costs the same time as loading all channels. This is the array-of-structures situation. You can find more details here:

Structure of Arrays vs Array of Structures in cuda

Given the data layout, you could load a float4/float3, extract one float channel from it, and then do the computation on the extracted float.
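As a rough illustration of that idea, here is a minimal sketch in plain C standing in for the kernel body. The `pixel4` struct and the `gray_mad` function are hypothetical names, and the sketch assumes the gray value sits in the first channel of each interleaved pixel.

```c
#include <stddef.h>

/* Hypothetical pixel layout: four interleaved floats per pixel, similar to
   the float4 that read_imagef() returns. */
typedef struct { float x, y, z, w; } pixel4;

/* Load the whole pixel, keep only the first channel, and do the
   multiply-add on that single float instead of on all four lanes. */
static void gray_mad(const pixel4 *src, float *dst, size_t n,
                     float scale, float offset)
{
    for (size_t i = 0; i < n; ++i) {
        pixel4 p = src[i];            /* one wide load of the full pixel */
        float g = p.x;                /* assumed: gray value is in channel 0 */
        dst[i] = g * scale + offset;  /* scalar multiply-add */
    }
}
```

Only one float per pixel stays live after the load, which is the point of the suggestion; whether that actually lowers register pressure depends on the compiler and the device.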

kernel perform many Multiply Add ops also on float4

I don't see why your kernel has to do those ops on float4. Maybe you could show some code that demonstrates that.

Upvotes: 2

huseyin tugrul buyukisik

Reputation: 11920

If it returns a float4 and does so with the same number of internal memory operations as a float3 would need, then the latency is the same. A MAD operation has much lower latency than a memory operation.

As far as I know there is no float3 hardware, so you can compute the 3 elements one by one if it is a scalar microarchitecture (such as a newer GPU). If it is VLIW-4, the speed is the same whether or not the 4th element is used.

Upvotes: 1

Mats Petersson

Reputation: 129374

It's hard to be specific here, since it depends on what hardware it is and what you are doing to the image content. You may be better off treating the image as a plain buffer of bytes and doing your own conversion to float, but that risks adding more CL code and reduces the use of the GPU's texture unit, which does the conversion for you (assuming there is HW for this).
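A minimal sketch of the plain-buffer variant, written in ordinary C as a stand-in for kernel code. It assumes 8-bit gray pixels and a normalization to [0.0, 1.0], which is what read_imagef() produces for a CL_UNORM_INT8 image; `unorm8_to_float` and `gray_bytes_mad` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Manual replacement for the texture unit's conversion: map 0..255 to
   0.0..1.0 (assumed 8-bit normalized format). */
static float unorm8_to_float(uint8_t v)
{
    return (float)v / 255.0f;
}

/* Read one byte per pixel, convert it by hand, then multiply-add on a
   single float. */
static void gray_bytes_mad(const uint8_t *src, float *dst, size_t n,
                           float scale, float offset)
{
    for (size_t i = 0; i < n; ++i) {
        float g = unorm8_to_float(src[i]);
        dst[i] = g * scale + offset;
    }
}
```

The conversion is the extra code mentioned above; whether it costs more than it saves can only be settled by measuring on the target device.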

Another option is to do four read_imagef calls, "merge" the values into one float4, do your math, and split the result.

Unfortunately, although OpenCL is "portable", it's not "portable with known performance": what works well in one OpenCL implementation may not work well in another, and tuning the algorithm for performance requires a good understanding of the architecture as a whole.
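The merge/compute/split pattern could look roughly like this. It is a plain C sketch in which the `vec4` struct stands in for float4 and each source element stands in for the gray value of one read_imagef call; `vec4`, `mad4` and `gray_mad_packed` are hypothetical names.

```c
#include <stddef.h>

/* Stand-in for an OpenCL float4. */
typedef struct { float s[4]; } vec4;

/* Four-wide multiply-add: every lane carries a useful gray value. */
static vec4 mad4(vec4 a, float scale, float offset)
{
    vec4 r;
    for (int k = 0; k < 4; ++k)
        r.s[k] = a.s[k] * scale + offset;
    return r;
}

static void gray_mad_packed(const float *src, float *dst, size_t n,
                            float scale, float offset)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* "merge": four gray pixels into one vector */
        vec4 v = {{ src[i], src[i + 1], src[i + 2], src[i + 3] }};
        v = mad4(v, scale, offset);
        /* "split": write the four results back out */
        for (int k = 0; k < 4; ++k)
            dst[i + k] = v.s[k];
    }
    /* leftover pixels when n is not a multiple of 4 */
    for (; i < n; ++i)
        dst[i] = src[i] * scale + offset;
}
```

Packing this way keeps all four lanes busy on vector hardware, at the price of each work-item handling four pixels.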

Upvotes: 0
