Portability and Optimization of OpenCL between Radeon Graphic Cards

Question

I'm planning on diving into OpenCL and have been reading (only surface knowledge) on what OpenCL can do, but have a few questions.

Let's say I have an AMD Radeon 7750 and I have another computer that has an AMD Radeon 5870 and no plans on using a computer with an Nvidia card. I heard that optimizing the code for a particular device brings performance benefits. What exactly does optimizing mean? From what I've read and a little bit of guessing, it sounds like it means writing the code in a way that a GPU likes (in general without concern that it's an AMD or Nvidia card) as well as in a way that matches how the graphics card handles memory (I'm guessing this is compute device specific? Or is this only brand specific?).

So if I write code and optimized it for the Radeon 7750, would I be able to bring that code to the other computer with the Radeon 5870 and, without changing any part of the code, still retain a reasonable amount of performance benefits from the optimization? In the event that the code doesn't work, would changing parts of the code be a minor issue or would it involve rewriting enough code that it would have been a better idea to have written an optimized code for the Radeon 5870 in the first place.

K. Brafford · Accepted Answer

Without more information about the algorithms and applications you intend to write, the question is a little vague. But I think I can give you some high-level strategies to keep in mind as you develop your code for these two different platforms.

The Radeon 7750's design is of the new Graphics Core Next architecture, while your HD5780 is based on the older VLIW5 (RV770) Architecture.

For your code to perform well on the HD5780 hardware you must make as heavy use of the packed primitive datatypes as possible, especially the int4, float4 types. This is because the OpenCL compiler has a difficult time automatically discovering parallelism and packing data into the wide vectors for you. If you can structure your code so that you already have taken this into account, then you will be able to fill more of the VLIW-5 slots and thus use more of your stream processors.

GCN is more like NVidia's Fermi architecture, where the code's path to the functional units (ALUs, etc.) of the stream processors does not go through explicitly scheduled VLIW instructions. So more parallelism can be automatically detected at runtime and keep your functional units busy doing useful work without you having to think as hard about how to make that happen.

Here's an over-simplified example to illustrate my point:

// multiply four factors
// A[0] = B[0] * C[0]
// ...
// A[3] = B[3] * C[3];

float *A, *B, *C;

for (i = 0; i < 4; i ++) {
  A[i] = B[i] * C[i];
}

That code will probably run ok on a GCN architecture (except for suboptimal memory access performance--an advanced topic). But on your HD5870 it would be a disaster, because those four multiplies would take up 4 VLIW5 instructions instead of 1! So you would write that above code using the float4 type:

float4 A, B, C;

A = B * C;

And it would run really well on both of your cards. Plus it would kick ass on a CPU OpenCL context and make great use of MMX/SSE wide registers, a bonus. It's also a much better use of the memory system.

An a tiny nutshell, using the packed primitives is the one thing I can recommend to keep in mind as you start deploying code on these two systems at the same time.

Here's one more example that more clearly illustrates what you need to be careful doing on your HD5870. Say we had implemented the previous example using separate work-units:

// multiply four factors
// as separate work units
// A = B * C

float A, B, C;

A = B * C;

And we had four separate work units instead of one. That would be an absolute disaster on the VLIW device and would show tremendously better performance on the GCN device. That is something you will want to also look for when you are writing your code--can you use float4 types to reduce the number of work units doing the same work? If so, then you will see good performance on both platforms.

Portability and Optimization of OpenCL between Radeon Graphic Cards

Answers (1)

Related Questions