Reputation: 1
Assume I have an OpenCL kernel where each work-item does one int32 operation, and my GPU supports 256-bit SIMD operations. Will OpenCL be able to pack 8 work-items together to take advantage of SIMD, i.e. one processing element executing multiple work-items simultaneously? If so, when does this happen: at the "clBuildProgram" stage, or when the binary is actually executed on the GPU (JIT compilation)?
The second option seems more reasonable, because this can only be decided after the work-group size is defined. For example, if I specify 1 work-item per work-group, then the vectorization cannot happen, can it?
I looked at the NVIDIA PTX generated after "clBuildProgram" and I still saw scalar IR, but I'm not sure about Intel or AMD.
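For concreteness, the kind of kernel described above might look like this (a minimal sketch; the kernel and argument names are hypothetical):

```c
// OpenCL C: each work-item performs a single 32-bit integer operation.
// Whether the device compiler packs several such work-items into one
// SIMD instruction is its decision; nothing in the kernel source forces it.
__kernel void add_one(__global const int *in, __global int *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] + 1;
}
```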
Upvotes: 0
Views: 622
Reputation: 20396
Generally speaking, if the GPU is going to perform SIMD instructions on your data, it'll decide that when your code is compiled (whether by an online compiler, or by an offline compiler). It's probably not going to decide that based on how/when you define your workgroups.
As for whether your data will be vectorized or not... That's a bit more complicated.
It depends on how exactly you've laid out your data and the logic of your kernel, as well as how much the (presumably online) compiler chooses to optimize your code. It ALSO depends heavily on the actual hardware, but I'll talk about that in a moment.
Explicitly vectorized data types (float4, int4, float8, etc.) are the easiest to vectorize, and probably don't even require an optimization pass to do so, since the code is pretty explicitly saying "this data all belongs together and is (probably) going to have the same operations applied to it, so if you have the hardware to do it (but as I'll explain below, that's a rather big 'if'), let's use the SIMD instructions for these types!"

Scalar data, by contrast, only gets vectorized if the optimizer spots an opportunity on its own, something like "hey, look, there are four int​s called i1, i2, i3, i4, and they're all having the same operations applied to them, so let's SIMD them!"

The big thing to bear in mind is that all of this depends on the hardware capabilities of your card. At least among consumer-grade compute cards (translated: GPUs), the hardware engineers are actually not making significant upgrades to their vectorization capabilities, and in fact are often choosing to cut back on vectorization to focus on making smaller cores which they can then stack more of onto the chip. It's a nice luxury, for example, to have a card with 128 cores, each of which can do 256-bit SIMD instructions, but oftentimes it's a lot easier to just have a card with tiny cores which don't (or can't) handle SIMD instructions, and simply stack so many of them (like on NVIDIA's most recent launch, upwards of 4k) that they can run in parallel, doing the same work (often faster) without depending on the programmer writing explicit SIMD instructions.
I do believe (but don't quote me on this) that both AMD and NVIDIA guarantee 128-bit vectorization for floats, because float4-type objects are extremely common in graphics programming. If you're doing any kind of graphics processing (which is the norm for these kinds of applications), those objects will benefit greatly from SIMD operations, but anything beyond that probably won't see any SIMD optimizations.
Upvotes: 2