Reputation: 1
Assume I have an OpenCL kernel where each work-item does one int32 operation, and my GPU supports 256-bit SIMD operations. Will OpenCL be able to pack 8 work-items together to take advantage of SIMD, i.e. one processing element executing multiple work-items simultaneously? If so, when does this happen: at the "clBuildProgram" stage, or when the binary is actually executed on the GPU (JIT compilation)?
The second option seems more reasonable, because this can only be decided after the work-group size is defined. For example, if I specify 1 work-item per work-group, then the vectorization cannot happen, can it?
I looked at the NVIDIA PTX generated after "clBuildProgram" and I still saw scalar IR, but I'm not sure about Intel or AMD.
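For concreteness, the kind of kernel described above might look like this (a minimal sketch; the kernel and argument names are hypothetical):

```c
// OpenCL C: each work-item performs a single 32-bit integer operation.
// Whether the device compiler packs several such work-items into one
// SIMD instruction is its decision; nothing in the kernel source forces it.
__kernel void add_one(__global const int *in, __global int *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] + 1;
}
```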
Upvotes: 0
Views: 622
Reputation: 20396
Generally speaking, if the GPU is going to perform SIMD instructions on your data, it'll decide that when your code is compiled (whether by an online compiler, or by an offline compiler). It's probably not going to decide that based on how/when you define your workgroups.
As for whether your data will be vectorized or not... That's a bit more complicated.
It depends on how exactly you've laid out your data and the logic of your kernel, as well as how much the (presumably online) compiler chooses to optimize your code. It ALSO depends heavily on the actual hardware, but I'll talk about that in a moment.
Explicitly vectorized data types (float4, int4, float8, etc.) are the easiest to vectorize, and probably don't even require an optimization pass to do so, since the code is pretty explicitly saying "this data all belongs together and is (probably) going to have the same operations applied to it, so if you have the hardware to do it (but as I'll explain below, that's a rather big 'if'), let's use the SIMD instructions for these types!"

Scalar data, by contrast, only gets vectorized if the optimizer spots an opportunity on its own, something like "hey, look, there are four int​s called i1, i2, i3, i4, and they're all having the same operations applied to them, so let's SIMD them!"

The big thing to bear in mind is that all of this depends on the hardware capabilities of your card. At least among consumer-grade compute cards (translated: GPUs), the hardware engineers are actually not making significant upgrades to their vectorization capabilities, and in fact are often choosing to cut back on vectorization to focus on making smaller cores which they can then stack more of onto the chip. It's a nice luxury, for example, to have a card with 128 cores, each of which can do 256-bit SIMD instructions, but oftentimes it's a lot easier to just have a card with tiny cores which don't (or can't) handle SIMD instructions, and simply stack so many of them (like on NVIDIA's most recent launch, upwards of 4k) that they can run in parallel, doing the same work (often faster) without depending on the programmer writing explicit SIMD instructions.
I do believe (but don't quote me on this) that both AMD and NVIDIA guarantee 128-bit vectorization for floats, because float4-type objects are extremely common in graphics programming. If you're doing any kind of graphics processing (which is the norm for these kinds of applications), those objects will benefit greatly from SIMD operations, but anything beyond that probably won't see any SIMD optimizations.
Upvotes: 2