Reputation: 3232
I thought that a compute accelerator (GPU) is a bunch of SPs, "Stream Processors", each of which consists of some fixed number of ALU cores operating in SIMD fashion. But unlike CPU threads, SPs fire together, with some stride. That's also called coalescing.
So, for example, I have arrays A, B and C. I know and manage their internal structure (be it 1D or 5D arrays); that is of no interest to the GPU. I just tell it: "Take this pair of read-only memories A and B. Take that one write-only memory C. Perform some sequence of instructions N times."
GPU, which "best knows" its internal "SP" (or "CU") count, and caches, could just take that and cut task in same blocks.
So one side of the coin is that every DRAM is FLAT, so everything in a PC is one-dimensional in nature. I don't understand what 2D and 3D ranges are and what they are used for. Can't we just use 1D everywhere?
On the other side, let's assume there is a reason for this, since the OpenCL approach claims to be very flexible, to the point of forcing us to describe the internal structure of our arrays to it. But now say I have 42-dimensional data! Why is that unsupported, with only 3 dimensions allowed?
So what are local and global groups and NDRange dimensions, and how do you calculate them?
Could you provide some example where multidimensional ranges are crucial, or at least beneficial, to use? How do you split a job into local, cache, and global sizes?
Here is the list of parameters that I don't understand and am totally confused by:
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_DEVICE_IMAGE2D_MAX_HEIGHT
CL_DEVICE_IMAGE3D_MAX_HEIGHT
CL_DEVICE_MAX_SAMPLERS
CL_DEVICE_MAX_COMPUTE_UNITS
Is there some generic formula an average programmer can use to be sure that his job will be split efficiently on any GPU?
Upvotes: 2
Views: 1762
Reputation: 1119
But unlike CPU threads, SPs fire together, with some stride. That's also called coalescing.
You're mixing two different concepts here: memory access (coalescing) and program execution. Every PE in a CU executes the same instruction in lockstep (at least on most GPUs; there are some exceptions), but any stride or coalescing is up to the programmer. For example, I can write a kernel that runs with a global work size of 1000, but all 1000 work-items will access only 10 bytes of memory in total. Or just 1 byte. Or 10 megabytes in a random order. Memory access / coalescing is independent of the program execution ranges (global/local work sizes). In other words, the local/global ranges specify how many instances of your kernel will be launched; how each instance accesses memory has nothing to do with that.
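To make that concrete, here's a minimal sketch (the kernel and argument names are mine, not from any real code) of a kernel launched with a 1D global work size of 1000 where all work-items together still read only 10 bytes of input:
__kernel void read_ten_bytes(__global const uchar* src, __global uchar* dst) {
    size_t gid = get_global_id(0);   // 0..999 for a global work size of 1000
    dst[gid] = src[gid % 10];        // all 1000 reads hit the same 10 input bytes
}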
Every DRAM is FLAT, so everything in a PC is one-dimensional in nature. I don't understand what 2D and 3D ranges are and what they are used for. Can't we just use 1D everywhere?
Again, ranges have nothing to do with memory. As to why there are 2D/3D ranges:
Let's say you have an 800x600 2D image and you want to run a Sobel filter on it. If you only had a 1D range, you could run your kernel on each pixel with a 1D global size of 480000. But the Sobel filter requires pixels from the previous and next image lines, so you'd have to recalculate the x and y of the current pixel from the 1D value, and that requires a division and a modulo. Both are slow, and you'd need to do that for every pixel. The point of having 2D/3D ranges is that get_global_id and friends are hardware-accelerated. Usually some hardware in the GPU (scheduler, CU or PE) keeps track of the x, y, z of the currently executing work-item in special registers, and get_global_id translates to a single instruction that reads those registers.
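A sketch of the difference (the actual filter body is omitted, and "width" is an illustrative parameter):
// 1D launch, global size = width*height: coordinates must be recomputed
__kernel void filter_1d(__global const float* img, __global float* out, int width) {
    int gid = get_global_id(0);              // 0 .. width*height-1
    int x = gid % width;                     // slow integer modulo,
    int y = gid / width;                     // slow integer division, per pixel
    out[y * width + x] = img[y * width + x]; // neighbour taps omitted
}

// 2D launch, global size = {width, height}: coordinates come from hardware
__kernel void filter_2d(__global const float* img, __global float* out, int width) {
    int x = get_global_id(0);                // single instruction on most GPUs
    int y = get_global_id(1);
    out[y * width + x] = img[y * width + x]; // neighbour taps omitted
}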
Now say I have 42-dimensional data! Why is that unsupported, with only 3 dimensions allowed?
Because the GPU architects didn't see a point in accelerating get_global_id & friends for more than 3 dimensions.
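Nothing stops you from flattening higher dimensions yourself, though. Here's a sketch for hypothetical 4D data with made-up extents d0..d3, doing by hand exactly the div/mod work the hardware saves you in the 2D/3D case:
__kernel void kernel_4d(__global float* data, int d1, int d2, int d3) {
    size_t gid = get_global_id(0);          // 1D launch of d0*d1*d2*d3 work-items
    int i3 = gid % d3;                      // innermost index
    int i2 = (gid / d3) % d2;
    int i1 = (gid / (d3 * d2)) % d1;
    int i0 =  gid / (d3 * d2 * d1);         // outermost index
    data[gid] = (float)(i0 + i1 + i2 + i3); // placeholder use of the indices
}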
Upvotes: 0
Reputation: 1103
Ok, I'll try to explain this as best I can, but you have asked a whole bunch of questions in a single post and it seems you're missing the basic abstractions of OpenCL.
Host: the host is what decides what happens with OpenCL. That's the processor that runs your program.
Compute device: this is the hardware on which your code will run. A graphics card (GPU) is a single device, and so is a multi-core CPU. Should you have two GPUs on your machine, you could have your program run on both devices at the same time.
Compute unit: within a device, all the cores (CUDA cores for Nvidia, stream processors for AMD) are split into groups that share a common local memory. Each compute unit can be seen conceptually as a small SIMD processor. Group sizes vary from one device to another, but it's typically 32 or 64. (For my GTX 970, I have 1664 CUDA cores in 13 compute units, so that's 128.) I don't think there's a direct way to query that with clGetDeviceInfo, but you can easily figure it out for a given graphics card.
Processing element: this is how we name a single core within the GPU. Processing elements do not have any memory, just registers. Note that at any given time, every processing element of the same compute unit runs the same program in sync. If there is a bunch of branching (if/else statements) in your code and some processing elements take a different branch than the others, all the others will wait, doing nothing.
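A small made-up kernel to illustrate that divergence cost: within one compute unit the two branches execute one after the other, and the processing elements on the inactive branch sit idle.
__kernel void divergent(__global float* data) {
    size_t gid = get_global_id(0);
    if (gid % 2 == 0) {
        data[gid] = data[gid] * 2.0f;   // even work-items execute this...
    } else {
        data[gid] = data[gid] + 1.0f;   // ...while odd ones wait, then they swap
    }
}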
Program: that's more or less clear even to me. It's your program that is loaded in memory and needs to be built/compiled. A program might contain multiple functions (kernels) that are called individually.
Kernel: simply put, that's the function that you will be running on your device. An instance of a kernel will run on a processing element. There are many, many instances of your kernel running at the same time. Within the kernel, it is possible to get some information about the processing element it's being run on. That is done by some basic functions that are closely tied with the parameters of clEnqueueNDRangeKernel (see below).
Memory: in terms of memory, each compute device (GPU) has a global memory that you can write to from the host. Then, each compute unit has a limited amount of local memory (CL_DEVICE_LOCAL_MEM_SIZE) that is shared by the processing elements of that compute unit. There are a number of limitations with respect to the size of buffers you can allocate, but usually that's not a problem. You can query the different CL_DEVICE_x parameters to get those numbers. The global memory has a 'constant' part, but I won't discuss it as it won't bring anything to the discussion.
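As a sketch of how you'd query a few of those limits from the host (assuming "device" already holds a valid cl_device_id, and that it reports the minimum of 3 work-item dimensions):
#include <stdio.h>
#include <CL/cl.h>

static void print_device_limits(cl_device_id device) {
    cl_ulong local_mem = 0, global_cache = 0;
    cl_uint compute_units = 0;
    size_t max_item_sizes[3] = {0, 0, 0}; /* assumes 3 dimensions */

    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                    sizeof(global_cache), &global_cache, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    printf("local mem: %llu bytes, global mem cache: %llu bytes\n",
           (unsigned long long)local_mem, (unsigned long long)global_cache);
    printf("compute units: %u, max work-item sizes: %zu x %zu x %zu\n",
           compute_units, max_item_sizes[0],
           max_item_sizes[1], max_item_sizes[2]);
}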
When you want to perform calculations on a GPU, you need a kernel and some buffers in GPU memory. The host (CPU) should transfer your data into those buffers in global memory. It should also set the arguments required by the kernel. Then it should tell the GPU to invoke the kernel using clEnqueueNDRangeKernel. This function has quite a few parameters...
globalWorkSize: the number of times your kernel has to run to solve your problem, per dimension. The number of dimensions is up to you; the standard says a compute device needs to support at least 3 dimensions, but some devices might support more. It doesn't really matter much, as any ND problem can be split into multiple 1D problems.
localWorkSize: the size of a work group, i.e. the number of work-items that execute together on one compute unit, per dimension. Usually, you want to use a value that corresponds to the number of processing elements in your compute units (usually 32 or 64, see above). Note that localWorkSize must divide globalWorkSize evenly: 0 == (globalWorkSize % localWorkSize).
Let's put this into an example. Say I have a 1D array of 1024 numbers and I just want to square each value in that array. The globalWorkSize is 1024, because I want each number to be processed independently, and I would set the localWorkSize to the highest count of processing elements in my compute unit that evenly divides 1024 (I will use 128 for my GTX 970). My problem is one-dimensional, so I'll pass 1 for the dimension parameter.
Keep in mind that if you use a number smaller (or larger) than the number of processing elements in your compute units, the others will just burn clock cycles doing nothing. I could have chosen a localWorkSize of 2, but then each compute unit would have wasted 126 of its 128 processing elements, and that's not really efficient.
By setting globalWorkSize = 1024 and localWorkSize = 128, I just told my GPU to run the kernel 1024 times, split into (1024/128 = 8) work groups dispatched to the compute units. I'll have 1024 processing elements (CUDA cores) each performing the operation on 1 element of my buffer.
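For reference, a minimal host-side sketch of that exact launch (error checking omitted; "queue", "kernel" and "buffer" are assumed to have been created already):
size_t globalWorkSize = 1024;
size_t localWorkSize = 128;    /* must divide globalWorkSize evenly */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
clEnqueueNDRangeKernel(queue, kernel,
                       1,                /* work_dim: the problem is 1D */
                       NULL,             /* no global offset */
                       &globalWorkSize,  /* 1024 kernel instances in total */
                       &localWorkSize,   /* 128 work-items per work group */
                       0, NULL, NULL);   /* no event dependencies */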
Now, how does the processing element know what value it must calculate in my buffer? That's where the work-item functions come into play.
There are a few of them, but for this example I will only care about get_global_id(uint dimindx). It returns the global id for a given dimension, based on the globalWorkSize. In our case, our problem is 1D, so get_global_id(0) will return an index in [0, globalWorkSize). The index is different for every work-item.
Sample kernel:
__kernel void MakeSquared(__global double* values) {
    // double support requires the cl_khr_fp64 extension
    size_t idx = get_global_id(0);
    values[idx] = values[idx] * values[idx];
}
EDIT: example with local memory usage:
__kernel void MakeSquared(__global double* values, __local double* lValues) {
    size_t idx = get_global_id(0);
    size_t localId = get_local_id(0);
    lValues[localId] = values[idx];   // stage the value in fast local memory
    // potentially some complex calculations
    // (a barrier(CLK_LOCAL_MEM_FENCE) would be needed here if work-items
    //  read each other's lValues entries)
    lValues[localId] = lValues[localId] * lValues[localId];
    values[idx] = lValues[localId];   // write the result back to global memory
}
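One host-side detail this example needs (buffer name assumed): a __local argument such as lValues is set by passing only its size in bytes and a NULL pointer; the device then allocates that much local memory for each work group.
size_t localWorkSize = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &valuesBuffer);            /* values  */
clSetKernelArg(kernel, 1, sizeof(cl_double) * localWorkSize, NULL);  /* lValues */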
There is so much that remains to say, but I think I've covered the basics.
Upvotes: 7