Reputation: 5088
Say I want to add two vectors of 100 million elements each. How should global_work_size best be chosen? Should it be global_work_size = 100e6, or 10e6 x 10e6, or something else?
Obviously, my GPU card is far from having 100 million work-items, or even processing elements, available at the same time.
More specifically, should global_work_size generally be equal to the number of data elements to be processed in parallel?
Thanks
Upvotes: 0
Views: 400
Reputation: 2796
Unless I had multiple devices I could split the work across, I would just enqueue the entire linear range all at once. The OpenCL driver will serially work through chunks that fit concurrently in hardware more efficiently than we could by hand. There is also a chance it will do so in a way that gets the best memory access behavior.
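For illustration, here is a minimal host-side sketch of that approach (assuming queue and kernel are an already-created command queue and kernel, with the buffers already set as kernel arguments):
#include <CL/cl.h>

// Enqueue all 100e6 work items in one call; passing NULL for
// local_work_size lets the driver choose the work-group size.
size_t global_work_size = 100000000;
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,    // 1D range
                                    NULL, // no global offset
                                    &global_work_size,
                                    NULL, // driver picks the local size
                                    0, NULL, NULL);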
Regarding choosing a 2D vs. 1D range: I would not change my problem to be 2D just to enqueue a smaller range. However, if your problem is inherently two-dimensional (e.g. a 2D image or a 2D matrix), then it makes sense to represent it that way.
EDIT: mfa's points about looping within the kernel are a very valid way to compress a bigger range into a smaller one. This can also be very useful if your kernel is very short (just a few statements), since it reduces the startup cost of all those hardware threads.
Upvotes: 1
Reputation: 5087
In your case, you should focus on choosing an optimal work-group size for your device and have each work item process multiple elements. It is usually simpler to have the global number of work items equal the number of elements to be processed, but your problem is very large for a GPU, especially for low-end GPUs.
I doubt a 2D range will help you since your work is 1-dimensional by nature.
For example, use a group size of 256, where each work item processes 256 elements. A group would then be responsible for 65536 elements. The total would be 1526 groups consisting of 390656 individual work items.
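The arithmetic translates directly into host code; a quick sketch using the numbers above (n, the group size, and elements-per-item are this answer's assumptions):
const size_t n = 100000000;          // total number of elements
size_t local_size = 256;             // work items per group
size_t elems_per_group = 256 * 256;  // 65536 elements handled per group
size_t num_groups = (n + elems_per_group - 1) / elems_per_group; // 1526, rounded up
size_t global_size = num_groups * local_size;                    // 390656 work items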
The simple kernel from my answer to your other question will work fine here. n is still your total number of elements. Note that the local work-item id and the work-group id are both unused.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // double support is an extension on some devices

__kernel void vecAdd( __global double *a, __global double *b, __global double *c, const unsigned int n)
{
    // Get our global thread ID and the total global size
    int gid = get_global_id(0);
    int gsize = get_global_size(0);

    // The bounds check against n is the for-loop condition;
    // each work item strides through the array by the global size
    for(int i = gid; i < n; i += gsize){
        c[i] = a[i] + b[i];
    }
}
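A possible host-side launch for this kernel, using the sizes computed above (d_a, d_b, d_c, queue, and kernel are placeholder handles):
cl_uint n_arg = 100000000;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
clSetKernelArg(kernel, 3, sizeof(cl_uint), &n_arg);

size_t global_size = 390656; // 1526 groups of 256 work items
size_t local_size = 256;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);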
The kernel below will process a block of 65536 elements with each work group. This has the advantage that a group's global reads hit adjacent memory addresses more often than in the previous kernel. You should play around with ELEMENTS_PER_GROUP, as well as your work-group size, to find the optimal values for your hardware.
#define ELEMENTS_PER_GROUP 65536 // max number of elements to be processed by a group

#pragma OPENCL EXTENSION cl_khr_fp64 : enable // double support is an extension on some devices

__kernel void vecAddLocalized( __global double *a, __global double *b, __global double *c, const unsigned int n)
{
    int wgId = get_group_id(0);
    int wgSize = get_local_size(0);
    int itemId = get_local_id(0);

    // Each group owns one contiguous block of ELEMENTS_PER_GROUP elements
    int startIndex = wgId * ELEMENTS_PER_GROUP;
    int endIndex = startIndex + ELEMENTS_PER_GROUP;
    if(endIndex > n){
        endIndex = n; // clamp the last group's block to the array length
    }

    // Work items stride through the group's block together
    for(int i = startIndex + itemId; i < endIndex; i += wgSize){
        c[i] = a[i] + b[i];
    }
}
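Launching this kernel is similar, except the number of groups now comes from dividing the element count by ELEMENTS_PER_GROUP, rounded up (again just a sketch with placeholder handles):
const size_t n = 100000000;
const size_t elements_per_group = 65536; // must match the kernel's #define
size_t local_size = 256;
size_t num_groups = (n + elements_per_group - 1) / elements_per_group;
size_t global_size = num_groups * local_size;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);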
Upvotes: 2