Carmellose

Reputation: 5088

OpenCL: choosing the optimal global_work_size value

Say I want to add two vectors of 100 million elements each. How should global_work_size best be chosen?

Obviously, my GPU card is far from having 100 million work-items, or even processing elements, available at the same time.

More specifically, should global_work_size generally be equal to the number of data elements to be processed in parallel?

Thanks

Upvotes: 0

Views: 400

Answers (2)

Tim

Reputation: 2796

Unless I had different devices to split the work across, I would just enqueue the entire linear range all at once. The OpenCL driver will serially work through chunks that fit concurrently in hardware more efficiently than we can, and there's a chance it will do so in a way that gets the best memory access behavior too.

Regarding a 2D vs. a 1D range: I would not change my problem to 2D just to enqueue a smaller range. However, if your problem is inherently two-dimensional (e.g. a 2D image or a 2D matrix), then it makes sense to represent it that way.

EDIT: mfa's point about looping within the kernel is a very valid way to compress a bigger range into a smaller one. This can also be very useful if your kernel is very short (just a few statements), since it reduces the startup cost of all those hardware threads.

Upvotes: 1

mfa

Reputation: 5087

In your case, you should focus on choosing an optimal work-group size for your device and have each work item process multiple elements. It is usually simpler to have the global number of work items equal the number of elements to be processed, but your problem is very large for a GPU, especially for low-end GPUs.

I doubt a 2D range will help you since your work is 1-dimensional by nature.

For example, use a group size of 256, where each work item processes 256 elements. A group would then be responsible for 65536 elements, for a total of 1526 groups consisting of 390656 individual work items.

The simple kernel from my answer to your other question will work fine. n is still your total number of elements; the local work-item id and the work-group id are unused.

__kernel void vecAdd(__global double *a, __global double *b,
                     __global double *c, const unsigned int n)
{
  // Note: double requires the cl_khr_fp64 extension on some devices.
  // Get our global thread id and global size.
  int gid = get_global_id(0);
  int gsize = get_global_size(0);

  // The for-loop condition performs the bounds check against n.
  for(int i = gid; i < n; i += gsize){
    c[i] = a[i] + b[i];
  }
}

The kernel below processes a block of 65536 elements with each work group. This has the advantage of better coalescing of global reads, since work items touch adjacent memory addresses more often than in the previous kernel. You should experiment with ELEMENTS_PER_GROUP, as well as your work-group size, to find the optimal values for your hardware.

#define ELEMENTS_PER_GROUP 65536 //max # of elements to be processed by a group

__kernel void vecAddLocalized(__global double *a, __global double *b,
                              __global double *c, const unsigned int n)
{
  int wgId = get_group_id(0);
  int wgSize = get_local_size(0);
  int itemId = get_local_id(0);

  // Each group handles a contiguous block of ELEMENTS_PER_GROUP elements.
  int startIndex = wgId * ELEMENTS_PER_GROUP;
  int endIndex = startIndex + ELEMENTS_PER_GROUP;
  if(endIndex > n){
    endIndex = n;  // the last group may get a partial block
  }

  for(int i = startIndex + itemId; i < endIndex; i += wgSize){
    c[i] = a[i] + b[i];
  }
}

Upvotes: 2
