user1773603

OpenCL benchmark - advice about parameters to vary

I would like to perform a runtime benchmark of the two-stage sum reduction in OpenCL (from this AMD link) on a Radeon HD 7970 Tahiti XT.

Initially, I took a first version of the code in which I didn't use the first loop, the one that reduces an input array of size N to an output array of size NworkItems. Here is that first loop in the kernel code:

  int global_index = get_global_id(0);
  float accumulator = 0;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator += element;
    global_index += get_global_size(0);
  }

With this first version, I measured the runtime as a function of the input array size (which equals the total number of work items) for different work-group sizes. Here are the results:

[Plot: runtime vs. input array size, for several work-group sizes]

Now I would like to run a benchmark that uses the initial loop above, but I don't know which parameters I should vary.

From this link, AMD recommends a work-group size that is a multiple of 64 (32 for NVIDIA).

Moreover, the last comment on this other link recommends setting the work-group size as: WorkGroup size = (total number of work items) / (number of compute units). My GPU has 32 compute units.
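
As a sanity check of that rule, with illustrative numbers (not taken from my measurements):

  size_t compute_units    = 32;   // HD 7970 Tahiti XT
  size_t total_work_items = 8192; // an example benchmark value
  size_t work_group_size  = total_work_items / compute_units; // 8192 / 32 = 256
  // 256 is a multiple of 64 (the AMD wavefront size) and is typically also
  // CL_DEVICE_MAX_WORK_GROUP_SIZE on this hardware, so the rule can only
  // hold while total_work_items <= 32 * 256.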

So I would like advice on which parameters would be interesting to vary in order to compare runtimes in this second version (with the first reduction loop). For example, should I take different values for the ratio (input array size N) / (total NworkItems) with a fixed work-group size (see the expression above), or the contrary, i.e. vary the work-group size and fix that ratio?

Upvotes: 1

Views: 129

Answers (1)

DarkZeros

Reputation: 8410

You should sum local (contiguous) data instead of data spread across the whole array, to aid memory transfers (coalesced data access). So use this instead:

  int chunk_size = length/get_global_size(0) + (length%get_global_size(0) > 0); // How many items each work item needs to process (rounded up)
  int global_index = get_group_id(0)*get_local_size(0)*chunk_size + get_local_id(0); // Start address for this work item
  float accumulator = 0;

  // Loop sequentially over this work group's chunk of the input vector
  for (int i = 0; i < chunk_size; i++) {
    if (global_index < length) {
      float element = buffer[global_index];
      accumulator += element;
      global_index += get_local_size(0);
    }
  }
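
With this layout each work group walks its own contiguous chunk: on iteration i, the work items of group g read the consecutive addresses g*get_local_size(0)*chunk_size + i*get_local_size(0) + get_local_id(0), so each read by a wavefront is coalesced and stays inside the group's own block.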

Also, you should use sizes that are powers of two, to help caching.
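
For example (an illustrative sweep, not a prescription): work-group sizes of 64, 128 and 256 with input sizes of 2^20, 2^22 and 2^24 elements keep everything a power of two and also satisfy the multiple-of-64 rule you quoted.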

Upvotes: 2
