I would like to perform a runtime benchmark of the two-stage sum reduction with OpenCL (from this AMD link) on a Radeon HD 7970 (Tahiti XT).
Initially, I took a first version of the code in which I didn't use the first loop, the one that reduces an input array of size N to an output array of size NworkItems (so in that version, N = NworkItems). Here is that first loop from the kernel code:
int global_index = get_global_id(0);
float accumulator = 0;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator += element;
global_index += get_global_size(0);
}
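For reference, here is a minimal sketch of how this strided loop fits into the full two-stage kernel from the AMD article (a sketch from memory; the kernel name, argument names, and the scratch buffer are my own, not taken from the article):

__kernel void sum_reduce(__global const float* buffer,
                         __local float* scratch,
                         const int length,
                         __global float* partial_sums)
{
    // Stage 1: each work item sequentially accumulates a strided slice of the input
    int global_index = get_global_id(0);
    float accumulator = 0;
    while (global_index < length) {
        accumulator += buffer[global_index];
        global_index += get_global_size(0);
    }
    // Stage 2: tree reduction in local memory (assumes the work-group size is a power of two)
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset)
            scratch[local_index] += scratch[local_index + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // One partial sum per work group; these are summed on the host or by a second kernel pass
    if (local_index == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}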
So with this first version, I measured the runtime as a function of the input array size (which is equal to the total number of threads) and for different work-group sizes. Here are the results:
Now, I would like to run a benchmark where I do use this initial loop above. But I don't know which parameters I have to vary.
From this link, AMD is said to recommend a multiple of 64 for the work-group size (32 for NVIDIA).
Moreover, from the last comment on this other link, it is recommended to set the work-group size as: WorkGroup size = (Number of total threads) / (Compute Units).
On my GPU card, I have 32 compute units.
So I would like some advice on which parameters would be interesting to vary in order to compare runtimes in this second version (with the first reduction loop). For example, should I take different values for the ratio (N size of input array) / (total NworkItems) with a fixed WorkGroup size (see the expression above), or, on the contrary, vary the WorkGroup size and fix that ratio?
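For concreteness, here is a rough host-side sketch of the two knobs in question (it assumes queue, kernel and N already exist in my setup; the values 256 and 8 are just examples):

size_t local_size  = 256;                 // work-group size (a multiple of 64 on AMD)
size_t ratio       = 8;                   // (N size of input array) / (total NworkItems)
size_t global_size = N / ratio;           // total number of work items
global_size = (global_size / local_size) * local_size; // round down to a multiple of local_size
int n = (int)N;                           // the kernel's length argument
clSetKernelArg(kernel, 2, sizeof(int), &n);
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size, 0, NULL, NULL);

Varying ratio changes how much sequential work each work item does, while varying local_size changes how the same total work is grouped onto the compute units.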
Upvotes: 1
Views: 129
Reputation: 8410
You should sum data that sits close together in memory instead of spread-out data, to aid memory transfer (coalesced data access). So use this instead:
int chunk_size = length/get_global_size(0) + (length%get_global_size(0) > 0); // ceiling division: how many items each work item needs to process
int global_index = get_group_id(0)*get_local_size(0)*chunk_size + get_local_id(0); // start address for this work item
float accumulator = 0;
// Loop sequentially over this work group's chunk of the input vector
for (int i = 0; i < chunk_size; i++) {
    if (global_index < length) {
        float element = buffer[global_index];
        accumulator += element;
        global_index += get_local_size(0);
    }
}
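For example, with length = 1000, a work-group size of 64 and a global size of 256 work items, chunk_size = 1000/256 + (1000 % 256 > 0) = 3 + 1 = 4. Work item 0 of group 0 then reads indices 0, 64, 128 and 192, work item 1 reads 1, 65, 129 and 193, and group 1 starts at 1*64*4 = 256, so each group sweeps its own contiguous block of the buffer.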
Also, you should use sizes that are powers of two, to help caching.
Upvotes: 2