Reputation: 51
I am trying to write a sorting function and a summation function in OpenCL/C++. Both functions work fine on small datasets, but neither works on a dataset of any notable length. The dataset I'm trying to use is about 2 million entries long, but the functions stop working reliably at around 500 entries. Any help on why this happens would be appreciated. OpenCL code below.
EDIT: Only the code fully relevant to the sum is now shown (as per request).
kernel void sum(global const double* A, global double* B) {
    int id = get_global_id(0);
    int N = get_global_size(0);
    B[id] = A[id];
    barrier(CLK_GLOBAL_MEM_FENCE);
    for (int i = 1; i < N/2; i *= 2) { // i is the stride
        if (!(id % (i * 2)) && ((id + i) < N))
            B[id] += B[id + i];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
And the C++ code:
std::vector<double> temps(100000, 1);
// Load functions
cl::Kernel kernel_sum = cl::Kernel(program, "sum");
// Set up variables
size_t elements = temps.size();
size_t size = temps.size() * sizeof(double);
size_t workgroup_size = 10;
size_t padding_size = elements % workgroup_size;
// Sum
if (padding_size) {
    std::vector<double> temps_padding(workgroup_size - padding_size, 0);
    temps.insert(temps.end(), temps_padding.begin(), temps_padding.end());
}
std::vector<double> temps_sum(elements);
size_t output_size = temps_sum.size() * sizeof(double);
cl::Buffer sum_buffer_1(context, CL_MEM_READ_ONLY, size);
cl::Buffer sum_buffer_2(context, CL_MEM_READ_WRITE, output_size);
queue.enqueueWriteBuffer(sum_buffer_1, CL_TRUE, 0, size, &temps[0]);
queue.enqueueFillBuffer(sum_buffer_2, 0, 0, output_size);
kernel_sum.setArg(0, sum_buffer_1);
kernel_sum.setArg(1, sum_buffer_2);
queue.enqueueNDRangeKernel(kernel_sum, cl::NullRange, cl::NDRange(elements), cl::NDRange(workgroup_size));
queue.enqueueReadBuffer(sum_buffer_2, CL_TRUE, 0, output_size, &temps_sum[0]);
double summed = temps_sum[0];
std::cout << "SUMMED: " << summed << std::endl;
I have tried looking around everywhere but I'm completely stuck.
Upvotes: 1
Views: 57
Reputation: 23438
You're trying to use barriers to synchronise across work groups, which won't work: a barrier only synchronises the work-items within a single work group.
Work groups don't run in a well-defined order relative to one another, so this sort of reduction algorithm only works within a single work group. You will probably need a second kernel pass to combine the results from the individual work groups, or do that part on the host CPU. (Alternatively, modify your algorithm to use atomics in some way.)
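To illustrate the two-stage idea, here is a minimal CPU-only sketch in plain C++. It is not OpenCL code: it only simulates the structure, with each "work group" reducing its own slice independently and the host adding up the per-group results afterwards. The names `reduce_group` and `two_pass_sum` are hypothetical and chosen for this example, and it assumes `group_size` is greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Simulates ONE work group (hypothetical helper, not an OpenCL call).
// Tree-reduces data[begin, begin + group_size) in a private scratch copy,
// which plays the role of local memory in a real kernel.
double reduce_group(const std::vector<double>& data,
                    std::size_t begin, std::size_t group_size) {
    std::vector<double> scratch(data.begin() + begin,
                                data.begin() + begin + group_size);
    for (std::size_t stride = 1; stride < group_size; stride *= 2) {
        // Each iteration of this inner loop corresponds to one active
        // work-item (local id `lid`) adding in its partner's value.
        for (std::size_t lid = 0; lid + stride < group_size; lid += 2 * stride) {
            scratch[lid] += scratch[lid + stride];
        }
        // In a kernel, a barrier would go here. It is only valid because
        // every participant belongs to the same work group.
    }
    return scratch[0];
}

// Two-stage sum: one partial result per work group, then a host-side pass.
// Assumes group_size > 0.
double two_pass_sum(std::vector<double> data, std::size_t group_size) {
    // Pad with zeros so the length is a multiple of the group size.
    std::size_t remainder = data.size() % group_size;
    if (remainder != 0) {
        data.resize(data.size() + (group_size - remainder), 0.0);
    }

    // Stage 1: groups are independent, so their order does not matter.
    std::vector<double> partial;
    for (std::size_t begin = 0; begin < data.size(); begin += group_size) {
        partial.push_back(reduce_group(data, begin, group_size));
    }

    // Stage 2: combine the per-group results (here on the host).
    double total = 0.0;
    for (double p : partial) {
        total += p;
    }
    return total;
}
```

Calling `two_pass_sum(std::vector<double>(100000, 1.0), 10)` mirrors the sizes used in the question. In a real OpenCL version, stage 1 would be a kernel that writes one value per work group (indexed by `get_group_id(0)`), and stage 2 would be either a second kernel launch or a short loop on the host over those values.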
Upvotes: 3