polarisfox64

Reputation: 51

OpenCL code not working on larger datasets

I am trying to write a sorting function and a summation function in OpenCL/C++. Both functions work fine on small datasets, but neither works on a dataset of any notable length. The dataset I'm trying to use is about 2 million entries long, but the functions stop working consistently at about 500 entries. Any help with why this happens would be appreciated. OpenCL code below.

EDIT: Only the code fully relevant to the sum is now shown (as per request).

kernel void sum(global const double* A, global double* B) {
    int id = get_global_id(0);
    int N = get_global_size(0);

    B[id] = A[id];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for (int i = 1; i < N/2; i *= 2) { //i is a stride
        if (!(id % (i * 2)) && ((id + i) < N)) 
            B[id] += B[id + i];

        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

And the C++ code:

        std::vector<double> temps(100000, 1);

        // Load functions
        cl::Kernel kernel_sum = cl::Kernel(program, "sum");

        // Set up variables
        size_t elements = temps.size();
        size_t size = temps.size() * sizeof(double);
        
        size_t workgroup_size = 10;
        size_t padding_size = elements % workgroup_size;

        // Sum
        if (padding_size) {
            std::vector<double> temps_padding(workgroup_size - padding_size, 0);
            temps.insert(temps.end(), temps_padding.begin(), temps_padding.end());
        }

        std::vector<double> temps_sum(elements);
        size_t output_size = temps_sum.size() * sizeof(double);

        cl::Buffer sum_buffer_1(context, CL_MEM_READ_ONLY, size);
        cl::Buffer sum_buffer_2(context, CL_MEM_READ_WRITE, output_size);

        queue.enqueueWriteBuffer(sum_buffer_1, CL_TRUE, 0, size, &temps[0]);
        queue.enqueueFillBuffer(sum_buffer_2, 0, 0, output_size);

        kernel_sum.setArg(0, sum_buffer_1);
        kernel_sum.setArg(1, sum_buffer_2);
        
        queue.enqueueNDRangeKernel(kernel_sum, cl::NullRange, cl::NDRange(elements), cl::NDRange(workgroup_size));
        queue.enqueueReadBuffer(sum_buffer_2, CL_TRUE, 0, output_size, &temps_sum[0]);
        double summed = temps_sum[0];

        std::cout << "SUMMED: " << summed << std::endl;

I have tried looking around everywhere but I'm completely stuck.

Upvotes: 1

Views: 57

Answers (1)

pmdj

Reputation: 23438

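You're trying to use barriers for synchronisation across work groups. This won't work: barrier() only synchronises the work items within a single work group, so work items in other work groups are not guaranteed to have written their values to B by the time you read them.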

Work groups don't run in a well-defined order relative to one another, so you can only use this sort of reduction algorithm within a single work group. You will probably need a second kernel pass to combine the results from the individual work groups, or you can do that part on the host CPU. (Alternatively, modify your algorithm to use atomics in some way.)
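To make the two-pass idea concrete, here is a minimal CPU-side sketch in plain C++ rather than a tested OpenCL kernel. The function names reduce_per_group and two_pass_sum are hypothetical and only for illustration. Each iteration of the outer loop stands in for one work group. In a real kernel the scratch array would be local memory, every step of the tree reduction would be followed by barrier(CLK_LOCAL_MEM_FENCE), and each group would write its result to a partial-sum buffer at index get_group_id(0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1, modelled on the CPU: each iteration of the outer loop stands in for
// one work group and reduces only its own slice of the input.
// (Hypothetical helper, not part of any OpenCL API.)
std::vector<double> reduce_per_group(const std::vector<double>& input,
                                     std::size_t group_size) {
    assert(group_size > 0);
    const std::size_t num_groups = (input.size() + group_size - 1) / group_size;
    std::vector<double> partial(num_groups, 0.0);

    for (std::size_t group = 0; group < num_groups; ++group) {
        // Stand-in for the group's local scratch memory; slots past the end
        // of the input stay zero, which acts as padding.
        std::vector<double> scratch(group_size, 0.0);
        for (std::size_t lid = 0; lid < group_size; ++lid) {
            const std::size_t gid = group * group_size + lid;
            if (gid < input.size()) scratch[lid] = input[gid];
        }

        // Tree reduction inside the group. The stride runs all the way up to
        // group_size; in a kernel, a work-group barrier would follow each step.
        for (std::size_t stride = 1; stride < group_size; stride *= 2) {
            for (std::size_t lid = 0; lid + stride < group_size; lid += 2 * stride) {
                scratch[lid] += scratch[lid + stride];
            }
        }
        partial[group] = scratch[0];  // one partial sum per work group
    }
    return partial;
}

// Pass 2: combine the per-group partial sums. This is done on the host here;
// a second kernel launch over the partial sums would play the same role.
double two_pass_sum(const std::vector<double>& input, std::size_t group_size) {
    const std::vector<double> partial = reduce_per_group(input, group_size);
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

With the sizes from the question (100000 elements, work group size 10), pass 1 produces 10000 partial sums, and pass 2 adds them up. No step ever relies on one work group seeing another group's intermediate writes, which is the property the original kernel lacks.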

Upvotes: 3
