anthonyvd
anthonyvd

Reputation: 7590

OpenCL program works on the CPU but outputs empty image when run on the GPU

I have the following OpenCL Kernel, a Gaussian Blur

__constant sampler_t sampler =
        CLK_NORMALIZED_COORDS_FALSE |
        CLK_ADDRESS_CLAMP_TO_EDGE |
        CLK_FILTER_NEAREST;

__constant float gaussian_kernel[3][3] = {
    {0.0625f, 0.125f, 0.0625f},
    {0.125f, 0.25f, 0.125f},
    {0.0625f, 0.125f, 0.0625f} };

void kernel gaussian_blur(
    read_only image2d_t input_image,
    write_only image2d_t output_image) {

    int x = get_global_id(0);
    int y = get_global_id(1);

    int2 coords[9] = {
        { x - 1, y - 1 }, { x, y - 1 }, { x + 1, y - 1 },
        { x - 1, y     }, { x, y     }, { x + 1, y     },
        { x - 1, y + 1 }, { x, y + 1 }, { x + 1, y + 1 }
    };

    float4 pixel_value = { 0.f, 0.f, 0.f, 0.f };

    for(int i = 0; i < 3; ++i) {
        for(int j = 0; j < 3; ++j) {
            int index = i * 3 + j;
            float4 blurred =
                as_float4(read_imageui(input_image, sampler, coords[index]));

            pixel_value.x += (blurred.x * gaussian_kernel[i][j]);
            pixel_value.y += (blurred.y * gaussian_kernel[i][j]);
            pixel_value.z += (blurred.z * gaussian_kernel[i][j]);
            pixel_value.w += (blurred.w * gaussian_kernel[i][j]);
        }
    }

    uint4 final_value = as_uint4(pixel_value);

    write_imageui(output_image, coords[4], final_value);
}

When I specify the device to use as the CPU, The blur works properly. Here's the device selection code

std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);

if(all_platforms.size() == 0) {
    std::cerr << "No platforms available" <<std::endl;
    exit(-1);
}

cl::Platform default_platform = all_platforms[0];

std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);

if(all_devices.size() == 0) {
    std::cerr << "No device found" << std::endl;
    exit(-1);
}

cl::Device default_device = all_devices[1]; //Changing this index to 0 uses my graphics card

Now, if the default_device is set to the GPU, only an empty image is output by the program. The relevant image setup code is (note that input is a Magick::Image and in_pixels a heap allocated array of unsigned short):

cl::ImageFormat format(CL_RGBA, CL_UNSIGNED_INT16);

cl::Image2D input_image_buffer;
input.write(0, 0, 
    input.baseColumns(), input.baseRows(), "RGBA", Magick::ShortPixel, in_pixels);

input_image_buffer = cl::Image2D(
    context, 
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    format,
    input.baseColumns(),
    input.baseRows(),
    0,
    in_pixels,
    &cl_error);

cl::Image2D output_image_buffer;
output_image_buffer = cl::Image2D(
    context, 
    CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
    format, 
    input.baseColumns(),
    input.baseRows(),
    0,
    out_pixels,
    &cl_error);

And the Kernel setup/Image output code (gaussian_program was built without error of course)

cl::Kernel gaussian_kernel(gaussian_program, "gaussian_blur");

cl::CommandQueue queue(context, default_device, 0, &cl_error);

cl::size_t<3> origin;
cl::size_t<3> size;
origin[0] = 0;
origin[1] = 0;
origin[2] = 0;

size[0] = input.baseColumns();
size[1] = input.baseRows();
size[2] = 1;

cl_error = gaussian_kernel.setArg(0, input_image_buffer);

cl_error = gaussian_kernel.setArg(1, output_image_buffer);

cl::NDRange range(input.baseColumns(), input.baseRows());

cl_error = queue.enqueueNDRangeKernel(
    gaussian_kernel, 
    cl::NullRange, 
    range,
    cl::NullRange);

queue.finish();

try{
    output.read(
        input.baseColumns(), 
        input.baseRows(), 
        "RGBA", Magick::ShortPixel, out_pixels);
}
catch(Magick::Exception& ex) {
    std::cerr << "A Magick error occured while writing the pixel cache: " <<
        std::endl << ex.what() << std::endl;
    return false;
}

Now, I removed a lot of error checking for the purpose of this example, but the original code checks cl_error after every OpenCL call and never signals an error. The code executes as expected on the CPU but the image is empty when the code is executed on the GPU.

I suspected a synchronization issue at first (the queue.finish() call is required for that precise purpose, even on the CPU) but littering the code with cl::finish() or queue.finish() calls in an attempt to serialize the execution didn't help at all.

Is there something I'm clearly doing wrong? Is there a potential reason for this OpenCL kernel to fail on the GPU but not on the CPU?

For the record, I'm on Ubuntu 13.04 using the AMD APP SDK OpenCL implementation with a Radeon HD 7970.

Upvotes: 3

Views: 1764

Answers (2)

Eric Bainville
Eric Bainville

Reputation: 9886

As noted in my comment, as_float4 is not a conversion. It takes the 32 bits of the uint, and interprets them as float bits. In your case, you are reading 16 bits, so the float value will be a extremely small (exponent will be 0). Use convert_float4 instead.

The answer about not reading back the data is correct too. You need calls to clEnqueueReadBuffer or clEnqueueMapBuffer to make sure the data is read back from the device.

Upvotes: 2

Michael Haidl
Michael Haidl

Reputation: 5482

Well, where do you read the image back from your GPU?

You allocate your input image with the flag CL_MEM_COPY_HOST_PTR

This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to allocate memory for the memory object and copy the data from memory referenced by host_ptr.

and your output image with CL_MEM_USE_HOST_PTR

This flag is valid only if host_ptr is not NULL. If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.

OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.

The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.

there is nothing wrong with your allocation, however you never tell the OpenCL implementation to write the memory to the used device and to read that memory back into hosts main memory. This may work for CPUs because the memory is already in their device memory (the main memory) but fails for GPUs.

The OpenCL c++ binding offers cl::enqueueWriteImage(/*params*/); and cl::enqueueReadImage(/*params*/); to write and read image buffers to/from a device.

Upvotes: 6

Related Questions