otisonoza

Reputation: 1344

Understanding work-items and work-groups

Based on my previous question:

I'm still trying to copy an image (no practical reason, just to start with an easy one):

The image contains 200 * 300 == 60000 pixels.

The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.

kernel1:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
            "}";

queue:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();

This gives a segfault. What's wrong?

With cl::NDRange(20000) as the last parameter it doesn't, but it copies back only part of the image.

I also don't understand why I can't use this kernel:

kernel2:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

Looking at slide 31 of this presentation:

Why can't I just simply use the global_id?

EDIT1

Platform: AMD Accelerated Parallel Processing

Device: AMD Athlon(tm) II P320 Dual-Core Processor

EDIT2

The result based on huseyin tugrul buyukisik's answer:

(screenshot of the result)

EDIT3

With the last parameter cl::NDRange(20000):

(screenshot of the partially copied image)

The kernel is the first one in both cases.

EDIT4

std::string kernelCode =
                "void kernel copy(global const int* image, global int* result)"
                "{"
                    "result[get_global_id(0)] = image[get_global_id(0)];"
                "}";
//...
cl_int err;
    err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));

    if (err == 0)
        qDebug() << "success";
    else
    {
        qDebug() << err;
        exit(1);
    }

Prints success.

Maybe this is wrong?

int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);

The culprit was:

cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);

I used size instead of sizeof(int) * size.

Upvotes: 1

Views: 2185

Answers (1)

huseyin tugrul buyukisik

Reputation: 11920

Edit 2:

Please try a non-constant memory specifier (maybe const is not compatible with your CPU):

std::string kernelCode =
            "__kernel void copy(__global int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

You may also need to change the buffer options.

Edit:

You have forgotten the '__' prefixes before the 'global' and 'kernel' specifiers, so please try:

std::string kernelCode =
            "__kernel void copy(__global const int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

There are 60000 elements in total, but you are launching offset + 60000 work-items, which overflows the buffer and reads/writes unprivileged areas.

The usual NDRange usage for the OpenCL 1.2 C++ bindings is:

cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);

Then check err for the real error code you seek. 0 (CL_SUCCESS) means success.

If you want to divide the work into N smaller parts, you should cap the range of each part at 60000/N.

If you divide it into 30 parts, then:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();

Also double-check the size of each buffer, e.g. sizeof(cl_int) * arrElementNumber,

because the size of a host integer may not be the same as the device integer. You need 60000 elements? Then you need to pass 240000 bytes as the size when creating the buffer (assuming 4-byte integers).

For compatibility, you should check the size of an integer before creating buffers if you intend to run this code on another machine.

You may already know this, but I'll say it anyway:

CL_DEVICE_MAX_WORK_GROUP_SIZE

is the number of threads that can share local/shared memory in a compute unit. You don't need to divide your work just for this. OpenCL does this automatically: it gives a unique global id to each thread across the whole work, and a unique local id to each thread within a compute unit. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then it can create up to 4100 threads that share the same local variables in a compute unit. You can compute all 60000 elements in a single sweep: multiple work-groups are created for this, and each group has a group id.

  // this should work without a problem
  queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));

If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from the AMD site and choose 'system info' from the drop-down menu to look at the relevant numbers.

Which device is yours? I couldn't find any device with a max work-group size of 4100! My CPU has 1024, my GPU has 256. Is it a Xeon Phi?

For example, the total number of work-items here can be as big as 256 * 256 times the work-group size. (CodeXL system-info screenshot)

CodeXL has other nice features too, such as performance profiling and code tracing, if you need maximum performance or help with bug-fixing.

Upvotes: 2
