Reputation: 13396
Should be an easy one but my OpenCL skills are completely rusty. :)
I have a simple kernel that does the sum of two arrays:
__kernel void sum(__global float* a, __global float* b, __global float* c)
{
    __private size_t gid = get_global_id(0);
    c[gid] = log(sqrt(exp(cos(sin(a[gid]))))) + log(sqrt(exp(cos(sin(b[gid])))));
}
It's working fine.
Now I'm trying to use local memory hoping it could speed things up:
__kernel void sum_with_local_copy(__global float* a, __global float* b, __global float* c,
                                  __local float* tmpa, __local float* tmpb, __local float* tmpc)
{
    __private size_t gid = get_global_id(0);
    __private size_t lid = get_local_id(0);
    __private size_t grid = get_group_id(0);
    __private size_t lsz = get_local_size(0);

    event_t evta = async_work_group_copy(tmpa, a + grid * lsz, lsz, 0);
    wait_group_events(1, &evta);

    event_t evtb = async_work_group_copy(tmpb, b + grid * lsz, lsz, 0);
    wait_group_events(1, &evtb);

    tmpc[lid] = log(sqrt(exp(cos(sin(tmpa[lid]))))) + log(sqrt(exp(cos(sin(tmpb[lid])))));

    event_t evt = async_work_group_copy(c + grid * lsz, tmpc, lsz, 0);
    wait_group_events(1, &evt);
}
But there are two issues with this kernel:
- it's something like 3 times slower than the naive implementation
- the results are wrong starting at index 64
My local-size is the max workgroup size.
So my questions are:
1) Am I missing something obvious or is there really a subtlety?
2) How to use local memory to speed up the computation?
3) Should I loop inside the kernel so that each work-item does more than one operation?
Thanks in advance.
Upvotes: 1
Views: 1081
Reputation: 131986
Adding to what Kyle has written: local memory only helps when multiple work-items read from the same address. If it's just each work-item itself reading multiple times from the same address, then again local memory won't help you any; just use the work-item's private memory, i.e. variables you define within your kernel.
Also, some points not related to the use of local memory:
If you really did have some complex function-of-a-function-of-a-function, you might be better off using a Taylor series expansion. Note that log(sqrt(exp(y))) is just y/2, so each term reduces to cos(sin(x))/2, which expands to 1/2-x^2/4+(5 x^4)/48+O(x^6) (order 5).
The last term is an error term, which you can bound from above to choose the appropriate order for the expansion; the error term should not be that high for 'well-behaving' functions. The Taylor expansion calculation might even benefit from further parallelization (but then again, it might not).
Upvotes: 1
Reputation: 2828
As you probably know you can explicitly set the local work group size (LWS) when executing your kernel using:
clEnqueueNDRangeKernel( ... bunch of args include Local Work Size ...);
as discussed here. But as already mentioned by Kyle, you don't really have to do this, because OpenCL tries to pick the best value for the LWS when you pass in NULL for the LWS argument.
Indeed the specification says: "local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances."
I was curious to see how this played out in your case so I setup your calculation to verify the performance against the default value chosen by OpenCL on my device.
In case you're interested, I set up some arbitrary data:
int n = (int)powl(2, 20);  // 2^20 elements
float* a = (float*)malloc(sizeof(float)*n);
float* b = (float*)malloc(sizeof(float)*n);
float* results = (float*)malloc(sizeof(float)*n);
for (int i = 0; i < n; i++) {
    a[i] = (float)i;
    b[i] = (float)(n - i);
    results[i] = 0.f;
}
and then, after defining all of the other OpenCL structures, I varied lws = VALUE from 2 to 256 (the max allowed on my device for this kernel) in powers of 2, and measured the wall-clock time (note: one could also use OpenCL events):
struct timeval timer;
int trials = 100;
gettimeofday(&timer, NULL);
double t0 = timer.tv_sec + (timer.tv_usec / 1000000.0);
// ---------- Execution ----------
size_t global_work_size = n;
size_t lws[] = {VALUE};  // VALUE was varied from 2 to 256 in powers of 2.
for (int trial = 0; trial < trials; trial++) {
    clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, &global_work_size, lws, 0, NULL, NULL);
}
clFinish(cmd_queue);
gettimeofday(&timer, NULL);
double t1 = timer.tv_sec + (timer.tv_usec / 1000000.0);
double avgTime = (t1 - t0) / trials;
I then plotted the total execution time as a function of the LWS and, as expected, the performance varies by quite a bit until the best value, LWS = 256, is reached. For LWS > 256, the memory on my device is exceeded with this kernel.
FYI, for these tests I am running a laptop GPU: AMD ATI Radeon HD 6750M, with max compute units = 6 and CL_DEVICE_LOCAL_MEM_SIZE = 32768 (so no big screamer compared to other GPUs).
Here are the raw numbers:
LWS    time (sec)
  2      14.004
  4       6.850
  8       3.431
 16       1.722
 32       0.866
 64       0.438
128       0.436
256       0.436
Next, I checked the default value chosen by OpenCL (passing NULL for the LWS) and this corresponds to the best value that I found by profiling, i.e., LWS = 256.
So in the code you set up, you found one of the suboptimal cases, and as mentioned before, it's best to let OpenCL pick the best values for the local work groups, especially when there is no data shared in your kernel between multiple work-items in a work-group.
As to the error you got, you probably violated a constraint from the spec: the total number of work-items in the work-group must be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
Did you check that in detail by querying CL_DEVICE_MAX_WORK_GROUP_SIZE for your device?
Upvotes: 2
Reputation: 8036
Your simple kernel is already optimal w.r.t. work-group performance.
Local memory will only improve performance in cases where multiple work-items in a work-group read from the same address in local memory. As there is no shared data in your kernel, there is no gain to be had by transferring data from global to local memory; hence the slow-down.
As for point 3, you may see a gain by processing multiple values per thread (depending on how expensive your computation is and what hardware you have).
Upvotes: 3