Reputation: 817
I work in a kernel with a large array of unsigned characters. I create a memory object with clCreateBuffer, then copy a chunk of unsigned chars into it with clEnqueueWriteBuffer. Then, in a loop, I call a kernel that reads from this memory object, does some computation, and writes new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside this loop). Here is the kernel code:
__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;

    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[offset + i];

    // Make some computation here, not complicated logic

    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[offset + i] = msg[i];
}
When the loop is done (it runs about 1000 times), I read the result from the memory object with clEnqueueReadBuffer.
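For reference, the host-side sequence described above looks roughly like this (a sketch with illustrative variable names; error checking and setup omitted):

```c
// Sketch of the host-side flow (names such as ctx, queue, kernel,
// host_data, size, global_size, local_size are illustrative).
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &offset);

for (int iter = 0; iter < 1000; iter++) {
    // No read/write between iterations; the data stays on the device.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);
}

clFinish(queue);
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);
```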
Is it possible to optimize this code?
Upvotes: 2
Views: 3413
Reputation: 9886
Some suggestions: compute the per-group offset once by adding

in += get_group_id(0) * offset;

at the beginning of the kernel.
Upvotes: 2
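Applied to the kernel in the question, that adjustment folds the per-group offset into the base pointer once, instead of recomputing it in every array index (a sketch):

```c
__kernel void test(__global unsigned char *in, unsigned int offset) {
    // Advance the base pointer once per work-item.
    in += get_group_id(0) * offset;

    unsigned char msg[1024];
    size_t i;

    for (i = 0; i < 1024; i++)
        msg[i] = in[i];

    // ... computation ...

    for (i = 0; i < 1024; i++)
        in[i] = msg[i];
}
```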
Reputation: 2401
To give concrete optimization advice, one would need to know what kind of calculations you do. The biggest performance gains usually come from grouping your calculations into work-groups and letting them operate on local memory. You need to pay close attention to the sizes of your private memory (smallest) and local memory (small).
How often is your kernel called? Do all kernels use the same data? One could use a local memory buffer where all work-items in a work-group each read a part of the data into local memory and then share it. You need to pay a little attention to synchronization.
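A cooperative load into local memory might look like the following sketch. It assumes the host passes a __local buffer of at least 1024 bytes via clSetKernelArg; the barriers are the synchronization mentioned above, ensuring every work-item finishes loading before any of them reads the shared data:

```c
__kernel void test_local(__global unsigned char *in,
                         unsigned int offset,
                         __local unsigned char *tile) {
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * offset;

    // Each work-item loads a strided slice of the group's 1024-byte chunk.
    for (size_t i = lid; i < 1024; i += lsz)
        tile[i] = in[base + i];

    // All loads must complete before anyone reads tile[].
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... computation on tile[], shared by the whole work-group ...

    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t i = lid; i < 1024; i += lsz)
        in[base + i] = tile[i];
}
```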
I suggest having a look at the samples from your SDK vendor. I only know the nVidia SDK; the samples there are quite complex, but very interesting to read.
The change to vector types like float4 should suit ATI boards. nVidia is said to work best with scalars and internal compiler optimization. This is something for fine-tuning later with a profiler. You can gain orders of magnitude in performance through memory optimization.
Upvotes: 0
Reputation: 915
Something that comes to mind first: unrolling your loop can help you skip the condition evaluation on every iteration. An unroll pragma can make this easier.
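On compilers that honor it (it is a vendor hint, not part of the core OpenCL 1.x specification), the copy loop from the question could be annotated like this:

```c
// Hint the compiler to unroll the fixed-trip-count copy loop.
#pragma unroll 8
for (i = 0; i < 1024; i++)
    msg[i] = in[offset + i];
```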
Using shared memory on nVidia chips could also help greatly (if your local memory is not already mapped to shared memory by default).
Upvotes: 0
Reputation: 8036
You could try the vector version (uchar8 instead of uchar), but the compiler may already optimize it this way anyway. Most importantly, profile your code all the time and experiment.
edit
Seems even uchar16 is supported now: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/vectorDataTypes.html
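Assuming each group's chunk is 16-byte aligned (offset a multiple of 16), the copy loops could use vload16/vstore16 to move 16 bytes per iteration (a sketch):

```c
__kernel void test_vec(__global unsigned char *in, unsigned int offset) {
    size_t base = get_group_id(0) * (size_t)offset;
    uchar16 msg[64];  // 64 * 16 = 1024 bytes of private memory

    // Copy from global to private memory, 16 bytes at a time.
    // vload16(i, p) reads from p + i * 16.
    for (size_t i = 0; i < 64; i++)
        msg[i] = vload16(i, in + base);

    // ... computation ...

    // Copy back, 16 bytes at a time.
    for (size_t i = 0; i < 64; i++)
        vstore16(msg[i], i, in + base);
}
```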
Upvotes: 1