Reputation: 15
I have two input images, which I pass to the kernel from the host. Each image is 370x427.
I want to familiarize myself with local memory, so I also pass a local buffer for each image to the kernel and try to copy the global image into it.
I send each image as a 1D array. When I try to display the result, it does not work. My global work size is {width*height}, and I pass NULL as the local size to clEnqueueNDRangeKernel, assuming OpenCL would choose an appropriate size for local memory.
Below is my kernel code.
Could someone please give me a hint?
__kernel void computeSSD(__global unsigned char *imgL,
                         __global unsigned char *imgR,
                         __global unsigned char *result,
                         __global unsigned char *diff,
                         int width, int MAX_DISP, int WIN_SIZE,
                         __local unsigned char *localLeft,
                         __local unsigned char *localRight)
{
    int xCord = get_global_id(0);
    int yCord = get_local_id(0);
    // copy both images to local memory
    localRight[yCord] = imgR[yCord];
    localLeft[yCord] = imgL[yCord];
    barrier(CLK_LOCAL_MEM_FENCE);
    // do operation on local images
    result[xCord] = localRight[yCord];
}
Upvotes: 1
Views: 787
Reputation: 8410
If you are filtering an image with a 3x3 filter, every work-group needs its own pixels plus a 1-pixel margin on each side.
So your kernel can be something like:
__kernel void filter(...){
    int x_start = get_group_id(0)*get_local_size(0)-1;
    int y_start = get_group_id(1)*get_local_size(1)-1;
    int x_end = (get_group_id(0)+1)*get_local_size(0)+1;
    int y_end = (get_group_id(1)+1)*get_local_size(1)+1;
    __local mytype l[18][18]; // just an example for a 16x16 work-group!
    // Fetch
    // Normally a direct operation per work-item is preferred, since it is simpler
    // and the driver will pack all the memory accesses together.
    // So just follow coalesced access.
    // async_work_group_copy is an alternative, but its syntax is more complex.
    // Pseudocode: `global` stands for the input image in global memory.
    // Groups on the image border must also clamp i and j to the image bounds.
    for(int j=y_start+get_local_id(1); j<y_end; j+=get_local_size(1) ){
        for(int i=x_start+get_local_id(0); i<x_end; i+=get_local_size(0) ){
            l[j-y_start][i-x_start] = global[j][i];
        }
    }
    // Wait until the whole tile is in local memory before any work-item reads it
    barrier(CLK_LOCAL_MEM_FENCE);
    // Use the memory for your filtering! (remember to subtract y_start & x_start from your original indexes)
    //....
}
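To check that the strided fetch loops above really cover the whole 18x18 tile, here is a minimal CPU emulation in plain C (not OpenCL). It assumes a 16x16 work-group and works in tile-relative coordinates (j - y_start, i - x_start). The names emulate_fetch and fetch_covers_tile_once are hypothetical helpers for this sketch only.

```c
#include <string.h>

#define GROUP 16            /* assumed work-group edge, as in the 16x16 example */
#define TILE  (GROUP + 2)   /* group plus a 1-pixel margin on each side */

/* Emulates the fetch loops of one work-group on the CPU.
   hits[r][c] counts how many work-items loaded tile cell (r, c). */
void emulate_fetch(int hits[TILE][TILE]) {
    memset(hits, 0, sizeof(int) * TILE * TILE);
    for (int ly = 0; ly < GROUP; ++ly)              /* get_local_id(1) */
        for (int lx = 0; lx < GROUP; ++lx)          /* get_local_id(0) */
            for (int j = ly; j < TILE; j += GROUP)  /* stride = get_local_size(1) */
                for (int i = lx; i < TILE; i += GROUP)
                    hits[j][i] += 1;
}

/* Returns 1 if every tile cell was loaded exactly once, otherwise 0. */
int fetch_covers_tile_once(void) {
    int hits[TILE][TILE];
    emulate_fetch(hits);
    for (int r = 0; r < TILE; ++r)
        for (int c = 0; c < TILE; ++c)
            if (hits[r][c] != 1)
                return 0;
    return 1;
}
```

For this configuration fetch_covers_tile_once() should return 1: rows and columns 16 and 17 of the tile are picked up by the work-items with local id 0 and 1 on their second loop iteration.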
Local memory is not a matter of "yeah, let's copy everything to local, then use it so it will be faster". It is about copying just the region that the group needs and reusing those reads locally, which avoids lots of reads from global memory (e.g. with the 3x3 filter, each pixel would otherwise be read 9 times, 8 of them redundant, so global reads drop to roughly 1/9th).
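As a rough check of the 1/9th figure, the two read counts can be worked out per work-group. This is only a sketch in plain C: reads_direct and reads_tiled are hypothetical names, and the model ignores caches and image borders.

```c
#include <stddef.h>

/* Global reads when every work-item fetches its own (2r+1)x(2r+1) window. */
size_t reads_direct(size_t group_w, size_t group_h, size_t radius) {
    size_t window = 2 * radius + 1;
    return group_w * group_h * window * window;
}

/* Global reads when the group first stages its tile plus margin in local memory. */
size_t reads_tiled(size_t group_w, size_t group_h, size_t radius) {
    return (group_w + 2 * radius) * (group_h + 2 * radius);
}
```

For a 16x16 group and a 3x3 filter (radius 1) this gives 2304 direct reads against 324 tiled reads, about a 7x reduction. The ratio approaches 9x as the group grows and the margin matters less.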
Also, local memory may be slower if you are not really reusing reads between work items. If each item just reads and writes a single memory location, local memory will make the kernel slower, not faster.
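The kernel in the question is exactly this case: each work-item does one read and one write, so staging through local memory gains nothing. That kernel also indexes imgR and imgL with get_local_id(0), which means every work-group stages the same first pixels of the image. A small CPU emulation in plain C makes the effect visible. emulate_copy is a hypothetical helper, and the work-group size is passed in as an assumption, since the runtime chooses it when NULL is given as the local size.

```c
#include <stddef.h>

/* Emulates result[gid] = local[lid], where local[lid] was filled from img[src].
   index_by_global_id == 0 mimics the question's kernel (src = local id);
   index_by_global_id != 0 stages the pixel at the global id instead. */
void emulate_copy(const unsigned char *img, unsigned char *result,
                  size_t n, size_t group_size, int index_by_global_id) {
    for (size_t gid = 0; gid < n; ++gid) {
        size_t lid = gid % group_size;                /* get_local_id(0) */
        size_t src = index_by_global_id ? gid : lid;  /* global pixel that gets staged */
        result[gid] = img[src];
    }
}
```

With an 8-pixel image holding the values 0 to 7 and an assumed group size of 4, indexing by local id produces 0 1 2 3 0 1 2 3, while indexing the global buffers by global id reproduces the full image.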
Upvotes: 1