Reputation: 10896
I am writing an OpenCL kernel that convolves an image with a 5x5 Gaussian filter and was wondering what the best practice is for storing the filter constants. In the kernel, each thread in a 32x32 workgroup loads its pixel into a __local memory buffer, hits a barrier(CLK_LOCAL_MEM_FENCE), and then convolves. Here are the buffers for the local image data and the filter:
__local float4 localRegion[32][32]; // image region w 2 pixel apron
....
static const float filter[5][5] = { // __constant vs __private ??
{1/256.0, 4/256.0, 6/256.0, 4/256.0, 1/256.0},
{4/256.0, 16/256.0, 24/256.0, 16/256.0, 4/256.0},
{6/256.0, 24/256.0, 36/256.0, 24/256.0, 6/256.0},
{4/256.0, 16/256.0, 24/256.0, 16/256.0, 4/256.0},
{1/256.0, 4/256.0, 6/256.0, 4/256.0, 1/256.0}
};
What memory regions can hold filter, which is best, and how does initialization occur in each case? Optimally __private would be best, but I am not sure you can statically initialize a private array? __local doesn't make sense unless some of the threads are responsible for loading the filter entries (I think)? Also, according to the Khronos docs Sec 6.5, I am not sure static and __private can go together.
According to the answers here and here, filter can be stored as __private, but it's not clear how initialization happens.
Upvotes: 3
Views: 2685
Reputation: 11916
but I am not sure you can statically initialize a private array
The OpenCL spec says: "The static storage-class specifier can only be used for non-kernel functions, global variables declared in program scope and variables inside a function declared in the global or constant address space." On top of this, the compiler (at least AMD's) optimizes constant math away and replaces it with simple constant/instruction memory accesses. Beyond even that, when there is not enough space, private registers spill to global memory and the kernel starts accessing global memory instead. So static can't have a meaningful description when the real data has sometimes moved somewhere else.
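In other words, here is a minimal sketch of what the quoted rule allows and forbids (the identifiers are mine, for illustration only):
static __constant float k = 0.5f;  // legal: program scope, constant address space
__kernel void demo(__global float *out)
{
    // static float bad = 0.5f;    // illegal: function scope, (implicit) private address space
    float ok = k;                  // a plain private copy is fine
    out[get_global_id(0)] = ok;
}
As for the constant-folding point, this: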
float filter[5][5] = {
{cos(sin(cos(sin(cos(sin(1/256.0f)))))), 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f},
{cos(sin(cos(sin(cos(sin(4/256.0f)))))), 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{sin(cos(sin(cos(sin(cos(6/256.0f)))))), 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
{sin(cos(sin(cos(sin(cos(4/256.0f)))))), 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{sin(cos(sin(cos(sin(cos(1/256.0f)))))), 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f}
};
takes the same time (0.78 ms on an R7 240 GPU) as
float filter[5][5] = {
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{6/256.0f, 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f}
};
and the profiler's ISA output doesn't contain any sine or cosine instructions; there are just some numbers written to some memory locations. This is the situation without any optimizations enabled.
What memory regions can hold filter, which is best
That depends on the hardware, but there is usually more than one option:
// defined before kernel
__constant float filter[5][5] = {
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{6/256.0f, 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f}
};
this executed in the same time on an R7 240 GPU. Note that static indexing is better for __constant memory access (at least on AMD GPUs), and it is not bad with same-index accesses (all threads in a group access the same index, just like in the nested loops in this example). Constant memory is faster than global memory with these addressing patterns, but when varying indices are used it is no different from a global memory access (it even hits the cache). "For globally scoped constant arrays, if the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory." (These notes are for AMD's GCN architecture, but similar behavior can be expected from Nvidia and Intel.)
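To make that addressing distinction concrete, here is a minimal sketch (the kernel name and the varying-index line are mine, for illustration only):
__kernel void addressing_demo(__global float *out)
{
    float acc = 0.0f;
    // Same-index access: row/col are uniform loop counters, so in any given
    // iteration every work-item reads the same filter element and the
    // hardware can broadcast it (the fast __constant path).
    for (int row = 0; row < 5; row++)
        for (int col = 0; col < 5; col++)
            acc += filter[row][col];
    // Varying index: each work-item picks its own element, so the access
    // behaves like an ordinary (cached) global memory read instead.
    acc += filter[get_local_id(0) % 5][get_global_id(0) % 5];
    out[get_global_id(0)] = acc;
}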
AMD's OpenCL spec says "L1 and L2 are enabled for images and same-indexed constants." (for HD 5800 series GPUs), so you can get similar performance using image2d_t inputs too. On GCN, L1 and L2 are faster than constant memory.
Nvidia's OpenCL best practices guide says: "threads of the same warp that read texture addresses that are close together will achieve best performance. Texture memory is also designed for streaming reads with a constant latency; that is, a cache hit reduces DRAM bandwidth demand, but not read latency. In certain addressing situations, reading device memory through image objects can be an advantageous alternative to reading device memory from global or constant memory." It also says: "They are cached, potentially exhibiting higher bandwidth if there is 2D locality in the texture fetches." (image2d_t again)
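Putting the image2d_t route together, a hedged sketch (the kernel, sampler flags, and gauss array are mine; the docs above only assert the caching behavior):
__constant float gauss[5][5] = {
    {1/256.0f,  4/256.0f,  6/256.0f,  4/256.0f, 1/256.0f},
    {4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
    {6/256.0f, 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
    {4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
    {1/256.0f,  4/256.0f,  6/256.0f,  4/256.0f, 1/256.0f}
};
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;
// Clamp-to-edge addressing lets the sampler handle the apron, so no
// explicit ghost-wall loading into local memory is needed.
__kernel void gauss5x5(__read_only image2d_t src, __write_only image2d_t dst)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 acc = (float4)(0.0f);
    for (int r = -2; r <= 2; r++)
        for (int c = -2; c <= 2; c++)
            acc += gauss[r + 2][c + 2] * read_imagef(src, smp, pos + (int2)(c, r));
    write_imagef(dst, pos, acc);
}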
You can even split filter if private memory is needed elsewhere, for example:
// defined before kernel
__constant float filter2[3][5] = {
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{6/256.0f, 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
};
// no need to write __private, automatically private in function body
float filter[2][5] = {
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f}
};
this has the same timing as the two examples above (at least on the R7 240). All examples were run on 512x512 images with 512x512 work-items in 16x16 local groups.
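For reference, the launch geometry those numbers imply on the host side (a sketch; the queue and kernel handles are assumed to already exist):
// 512*512 work-items flattened to one dimension, 16*16 = 256 per group,
// matching the kernel's use of get_local_id(0) and get_group_id(0) below.
size_t global = 512 * 512;
size_t local = 16 * 16;
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);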
__local doesn't make sense unless some of the threads are responsible for loading the filter entries
Local memory on AMD GCN is 8x as fast as constant memory for (same-indexed) accesses and has 5-20 times more capacity across the whole GPU (but possibly less per compute unit). Nvidia's OpenCL best practices guide says the same. However, the HD 5800 series AMD GPUs have much more constant memory bandwidth than local memory bandwidth. GCN is newer, so local memory seems better unless there isn't enough space.
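Regarding the quoted doubt about __local: a hedged sketch of how a workgroup could cooperatively stage the __constant filter from above into local memory (lfilter and the pointer casts are mine):
__local float lfilter[5][5];
int lid = get_local_id(0);          // flat 1D local id, as in the kernel below
if (lid < 25)                       // first 25 work-items copy one tap each
    ((__local float *)lfilter)[lid] = ((__constant float *)filter)[lid];
barrier(CLK_LOCAL_MEM_FENCE);       // now every work-item can read lfilter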
Private registers on GCN are 5-6 times faster than local memory, and their capacity is 8 times that of local memory per compute unit. So keeping something in private memory on GCN means ultimate performance, unless that resource consumption stops enough wavefronts from launching (decreasing latency hiding).
Nvidia says something similar: "Generally, accessing a register consumes zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts. The latency of read-after-write dependencies is approximately 24 cycles, but this latency is completely hidden on multiprocessors that have at least 192 active threads (that is, 6 warps)."
There was also some ghost-wall loading into local memory:
The test GPU was an R7 240, so it works with only 16x16 local threads,
so a 20x20 area is loaded from global memory.
o: each work-item's target pixel
-: needed ghost wall, because the filter goes out of bounds
x: ghost corner handled by single threads (yes, not optimized)
xx----------------xx
xx----------------xx
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
--oooooooooooooooo--
xx----------------xx
xx----------------xx
This is the kernel used for the profilings above:
__constant float filter2[3][5] = {
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f},
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{6/256.0f, 24/256.0f, 36/256.0f, 24/256.0f, 6/256.0f},
};
__kernel void test1(__global uchar4 *b2,__global uchar4 *b, __global int * p)
{
int j = get_local_id(0);
int g = get_group_id(0);
int gx=g%32;
int gy=g/32;
int lx=j%16;
int ly=j/16;
int x=gx*16+lx;
int y=gy*16+ly;
if(gx<2 || gx>29 || gy <2 || gy >29)
{
b2[((y * 512) + x)] = b[((y * 512) + x)];
return;
}
__local uchar4 localRegion[22][22];
localRegion[lx+2][ly+2]=b[((y * 512) + x)]; // interior
if(lx==0) // left edges
{
localRegion[1][ly+2]=b[(( (y) * 512) + x-1)]; // x-1 edge
localRegion[0][ly+2]=b[(( (y) * 512) + x-2)]; // x-2 edge
}
if(lx==15) // right edges
{
localRegion[18][ly+2]=b[(( (y) * 512) + x+1)]; // x+1 edge
localRegion[19][ly+2]=b[(( (y) * 512) + x+2)]; // x+2 edge
}
if(ly==0) // top edges
{
localRegion[lx+2][1]=b[(( (y-1) * 512) + x)]; // y-1 edge
localRegion[lx+2][0]=b[(( (y-2) * 512) + x)]; // y-2 edge
}
if(ly==15) // bot edges
{
localRegion[lx+2][18]=b[(( (y+1) * 512) + x)]; // y+1 edge
localRegion[lx+2][19]=b[(( (y+2) * 512) + x)]; // y+2 edge
}
if(lx==0 && ly==0) // upper-left square
{
localRegion[0][0]=b[(( (y-2) * 512) + x-2)];
localRegion[0][1]=b[(( (y-2) * 512) + x-1)];
localRegion[1][0]=b[(( (y-1) * 512) + x-2)];
localRegion[1][1]=b[(( (y-1) * 512) + x-1)];
}
if(lx==15 && ly==0) // upper-right square
{
localRegion[18][0]=b[(( (y-2) * 512) + x+1)];
localRegion[18][1]=b[(( (y-1) * 512) + x+1)];
localRegion[19][0]=b[(( (y-2) * 512) + x+2)];
localRegion[19][1]=b[(( (y-1) * 512) + x+2)];
}
if(lx==15 && ly==15) // lower-right square
{
localRegion[18][18]=b[(( (y+1) * 512) + x+1)];
localRegion[18][19]=b[(( (y+2) * 512) + x+1)];
localRegion[19][18]=b[(( (y+1) * 512) + x+2)];
localRegion[19][19]=b[(( (y+2) * 512) + x+2)];
}
if(lx==0 && ly==15) // lower-left square
{
localRegion[0][18]=b[(( (y+1) * 512) + x-2)];
localRegion[0][19]=b[(( (y+2) * 512) + x-2)];
localRegion[1][18]=b[(( (y+1) * 512) + x-1)];
localRegion[1][19]=b[(( (y+2) * 512) + x-1)];
}
barrier(CLK_LOCAL_MEM_FENCE);
float filter[2][5] = {
{4/256.0f, 16/256.0f, 24/256.0f, 16/256.0f, 4/256.0f},
{1/256.0f, 4/256.0f, 6/256.0f, 4/256.0f, 1/256.0f}
};
float4 acc=0;
for(int row=-2;row<=0;row++)
for(int col=-2;col<=2;col++)
{
uchar4 tmp=localRegion[lx+col+2][ly+row+2];
float tmp2=filter2[row+2][col+2];
acc+=((float4)(tmp2,tmp2,tmp2,tmp2)*(float4)((int)tmp.s0,(int)tmp.s1,(int)tmp.s2,(int)tmp.s3));
}
for(int row=1;row<=2;row++)
for(int col=-2;col<=2;col++)
{
uchar4 tmp=localRegion[lx+col+2][ly+row+2];
float tmp2=filter[row-1][col+2];
acc+=((float4)(tmp2,tmp2,tmp2,tmp2)*(float4)((int)tmp.s0,(int)tmp.s1,(int)tmp.s2,(int)tmp.s3));
}
b2[((y * 512) + x)] = (uchar4)(acc.x,acc.y,acc.z,244);
}
The image was 512x512 RGBA (8 bits per channel).
Source image (resized to 512x512 before filtering as a substep):
Result image:
Documents I referenced: AMD's OpenCL programming guides and Nvidia's OpenCL best practices guide (both quoted above).
Edit: if you really need the __private, __local, __constant, or image2d_t memory for something else in the kernel, you can unroll the filter loop completely, delete the filter array, and put those array elements into the unrolled instructions yourself, as sketched below (I tried it; VGPR usage dropped to 21 and SGPR usage to 16).
For reference, totally eliminating the filter calculation decreases execution time by only 0.05 milliseconds on average, while all other versions take the same, higher amount of time.
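A sketch of what the unrolled body could look like (only the first filter row shown; localRegion, lx, and ly are from the kernel above):
float4 acc = (float4)(0.0f);
acc += (1/256.0f) * convert_float4(localRegion[lx+0][ly+0]);
acc += (4/256.0f) * convert_float4(localRegion[lx+1][ly+0]);
acc += (6/256.0f) * convert_float4(localRegion[lx+2][ly+0]);
acc += (4/256.0f) * convert_float4(localRegion[lx+3][ly+0]);
acc += (1/256.0f) * convert_float4(localRegion[lx+4][ly+0]);
// ...the remaining 20 taps follow the same pattern, with the 4/256,
// 6/256, 16/256, 24/256 and 36/256 weights baked in as literals.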
Upvotes: 8