Reputation: 2860
I want to do some custom rendering in OpenCL. One thing I need is a depth buffer. I have been searching quite a bit, but many solutions don't work anymore due to compiler optimizations.
Here is the basic kernel that draws points. If two points are at the same position, I only want to draw the one closer to the camera.
__kernel void render_points(__global const uint4 *points, __global float *zbuffer, __global uint *img)
{
    int i = get_global_id(0);
    uint4 point = points[i];
    int pos = point.y * WIDTH + point.x;
    if (point.z < zbuffer[pos]) {
        zbuffer[pos] = point.z;
        img[pos] = point.w;
    }
}
This simple approach, however, does not work: due to parallelism and buffered write-backs, the zbuffer is not updated immediately for all threads.
I am working with OpenCL 1.2, including the 32-bit atomic extensions.
How can I implement a depth buffer?
Upvotes: 0
Views: 396
Reputation: 11920
A very long comment:
There is no atomicity support for user functions.
For single integer values, there is an atomic_min function, which writes the smaller of two values (one at the target address, one given as a parameter) to the target address:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/atomic_min.html
int atomic_min (volatile __local int *p, int val)
If you multiply your float-type depth values by 1000 or 1000000, you could use this integer-based function to get pixel values that are clamped to the closest polygons behind them. But atomics are slow on global memory access. Maybe not so slow if each pixel is not accessed by many polygons, but there would still be unordered access to main memory, which is bad for the GPU memory controller because the probability of serving multiple data items per access is low. To partially fix this, you could sort the OpenCL work-items (by their thread id versus their z-index (not the zbuffer but the 2D screen position)) so that "maybe" the compiler or hardware "joins" (I don't know if that is possible) multiple parallel (and independent, contiguous like 1,2,3,4) atomics together for read/write.
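For illustration, a minimal sketch of that scaled-integer idea (not the poster's exact code): it assumes the depth buffer is a __global int buffer cleared to INT_MAX each frame, that depths are non-negative floats small enough not to overflow after scaling, and that DEPTH_SCALE is a made-up constant.
#define DEPTH_SCALE 1000000.0f

inline void depth_test_draw(const int pos, const float depth, const uint color,
                            volatile __global int *zbuffer, __global uint *img)
{
    const int z = (int)(depth * DEPTH_SCALE);     // quantize the float depth to an int
    const int old = atomic_min(&zbuffer[pos], z); // atomically keep the smaller depth, returns the previous value
    if (z < old) {
        // note: this color write is not atomic together with the depth update,
        // so a race on img remains; the 64-bit packed variant below avoids it
        img[pos] = color;
    }
}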
But, since you want to swap two 32-bit values rather than just the depth, you should work on a 64-bit integer whose most significant half is the depth (so the atomic min is applied to it) and whose other half (img) is swapped along with it automatically. You need "64-bit atomics" for this (OpenCL 2.0, or the cl_khr_int64_extended_atomics extension on 1.2):
https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atom_min.html
long atom_min (volatile __local long *p, long val)
As long as you keep the first (most significant) half of this value as the depth, it should always do the "min" operation on the depth, but for two exactly equal depth values it would also do the switch on the "img" half. This would cause one of the red, blue or green values to always end up on top in the zbuffer when two z values are exactly the same. At least this could solve the flickering issue, if you are not using a depth bias already.
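A minimal sketch of that packed approach, assuming the device exposes the cl_khr_int64_extended_atomics extension and that the host clears the combined buffer to ULONG_MAX before each frame; an unsigned 64-bit type is used here so large depth values cannot flip the sign bit and break the ordering, and WIDTH is the same compile-time constant as in the question.
#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable

__kernel void render_points_packed(__global const uint4 *points,
                                   volatile __global ulong *zbuf_img)
{
    const int i = get_global_id(0);
    const uint4 point = points[i];
    const int pos = point.y * WIDTH + point.x;

    // depth in the upper 32 bits, color in the lower 32 bits:
    // atom_min then orders by depth and carries the color along atomically
    const ulong packed = ((ulong)point.z << 32) | (ulong)point.w;
    atom_min(&zbuf_img[pos], packed);
}
The color for each pixel can afterwards be unpacked as (uint)(zbuf_img[pos] & 0xFFFFFFFF), and the depth as (uint)(zbuf_img[pos] >> 32).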
Atomics can be really slow if thousands of polygons are behind the same pixel. The best case would be one pixel per polygon (or whatever point it renders), and only with the z-order transformation of the work indices. Sorting img on depth should make it a "stable" algorithm performance-wise, where you would see the same performance regardless of the scene topology.
But, if you somehow divide the screen into 256 squares and give each square an OpenCL work-group (which then works on local memory, not global), they would use atomic functions fast. Some architectures (like Nvidia Pascal) have really good local-atomics performance. They may not even need atomics but some kind of wavefront synchronization to achieve high/acceptable performance. Can you divide the scene into squares and give each square its own list of polygons? It looks like a histogram problem (binning polygons into squares), but perhaps you can find a better solution. If I were using a GPU or many cores, I would try a divide-and-conquer approach when implementing something that used to be done in hardware.
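To make the structure of that tiled approach concrete, here is a very rough sketch: one work-group per 16x16 pixel tile, with the depth test done in __local memory. The tile_points / tile_counts buffers, the binning itself, and the int depth format are all assumptions for illustration, and the same color-write race as above is left unresolved here.
#define TILE_W 16
#define TILE_H 16

__kernel void render_tile(__global const uint4 *tile_points, // points already binned per tile (assumption)
                          __global const uint  *tile_counts, // number of points in each tile (assumption)
                          const uint max_points_per_tile,
                          __global int *zbuffer, __global uint *img)
{
    const uint tile = get_group_id(0);
    const uint lid  = get_local_id(0);             // work-group size = TILE_W*TILE_H

    __local int  local_z[TILE_W * TILE_H];
    __local uint local_c[TILE_W * TILE_H];
    local_z[lid] = INT_MAX;                        // clear this tile's depth buffer
    local_c[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // each work-item walks a strided subset of this tile's points
    const uint count = tile_counts[tile];
    for (uint p = lid; p < count; p += TILE_W * TILE_H) {
        const uint4 pt = tile_points[tile * max_points_per_tile + p];
        const uint local_pos = (pt.y % TILE_H) * TILE_W + (pt.x % TILE_W);
        const int old = atomic_min(&local_z[local_pos], (int)pt.z); // fast local atomic
        if ((int)pt.z < old) local_c[local_pos] = pt.w;             // still racy on color, as above
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // write the finished tile back to the global buffers (assumes WIDTH is a multiple of TILE_W)
    const uint tiles_x = WIDTH / TILE_W;
    const uint gx = (tile % tiles_x) * TILE_W + (lid % TILE_W);
    const uint gy = (tile / tiles_x) * TILE_H + (lid / TILE_W);
    zbuffer[gy * WIDTH + gx] = local_z[lid];
    if (local_z[lid] != INT_MAX) img[gy * WIDTH + gx] = local_c[lid];
}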
Upvotes: 1
Reputation: 5746
I suspect the issue is not parallelism and atomics, but rather an unfavorable data type.
The depth buffer should be of a floating-point format, either float (32-bit) or half (16-bit). In your 3D-to-2D conversion algorithm, the resulting x- and y-positions should be of type int or short (uchar would limit the resolution to 256x256), while the resulting z-position should be floating-point. This way, when two points have similar z-positions of, let's say, 3.4 and 3.2, the closer one can still be drawn properly, while if you used an integer data type both points would have a depth of 3 and the one drawn last would dictate the pixel color.
Also for the img buffer I would suggest the data type uint in order to use full 32-bit colors.
The rest of your algorithm should work apart from your data types. Below is my implementation of an OpenCL function which draws one pixel and checks against the z-buffer. For performance reasons, I use the data type half for the z-buffer.
void __attribute__((always_inline)) draw(const int x, const int y, const float z, const uint color, global uint* bitmap, global half* zbuffer) {
    if(x<0||x>=2*def_sceen_w||y<0||y>=2*def_sceen_h) return; // cancel drawing if point is off screen
    const int index = y*2*def_sceen_w+x;
    if(z<=vload_half(index,zbuffer)) return; // cancel drawing if point is behind zbuffer (with this comparison, a larger z value counts as closer)
    vstore_half_rte(z,index,zbuffer); // store the new depth as a 16-bit half
    bitmap[index] = color;
}
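For context, a hypothetical kernel that calls this helper could look like the following; the point layout (screen-space float4 positions plus a separate color buffer) is an assumption for illustration, using the data types suggested above.
__kernel void render_points(const global float4 *points, const global uint *colors, global uint *bitmap, global half *zbuffer) {
    const uint n = get_global_id(0);
    const float4 p = points[n]; // p.x and p.y are already-projected screen positions, p.z is the float depth
    draw((int)p.x, (int)p.y, p.z, colors[n], bitmap, zbuffer);
}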
Upvotes: 1